`sort` for Testers
Sometimes you're using `sort`. Sometimes you're testing it.
Machines love to iterate: if they expect to be walking through sorted data, they can be thrown by something out of place.
`sort` as a tool
`sort` sorts stuff, which seems handy on its own – but organising information opens further doors.
- comparing two sets of data is relatively trivial if they're sorted (in the same way), painful (and slow) if not.
- finding unique values in data is trivial if the data is sorted, harder if not.
- unique values can be aggregated into counts, sums and statistics
- knowing the unique values lets one look at boundaries, and see outliers
- putting ranges in order lets one see where change is smooth, and where it is lumpy
- sorting can reveal ways that values across several fields go together.
- It's easy to pick out not only the smallest and largest, but the several smallest or largest.
- One can prioritise processing by oldest or newest, furthest or closest, largest or smallest, or by any expressible value. You might lump together similar things, or choose to work in a way that distributes them throughout a job.
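A few of the points above can be sketched in the shell. This is a minimal illustration, not a recipe: the fruit names, status codes and numbers are made-up data, and `LC_ALL=C` is set only so that `sort` and `comm` agree on a single ordering.

```shell
# Hypothetical data: two unsorted sets of names.
tmp=$(mktemp -d)
printf '%s\n' cherry apple banana | LC_ALL=C sort > "$tmp/expected"
printf '%s\n' banana date apple   | LC_ALL=C sort > "$tmp/actual"

# comm compares two files line by line, and needs both sorted the same way.
only_expected=$(LC_ALL=C comm -23 "$tmp/expected" "$tmp/actual")  # lines only in expected
only_actual=$(LC_ALL=C comm -13 "$tmp/expected" "$tmp/actual")    # lines only in actual

# uniq only collapses adjacent duplicates, so sort first; -c adds a count.
counts=$(printf '%s\n' 200 404 200 500 200 404 |
  sort | uniq -c | sort -rn | awk '{print $2 "=" $1}')

# The several largest, not just the largest.
top_two=$(printf '%s\n' 7 42 3 19 | sort -n | tail -n 2)

rm -rf "$tmp"
echo "missing: $only_expected"
echo "extra:   $only_actual"
echo "$counts"
echo "$top_two"
```

Each step leans on the one before it: `comm` and `uniq` are only simple because `sort` has already put like next to like.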
`sort` as a target
As a tester, you hardly ever test whether a sort sorts stuff; the algorithms are well understood and typically not implemented in code that one can change. Indeed, they're often deep enough that testers hardly think of them – imagine file search, databases or screen rendering.
You certainly need to test the choice of sort and whether the data suits it. You'll test the performance of sort, looking for inflexions that indicate system constraints or novel problems with data distribution. You'll test the knock-on effects of `sort`.
Sort organises stuff: organised stuff lets you analyse and aggregate.
Got a directory listing and want the most-recent? `sort -k6M -k7n`. Want the largest? `sort -k5hr`.
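Here is a sketch of those two commands against a canned listing. It assumes the column layout of a typical `ls -lh` (size in field 5, month in field 6, day in field 7); the file names and owners are invented, and `-h` is an option of GNU and BSD `sort` that may be missing elsewhere.

```shell
# A hypothetical 'ls -lh' listing, held in a variable so the example is repeatable.
listing=$(cat <<'EOF'
-rw-r--r-- 1 tester staff 4.0K Mar  3 09:12 march.log
-rw-r--r-- 1 tester staff 1.2M Jan 15 17:40 january.log
-rw-r--r-- 1 tester staff 512 Mar 21 08:01 late-march.log
-rw-r--r-- 1 tester staff 88K Feb  9 23:59 february.log
EOF
)

# -k6M: compare field 6 as a month name; -k7n: then field 7 as a number.
# The most recent entry ends up last.
newest=$(printf '%s\n' "$listing" | LC_ALL=C sort -k6M -k7n | tail -n 1 | awk '{print $NF}')

# -k5hr: compare field 5 as a human-readable size (K, M, G), largest first.
largest=$(printf '%s\n' "$listing" | LC_ALL=C sort -k5hr | head -n 1 | awk '{print $NF}')

echo "newest:  $newest"
echo "largest: $largest"
```

These keys look only at month and day, so a listing that spans more than one year would need the year handled as well.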
Got several logs to look through?
Want to see the unique `accountID`s?
You can sort to see the unique values.
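As a rough sketch, assuming log lines that start with a sortable timestamp and carry an `accountID=...` field (the log format, file names and IDs here are all hypothetical):

```shell
tmp=$(mktemp -d)
cat > "$tmp/app1.log" <<'EOF'
2024-01-05 10:00:01 accountID=A1003 action=login
2024-01-05 10:00:07 accountID=A1001 action=view
2024-01-05 10:01:30 accountID=A1003 action=logout
EOF
cat > "$tmp/app2.log" <<'EOF'
2024-01-05 10:00:03 accountID=A1002 action=login
2024-01-05 10:02:11 accountID=A1001 action=pay
EOF

# Several logs become one timeline, because the timestamps sort as plain text.
merged=$(LC_ALL=C sort "$tmp"/*.log)
first_event=$(printf '%s\n' "$merged" | head -n 1)

# Pull out the IDs, then sort -u leaves one line per unique value.
unique_ids=$(grep -oh 'accountID=[A-Za-z0-9]*' "$tmp"/*.log | cut -d= -f2 | sort -u)

# The same IDs with a count of how often each appears.
id_counts=$(grep -oh 'accountID=[A-Za-z0-9]*' "$tmp"/*.log | cut -d= -f2 |
  sort | uniq -c | awk '{print $2 "=" $1}')

rm -rf "$tmp"
echo "$first_event"
echo "$unique_ids"
echo "$id_counts"
```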
Handy to know
Sort doesn't necessarily expect the same things as you. It has options to help cope.
So in `sort -k6M -k7n` above,
Weirdnesses
Unix `sort` defaults to character sort – give it numbers 1-100, and `10` and `100` will follow `1`, and `11` will be next. We've all seen this in address sorting. I saw a phone company bill calls wrongly because they pulled batches of files off the top of a list. That was fine when they had a few files, but when the switches wrote double- or triple-digit names, the picker left the older files in place – some until they were so old that the rater rejected the records.
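A small demonstration of the difference, using a handful of numbers rather than the full 1-100. The `batch` file names are hypothetical stand-ins for the switch files in the story, not the real names.

```shell
numbers=$(printf '%s\n' 1 2 3 10 11 20 100)

# Default: compared character by character, so 10 and 100 land straight after 1.
by_char=$(printf '%s\n' "$numbers" | LC_ALL=C sort | paste -sd' ' -)

# -n: compared as numbers.
by_number=$(printf '%s\n' "$numbers" | LC_ALL=C sort -n | paste -sd' ' -)

# Picking "the first two" from a character-sorted list skips batch2.
picked=$(printf '%s\n' batch1 batch2 batch10 batch11 | LC_ALL=C sort | head -n 2 | paste -sd' ' -)

echo "character order: $by_char"
echo "numeric order:   $by_number"
echo "picked:          $picked"
```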
Where two things are sorted by the same method, they can be compared – but that doesn't mean that the sorted data is in the right order for use. I've seen an accounting system happily and correctly match two collections for years. When one of the collections needed to be used for something else, it was clear that it was in the wrong order – and was unusable in its
Sort doesn't sort in the order you expect. Apparently, if the environment has `LANG` set to `en_US.UTF-8`, "... sort appears broken because case is folded and punctuation is ignored because ‘en_US.UTF-8’ specifies this behavior".
Demonstration to follow.
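One way to see the effect for yourself, as a sketch: sort the same four letters under the `C` locale and under `en_US.UTF-8`. This assumes the `en_US.UTF-8` locale is installed; where it isn't, `sort` quietly falls back to byte order and both lines will look the same.

```shell
letters=$(printf '%s\n' b A a B)

# C locale: plain byte order, so every capital comes before every lower-case letter.
c_order=$(printf '%s\n' "$letters" | LC_ALL=C sort | paste -sd' ' -)

# en_US.UTF-8: the locale's collation rules apply, if the locale is available.
# 'env' is used so that the variable is set for sort alone.
locale_order=$(printf '%s\n' "$letters" | env LC_ALL=en_US.UTF-8 sort | paste -sd' ' -)

echo "C:           $c_order"
echo "en_US.UTF-8: $locale_order"
```

If the output of a sort feeds another tool, pinning the locale with `LC_ALL=C` on both sides is a cheap way to make sure they agree.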
`sort -R` doesn't sort in random order, but by the hash of the line. It's deterministic, but doesn't repeat, and I don't understand randomness clearly enough to demonstrate. I understand that you should use `shuf` instead – if you've got it.
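A tentative sketch of the visible difference, assuming GNU `sort` and `shuf` are installed: because `sort -R` orders by a hash, identical lines get the same hash and stay next to each other, whereas `shuf` treats every line separately. It says nothing about how good the randomness is.

```shell
lines=$(printf '%s\n' a b a c b a)

# sort -R: the order of the groups changes between runs, but equal lines stay together.
by_hash=$(printf '%s\n' "$lines" | sort -R)

# Count the runs of identical adjacent lines: three distinct values give three runs.
runs=$(printf '%s\n' "$by_hash" | uniq | wc -l | tr -d ' ')

# shuf: a permutation of all six lines, duplicates scattered wherever they fall.
shuffled=$(printf '%s\n' "$lines" | shuf)

# Either way nothing is lost: sorting the shuffled lines gives back the same multiset.
restored=$(printf '%s\n' "$shuffled" | LC_ALL=C sort | paste -sd' ' -)

echo "$by_hash" | paste -sd' ' -
echo "runs of identical lines: $runs"
echo "$shuffled" | paste -sd' ' -
```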