lsort , a built-in Tcl command, sorts elements in a list.
lsort returns a new list in sorted order. It uses a the merge-sort algorithm, which is a stable sort that has O(n log n) performance characteristics. By default, sorting is done in ASCII ascending order. The following options may be specified to control the sorting process:
lsort -integer -index 1 {{First 24} {Second 18} {Third 30}}
returns {Second 18} {First 24} {Third 30}. This option is much more efficient than using -command to achieve the same effect. (From: TclHelp)
RS: WARNING: lsort -dictionary can not be used for sensibly sorting signed integers. It returns first the negatives, then zero and the positives, but both in ascending absolute value order:
% lsort -dic {-5 -10 -1 0 3 1 2} -1 -5 -10 0 1 2 3
So for a numeric maximum function, use the numeric qualifiers -integer or -real (which does decimal integers as well, just might take some longer).
lsort -real cannot sort non-decimal (i.e. hex or octal) integers, and lsort -integer cannot sort any number containing a decimal point or exponent (which cannot, in any case, be mixed with a hex specifier.)
willdye: Perhaps this behavior comes from trying to accomodate strings which use dashes as a general separator. For example, here's a list of dates in the widely-used ISO8601 (YYYY-MM-DD) format:
% lsort -dictionary {2005-01-01 2005-01-02 2005-01-03 2005-01-04} 2005-01-01 2005-01-02 2005-01-03 2005-01-04
(See Parsing ISO8601 dates and times if you want details about the format). I agree with RS, however, that the list -5 -10 -1 0 3 1 2 should ideally sort to -10 -5 -1 0 1 2 3. Fortunately, the behavior is easy to work around.
KBK 2001-02-14:
One thing to remember when using lsort is that the performance of lsort -command is pretty awful. If you can pre-convert your list into something that can be sorted with another form of lsort, it's almost always faster to do so.
Example: Over in Things British, RS posted code to sort a list of surnames according to the rules for a British telephone directory. The code worked something like:
proc comp.uk.phone {x y} { regsub {Ma?c ?([A-Z])} $x {M_\1} x regsub {Ma?c ?([A-Z])} $y {M_\1} y string compare $x $y } proc sort.uk.phone { list } { return [lsort -command comp.uk.phone $list] }
It's a good bit faster to sort the list by making a list of sort key and value, sorting this list with lsort -index, and then extracting just the values, thus:
proc sort.uk.phone2 { list } { foreach name $list { regsub {Ma?c ?([A-Z])} $name {M_\1} key lappend list2 [list $key $name] } foreach pair [lsort -index 0 -ascii $list2] { lappend list3 [lindex $pair 1] } return $list3 }
OK, let's run the two methods head to head:
set list {MacDonald McArthur McEwan Lyttle Mabbs Jones} set l 6 puts " Length | Method 1 | Method 2 | Ratio (m1 / m2)" puts " -------+----------+----------+----------------" while {$l <= 25000} { if {$l > 100} { set count 1 } else { set count 100 } set m1 [lindex [time {sort.uk.phone $list} $count] 0] set m2 [lindex [time {sort.uk.phone2 $list} $count] 0] set ratio [expr {double($m1) / double($m2)}] puts [format " %6d | %8d | %8d | %9.2f" $l $m1 $m2 $ratio] incr l $l lappend list {*}$list }
The results are shown below:
Length | Method 1 | Method 2 | Ratio (m1 / m2) -------+----------+----------+---------------- 6 | 1410 | 500 | 2.82 12 | 4200 | 1000 | 4.20 24 | 10820 | 1900 | 5.69 48 | 27140 | 3710 | 7.32 96 | 65890 | 7310 | 9.01 192 | 160000 | 10000 | 16.00 384 | 361000 | 30000 | 12.03 768 | 771000 | 60000 | 12.85 1536 | 1783000 | 120000 | 14.86 3072 | 3805000 | 251000 | 15.16 6144 | 8352000 | 571000 | 14.63 12288 | 18166000 | 1061000 | 17.12 24576 | 39077000 | 2153000 | 18.15
DKF This technique is (I believe) known as creating a collation key. Also note that this comparison is not entirely fair, as it is performing additional regsub'''s in the -command case, though it is easy to see from a comparison of lsort $someList with lsort -command {string compare} $someList that it is still painful to use -command when you don't have to.
- See Custom sorting on a wrapper for collation key sorting.
How does one sort a multi-index? By that is meant something like, "sort by last name, then first name, then age". lsort doesn't offer direct support for multi-indexes. The straightforward approach, therefore, is to use -command. However, Bruce Hartweg wisely observes that, as lsort sorts stably, an efficient multi-index sort is achieved by cascading lsort applications--least significant first, of course. See Additional list functions for multisort which packages such cascaded sorts into one call.
In the same vein, Andreas Kupries wrote on 2001-03-21 in comp.lang.tcl:
The lsort command uses a 'stable' mergesort. This means that it is possible to sort something after more than one key and the later sorts will not destroy the order of the previous sorts/keys. Quicksort for example is not stable (DKF: it also has some rotten worst-case behaviour, unlike mergesort.)
So, just sort after the secondary key first and then after the primary key. like so:
set f {{12 11} {12 13} {12 12} {11 14} {13 12}} puts [lsort -index 1 $f] puts [lsort -index 0 [lsort -index 1 $f]]
{12 11} {12 12} {13 12} {12 13} {11 14} {11 14} {12 11} {12 12} {12 13} {13 12}
DKF: If you are doing this on a lot of data, it is far more efficient to convert the keys into an additional collation key (as described above in relation to performance-measuring) and sort on that, as this adds two O(n) (i.e. linear) operations to generate and strip the collation keys and removes an O(n log n) sort.
KBK: Custom sorting has a worked example of doing what DKF recommends above.
JRW: Both approaches are asymptotically equivalent. If you are trying to sort records on k fields, then either you do one sort where each sort comparison involves comparing k fields, or you do k (stable) sort passes where each comparison involves one field. Either way, k n log n comparisons. (KBK But the constant factor for lsort -command is horrible, see above. Do everything that you can to avoid using lsort -command, and you will be happier in the long run.)
RS:
One thing lsort -command is good for is to introspect how lsort works. Consider this silly sort function:
proc foo {a b} {puts [list $a $b]; return 0}
which constantly says its inputs are equal, but logs them to stdout:
lsort -command foo {1 2 3 4 5 6 7 8 9 10}
1 2 3 4 1 3 2 3 5 6 7 8 5 7 6 7 1 5 2 5 3 5 4 5 9 10 1 9 2 9 3 9 4 9 5 9 6 9 7 9 8 9
Makes it pretty clear how a merge sort works, doesn't it?
FW: Why is -command so slow, anyway?
NEM 2009-03-22: I profiled this at one point. The overhead comes partly from calling a command: looking it up, checking for any command/execution traces, etc. You can eliminate some of this, but not all. Another big killer is simply having to pack/unpack the comparison result (-1,0,1) into a Tcl_Obj. That puts some allocs/frees into the critical loop. Making this run fast would probably require some heavy-duty re-engineering of large parts of how Tcl deals with values and commands.
bll 2016-1-10: See collation for information on how to sort in the locale's collation sequence.
LV 2003-06-02:
Does lsort support l10n considerations - such as sorting \u0063 equivalent to \u00e7 - at least in a French locale?
RS: No, sorry. One would have to introduce custom sorting explicitly depending on requirements (see e.g. Things British for British phone-book sorting).
HolgerJ 2003-10-13: I'm still amazed by the fact that lsort -dictionary does not sort by Unicode dictionary order. I know that the algorithm is quite complex and language dependent (one would have to read the environment to find out the language), but at least the most common things could be considered - like Umlauts and French accents. lsort -dictionary should be renamed lsort -ignorecase or the like.
So, if you want to sort according to dictionary rules, you have to supply your own strcmp-like function or re-implement the whole lsort command. The latter might be faster, but not so convenient.
RS Is there a Unicode dictionary order, except for the numeric codes (which lsort does)? As far as I know, different languages have different collation orders, e.g. "Ä" might be sorted just as A, or as AE, in German, while in Swedish it sorts after Å... which sorts after Z...
HolgerJ 2003-10-19: No, see above. "Ä" does principally sort like "A" in German, but the trema (dots) are used as a tiebreaker. This is called secondary sorting order. For details see http://www.unicode.org/unicode/reports/tr10/ Quotation: 'This is worth emphasizing: the sorting weight of characters is not provided by their position in the Unicode code charts. ' Hopefully, this will be implemented in C the Tcl core. In Tcl it would bee too slow, I think.
The following idea of Lars H. looks fine, but lacks support for the secondary (and tertiary) order.
Lars H: Whereas I agree the Unicode support in Tcl is incomplete, I don't agree on the particular example (dated 2003-10-13 by Holger J above). The Unicode Standard [L1 ] is littered with remarks that issues such as this should not be handled by Unicode as such, but by a higher level mechanism. One could have an option to lsort that specifies a command that should be applied to the sort keys before they are compared, or perhaps a string map style list to normalize the characters, but the Tcl core has no business implementing language-specific sorting as such. Also, -dictionary does far more than the old (Tcl7) -ignore option of lsort did (e.g., it places a2 before a10).
Currently, this is what you should do to get language-specific sorting
set temp_list [list] foreach item $list_to_sort { lappend temp_list [list $item [string map $collateList $item]] # $collateList is what handles language-dependence } set sorted_list [list] foreach item [lsort -dictionary -index 1 $temp_list] { lappend sorted_list [lindex $item 0] }
$sorted_list is now sorted in the custom order. For Swedish, the collateList should probably be something on the lines of
set collateList {å \x80 Å \x80 ä \x81 Ä \x81 ö \x82 Ö \x82 W V w v é e É E}
I need to do something special for ÅÄÖåäö, because the native order in latin-1 (and hence in Unicode) is wrong for Swedish.
What I miss most in Tcl's Unicode support is some way of normalizing strings (at least to some of the standard normalisation forms defined in the standard and its annexes). decomposed and precomposed subcommands of string appear to me as the most obvious way to do this.
LV: If someone is looking for a project, research http://www.unicode.org/reports/tr10/tr10-10.html (or whatever variation is the latest of the document) - this is a technical note by the Unicode folks on the fun of collation of unicode strings. A Tcl extension providing an interface to the algorithm would be a wonderful contribution.
AMG: If you want to modify a sorted list in-place and keep the result sorted, check out Sorted Lists for alternatives to repeatedly calling lsort.
KBK 2005-03-29 in the Tcl chatroom: Since Tcl's merge exploits natural runs, you can also merge two lists without an extra merge command to do it:
lsort [concat $list1 $list2]
works in O(n) time if $list1 and $list2 are already sorted.
MG 2005-07-29: I have a list, where each element is itself a two-element list, and I want to sort based on both elements, one after the other. It seems that the -index option to lsort can only take a single index, though - I'd been thinking of trying something like this
% lsort -index {1 0} [list {f 5} {a 5} {b 3}] {b 3} {a 5} {f 5}
but it raises an error. Should I just use a custom proc and the -command option, or is there a better way that I'm not aware of? Thanks in advance for your help :)
Peter Newman 2005-07-29: Search for multisort above in this page and in Additional list functions.
Note that before Tcl 8.5, some argument combinations were not correct, but were not reported as an error. For example
lsort -integer numlist
which is a mistake in a couple of ways, generates an error now where in the past the fact that no integer arguments were provided was not raised as an error.
AMG: Here's how to sort a list by ascending string length. It demonstrates the use of [apply] for lambdas.
set data {this is tcl} lsort -command {apply {{a b} {expr {[string length $a] - [string length $b]}}}} $data # returns "is tcl this"
AMG: The Tcl 8.6b2 man page for lsort doesn't clearly document what exactly happens when -unique and -indices are used together. Here's how it works: the index of the last duplicate element is returned. The man page does make mention of this behavior in connection to custom comparisons (the example used is -index 0), but you have to extrapolate from there to guess that lsort returns the index of the last duplicate when -indices is used. When neither a custom comparison nor -indices are used, it completely does not matter which duplicate is returned.
AMG: To check if a list contains duplicates, try:
if {[llength $list] != [llength [lsort -unique $list]]} { puts "list contains duplicates" }
The number of duplicates is equal to the difference in list lengths, since [lsort -unique] removes duplicates.
Various [lsort] switches can be used to define equality. For example, -integer would cause "0" and "-0" to compare equal and therefore be considered as duplicates.
Actually getting a list of duplicate keys isn't so straightforward. [lsort] doesn't say what it's removing, only what it's not, so you have to subtract its result from its input. That then gives the list of duplicates. However, what if a key appears more than two times? It'll show up multiple times in the duplicate list. Cleaning that up takes another call to [lsort -unique].
Give this code a try. The idea is it returns a list containing the first (lowest-indexed) instance of each duplicate word, where equality is defined according to the usual [lsort] switches. Just don't pass it -indices, -stride, or -unique. More work is needed to support -indices and -stride.
proc duplicates {args} { set list [lsort [lindex $args end]] foreach word [lsort -unique {*}$args] { if {[set match [lsearch -all -sorted $list $word]] ne {}} { set list [lreplace $list [lindex $match end] [lindex $match end]] } } lsort -unique {*}[lrange $args 0 end-1] $list }
If you think this can be done more simply, don't forget about all the comparison switches that [lsort] has.
Because [lsort] doesn't have enough switches already (that's a joke), perhaps a -duplicate switch could be added to keep only the duplicate words. In combination with -unique, only the first instance of each word would be reported. I don't think this new switch would conflict with any of the existing switches.