lsort

Difference between version 71 and 72 - Previous - Next
'''[http://www.tcl.tk/man/tcl/TclCmd/lsort.htm%|%lsort]''', a [Tcl Commands%|%built-in] [Tcl] [command], sorts elements in a list.



** Synopsis **

    :   '''lsort''' ?''options...''? ''list''



** Documentation **

   [http://www.tcl.tk/man/tcl/TclCmd/lsort.htm%|%official reference]:   

   [TIP] [http://www.tcl.tk/cgi-bin/tct/tip/326.html%|%326], Add `-stride` Option to lsort:   New in version [Changes in Tcl/Tk 8.6%|%8.6]. 

   [TIP] [http://www.tcl.tk/cgi-bin/tct/tip/241%|%241], Case-Insensitive Switches and List Searching and Sorting:   New in version [Changes in Tcl/Tk 8.5%|%8.5]

   [TIP] [http://www.tcl.tk/cgi-bin/tct/tip/217.html%|%217], Getting Sorted Indices out of Lsort:   New in version [Changes in Tcl/Tk 8.5%|%8.5].



** Description **


'''lsort''' returns a new list in sorted order.  It uses a the merge-sort
algorithm, which is a stable sort that has O(n log n) performance
characteristics.  By default, sorting is done in [ASCII] ascending order.
The following options may be specified to control the sorting process: 

   '''`-ascii`''':   Use string comparison with [ASCII] collation order.  This is the default. 

   '''`-command`''' ''command'':   Use ''command'' as a comparison command. To compare two elements, evaluate a Tcl script consisting of command with the two elements appended as additional arguments.  The script should return a 32-bit signed integer less than, equal to, or greater than zero if the first element is to be considered less than, equal to, or greater than the second, respectively. 

   '''`-decreasing`''':   Sort the list in decreasing order ("largest" items first). 
   '''`-dictionary`''':   Use dictionary-style comparison.  This is the same as `-ascii` except (a) case is ignored except as a tie-breaker and (b) if two strings contain embedded numbers, the numbers compare as integers, not characters.  For example, in `-dictionary` mode, bigBoy sorts between `bigbang` and `bigboy`, and `x10y` sorts between `x9y` and `x11y`.  Only decimal numbers are recognized, not hexadecimal or octal. 

   '''`-increasing`''':   Sort the list in increasing order ("smallest" items first). This is the default. 

   '''`-index`''' ''index'':   If this option is specified, each of the elements of list must itself be a well-formed list.  Instead of sorting based on whole sublists, lsort extracts the index'th element from each sublist and sorts based on the that element.  The keyword `end` is allowed for the index to sort on the last sublist element. For example, 

   '''`-integer`''':   Convert list elements to integers and use integer comparison. 

   '''`-real`''':   Convert list elements to floating-point values and use floating comparison. 


======
lsort -integer -index 1 {{First 24} {Second 18} {Third 30}}
======

returns `{Second 18} {First 24} {Third 30}`. This option is much more efficient than using -command to achieve the same effect.  (From: [TclHelp])

   '''`-stride`''' ''groupcount'':   interpret the list groups of lists, each list having ''groupcount'' elements.   This saves the trouble of constructing such a list from a flat list when such semantics are needed.

   '''`-unique`''':   If this option is specified, then only the last set of duplicate elements found in the list will be retained. Note that duplicates are determined relative to the comparison used in the sort. Thus if `-index 0` is used, `{1 a}` and `{1 b}` would be considered duplicates and only the second element, `{1 b}`, would be retained. (From: [TclHelp])



**lsort -dictionary warning**

[RS]: WARNING: `lsort -dictionary` can not be used for sensibly sorting signed integers. It returns first the negatives, then zero and the positives, but '''both in ascending absolute value order''': 

======none
% lsort -dic {-5 -10  -1 0 3 1 2}
-1 -5 -10 0 1 2 3
======

So for a numeric maximum function, use the numeric qualifiers `-integer` or `-real` (which does ''decimal'' integers as well, just might take some longer).

`lsort -real` cannot sort non-decimal (i.e. hex or [Tcl and octal numbers%|%octal]) integers, and `lsort -integer` cannot sort any number containing a decimal point or exponent (which cannot, in any case, be mixed with a hex specifier.)

[willdye]: Perhaps this behavior comes from trying to accomodate strings which use dashes as a general separator.  
For example, here's a list of dates in the widely-used [ISO8601] (YYYY-MM-DD) format:

======none
% lsort -dictionary {2005-01-01 2005-01-02 2005-01-03 2005-01-04}
2005-01-01 2005-01-02 2005-01-03 2005-01-04
======

(See [Parsing ISO8601 dates and times] if you want details about the format).
I agree with [RS], however, that the list `-5 -10 -1 0 3 1 2` should ideally sort to `-10 -5 -1 0 1 2 3`.
Fortunately, the behavior is easy to work around.



** lsort performance **

[KBK] 2001-02-14: 

One thing to remember when using `lsort` is that the performance
of `[lsort] -command` is pretty awful.  If you can pre-convert your
list into something that can be sorted with another form of `lsort`,
it's almost always faster to do so.

Example:  Over in [Things British], [RS] posted
code to sort a list of surnames according to the rules for a
British telephone directory.  The code worked something like:

======
proc comp.uk.phone {x y} {
    regsub {Ma?c ?([A-Z])} $x {M_\1} x
    regsub {Ma?c ?([A-Z])} $y {M_\1} y
    string compare $x $y
}

proc sort.uk.phone { list } {
    return [lsort -command comp.uk.phone $list]
}
======

It's a good bit faster to sort the list by making a list
of sort key and value, sorting this list with `lsort -index`, and
then extracting just the values, thus:

======
proc sort.uk.phone2 { list } {
    foreach name $list {
        regsub {Ma?c ?([A-Z])} $name {M_\1} key
        lappend list2 [list $key $name]
    }
    foreach pair [lsort -index 0 -ascii $list2] {
        lappend list3 [lindex $pair 1]
    }
    return $list3
}
======

OK, let's run the two methods head to head:

======
set list {MacDonald McArthur McEwan Lyttle Mabbs Jones}
set l 6
puts " Length | Method 1 | Method 2 | Ratio (m1 / m2)"
puts " -------+----------+----------+----------------"
while {$l <= 25000} {
    if {$l > 100} {
        set count 1
    } else {
        set count 100
    }
    set m1 [lindex [time {sort.uk.phone $list} $count] 0]
    set m2 [lindex [time {sort.uk.phone2 $list} $count] 0]
    set ratio [expr {double($m1) / double($m2)}]
    puts [format " %6d | %8d | %8d | %9.2f" $l $m1 $m2 $ratio]
    incr l $l
    lappend list {*}$list
}
======

The results are shown below:

======none
Length | Method 1 | Method 2 | Ratio (m1 / m2)
-------+----------+----------+----------------
     6 |     1410 |      500 |      2.82
    12 |     4200 |     1000 |      4.20
    24 |    10820 |     1900 |      5.69
    48 |    27140 |     3710 |      7.32
    96 |    65890 |     7310 |      9.01
   192 |   160000 |    10000 |     16.00
   384 |   361000 |    30000 |     12.03
   768 |   771000 |    60000 |     12.85
  1536 |  1783000 |   120000 |     14.86
  3072 |  3805000 |   251000 |     15.16
  6144 |  8352000 |   571000 |     14.63
 12288 | 18166000 |  1061000 |     17.12
 24576 | 39077000 |  2153000 |     18.15
======

[DKF] This technique is (I believe) known as creating a collation key.  Also note that this comparison is not entirely fair, as it is performing additional `[regsub]`'''s in the `-command` case, though it is easy to see from a comparison of `lsort $someList` with `lsort -command {string compare} $someList` that it is still painful to use `-command` when you don't have to.

- See [Custom sorting] on a wrapper for collation key sorting.



**Sorting on multiple indexes**

How does one sort a multi-index?  By that is meant something like,
"sort by last name, then first name, then age".  lsort doesn't offer
direct support for multi-indexes.  The straightforward approach, 
therefore, is to use -command.  However,
[Bruce Hartweg] wisely observes that, as lsort sorts ''stably'', an
efficient multi-index sort is achieved by cascading lsort
applications--least significant first, of course. See [Additional list functions] for ''multisort'' which packages such cascaded sorts into one call.

In the same vein, [Andreas Kupries] wrote on 2001-03-21 in [comp.lang.tcl]:

The lsort command uses a 'stable' mergesort. This means that it is
possible to sort something after more than one key and the later sorts
will not destroy the order of the previous sorts/keys. Quicksort for
example is not stable (''DKF:'' it also has some rotten worst-case behaviour, unlike mergesort.)

So, just sort after the secondary key first and then after the primary
key. like so:

======
set f {{12 11} {12 13} {12 12} {11 14} {13 12}}

puts [lsort -index 1 $f]
puts [lsort -index 0 [lsort -index 1 $f]]
======

======
{12 11} {12 12} {13 12} {12 13} {11 14}
{11 14} {12 11} {12 12} {12 13} {13 12}
======

[DKF]: If you are doing this on a lot of data, it is far more efficient to convert the keys into an additional collation key (as described above in relation to performance-measuring) and sort on that, as this adds two ''O(n)'' (i.e. linear) operations to generate and strip the collation keys and removes an ''O(n log n)'' sort.

[KBK]: [Custom sorting] has a worked example of doing what [DKF] recommends above.

[JRW]: Both approaches are asymptotically equivalent.  If you are trying to sort records on ''k'' fields, then either you do one sort where each sort comparison involves comparing ''k'' fields, or you do k (stable) sort passes where each comparison involves one field.  Either way, k n log n comparisons.  ([KBK] But the constant factor for `lsort -command` is horrible, see above. Do everything that you can to avoid using `lsort -command`, and you will be happier in the long run.)

** `lsort -command` **

[RS]:

One thing `lsort -command` is good for is to introspect how lsort works. Consider this silly sort function:

======
proc foo {a b} {puts [list $a $b]; return 0}
======

which constantly says its inputs are equal, but logs them to stdout:

======
lsort -command foo {1 2 3 4 5 6 7 8 9 10}
======
======none
1 2
3 4
1 3
2 3
5 6
7 8
5 7
6 7
1 5
2 5
3 5
4 5
9 10
1 9
2 9
3 9
4 9
5 9
6 9
7 9
8 9
======

Makes it pretty clear how a merge sort works, doesn't it? 

----

[FW]: Why is -command so slow, anyway?

[NEM] 2009-03-22: I profiled this at one point. The overhead comes partly from calling a command: looking it up, checking for any command/execution `[trace%|%traces]`, etc.
You can eliminate some of this, but not all. Another big killer is simply having to pack/unpack the comparison result (-1,0,1) into a Tcl_Obj. That puts some
allocs/frees into the critical loop. Making this run fast would probably require some heavy-duty re-engineering of large parts of how Tcl deals with values and commands.

** `lsort` and  [unicode] **

[bll] 2016-1-10: See [collation] for information on how to sort in the locale's collation sequence.

[LV] 2003-06-02: 

Does lsort support [l10n] considerations - such as sorting \u0063 equivalent
to \u00e7 - at least in a French locale?

[RS]: No, sorry. One would have to introduce [custom sorting] explicitly depending on requirements (see e.g. [Things British] for British phone-book sorting).

----

[HolgerJ] 2003-10-13: I'm still amazed by the fact that `lsort -dictionary` does ''not'' sort by Unicode dictionary order. I know that the
algorithm is quite complex and '''language dependent''' (one would have to read the environment to find out the
language), but at least the most common things could be considered - like
Umlauts and French accents. `lsort -dictionary` should be renamed `lsort -ignorecase` or the like.

So, if you want to sort according to dictionary rules, you have to supply your own strcmp-like function or
re-implement the whole lsort command. The latter might be faster, but not so convenient.

----

[RS] Is there a Unicode dictionary order, except for the numeric codes (which [lsort] does)? As far as I know, different languages have different collation orders, e.g. 
"Ä" might be sorted just as A, or as AE, in German, while in Swedish it sorts after Å... which sorts after Z...

----

[HolgerJ] 2003-10-19: No, see above. "Ä" does principally sort like "A" in German, but the trema (dots) are used as a tiebreaker. This
is called secondary sorting order. For details see http://www.unicode.org/unicode/reports/tr10/
Quotation: 'This is worth emphasizing: the sorting weight of characters is not provided by their position in the Unicode code charts.
'
Hopefully, this will be implemented in C the Tcl core. In Tcl it would bee too slow, I think.

The following idea of [Lars H]. looks fine, but lacks support for the secondary (and tertiary) order.

----

[Lars H]: Whereas I agree the Unicode support in Tcl is incomplete, I don't agree on the particular example 
(dated 2003-10-13 by Holger J above).
The Unicode Standard [http://www.unicode.org/versions/Unicode4.0.0] is littered with remarks that issues 
such as this should ''not'' be handled by Unicode as such, but by a higher level mechanism. One could
have an option to `[lsort]` that specifies a command that should be applied to the sort keys before they are
compared, or perhaps a `[string map]` style list to normalize the characters, but the Tcl core has no business
implementing language-specific sorting as such. Also, `-dictionary` does far more than the old (Tcl7) -ignore 
option of `[lsort]` did (e.g., it places a2 before a10).

Currently, this is what you should do to get language-specific sorting

======
set temp_list [list]
foreach item $list_to_sort {
    lappend temp_list [list $item [string map $collateList $item]]
    # $collateList is what handles language-dependence 
}
set sorted_list [list]
foreach item [lsort -dictionary -index 1 $temp_list] {
    lappend sorted_list [lindex $item 0]
}
======

`$sorted_list` is now sorted in the custom order. For Swedish, the collateList should probably be something on 
the lines of

======
set collateList {å \x80 Å \x80 ä \x81 Ä \x81 ö \x82 Ö \x82 W V w v é e É E}
======

I need to do something special for `ÅÄÖåäö`, because the native order in latin-1 (and hence in Unicode) is wrong
for Swedish.  

What I miss most in Tcl's Unicode support is some way of normalizing strings (at least to some of the standard
normalisation forms defined in the standard and its annexes). '''decomposed''' and '''precomposed''' 
subcommands of `[string]` appear to me as the most obvious way to do this.

----

[LV]: If someone is looking for a project, research http://www.unicode.org/reports/tr10/tr10-10.html (or whatever variation is the latest of the document) - this is a technical note by the Unicode folks on the ''fun'' of collation of unicode strings. A Tcl extension providing an interface to the algorithm would be a wonderful contribution.


** lsort in-place reference **

[AMG]: If you want to modify a sorted list in-place and keep the result sorted, check out [Sorted Lists] for alternatives to repeatedly calling lsort.



**lsort merge**

[KBK] 2005-03-29 in the [Tcl chatroom]: Since Tcl's merge exploits natural runs, you can also merge two lists without an extra `merge` command to do it:

======
lsort [concat $list1 $list2]
======

works in O(n) time if `$list1` and `$list2` are already sorted.


** multisort **

[MG] 2005-07-29: I have a list, where each element is itself a two-element list, and I want to sort based on ''both'' elements, one after the other. It seems that the -index option to lsort can only take a single index, though - I'd been thinking of trying something like this

======
% lsort -index {1 0} [list {f 5} {a 5} {b 3}]
{b 3} {a 5} {f 5}
======

but it raises an error. Should I just use a custom `[proc]` and the `-command` option, or is there a better way that I'm not aware of? Thanks in advance for your help :)

[Peter Newman] 2005-07-29: Search for '''multisort''' above in this page and in [Additional list functions].


** `-integer` and values that are not integers **

Note that before Tcl 8.5, some argument combinations were not correct, but
were not reported as an error. For example

======
lsort -integer numlist
======

which is a mistake in a couple of ways, generates an error now where in the
past the fact that no integer arguments were provided was not raised as an
error.


** Sort by `[string length]` **

[AMG]: Here's how to sort a list by ascending `[string length]`.  It demonstrates the use of [[[apply]]] for [lambda]s.

======
set data {this is tcl}
lsort -command {apply {{a b} {expr {[string length $a] - [string length $b]}}}} $data
# returns "is tcl this"
======

** `lsort -unique -indices` **

[AMG]: The Tcl 8.6b2 [http://www.tcl.tk/man/tcl8.6/TclCmd/lsort.htm%|%man page] for `lsort`
doesn't clearly document what exactly happens when `-unique` and `-indices` are used together.  Here's how it works: the index of the ''last'' duplicate element is returned.  The man page does make mention of this behavior in connection to custom comparisons (the example used is `-index` 0), but you have to extrapolate from there to guess that `lsort` returns the index of the last duplicate when -indices is used.  When neither a custom comparison nor -indices are used, it completely does not matter which duplicate is returned.

** Checking for duplicates **

[AMG]: To check if a list contains duplicates, try:

======
if {[llength $list] != [llength [lsort -unique $list]]} {
    puts "list contains duplicates"
}
======

The number of duplicates is equal to the difference in list lengths, since [[lsort -unique]] removes duplicates.

Various [[lsort]] switches can be used to define equality.  For example, `-integer` would cause "0" and "-0" to compare equal and therefore be considered as duplicates.

Actually getting a list of duplicate keys isn't so straightforward.  [[lsort]] doesn't say what it's removing, only what it's not, so you have to subtract its result from its input.  That then gives the list of duplicates.  However, what if a key appears more than two times?  It'll show up multiple times in the duplicate list.  Cleaning that up takes another call to [[lsort -unique]].

Give this code a try.  The idea is it returns a list containing the first (lowest-indexed) instance of each duplicate word, where equality is defined according to the usual [[lsort]] switches.  Just don't pass it `-indices`, `-stride`, or `-unique`.  More work is needed to support `-indices` and `-stride`.

======
proc duplicates {args} {
    set list [lsort [lindex $args end]]
    foreach word [lsort -unique {*}$args] {
        if {[set match [lsearch -all -sorted $list $word]] ne {}} {
            set list [lreplace $list [lindex $match end] [lindex $match end]]
        }
    }
    lsort -unique {*}[lrange $args 0 end-1] $list
}
======

If you think this can be done more simply, don't forget about all the comparison switches that [[lsort]] has.

Because [[lsort]] doesn't have enough switches already (that's a joke), perhaps a `-duplicate` switch could be added to keep only the duplicate words.  In combination with `-unique`, only the first instance of each word would be reported.  I don't think this new switch would conflict with any of the existing switches.

** See Also **

   [Custom sorting]:   sort using a collation key

   [list]:   

   [Topological sort]:   

   [How fast can we sort 5 elements]:   

   [lsort index vector]:   

   [Forward-compatible lsort]:   add support for -stride to Tcl 8.5

<<categories>> Tcl syntax help | Arts and crafts of Tcl-Tk programming | Command