Natural sort order is what we usually see in most file managers when browsing a list of numbered files, that it sorts the numerical part of the file name as a number instead of comparing it character by character. For instance:
Lexicographical sort | Natural sort |
---|---|
1.txt | 1.txt |
10.txt | 2.txt |
100.txt | 3.txt |
101.txt | 4.txt |
102.txt | 5.txt |
103.txt | 10.txt |
104.txt | 11.txt |
105.txt | 12.txt |
11.txt | 13.txt |
12.txt | 14.txt |
13.txt | 15.txt |
14.txt | 100.txt |
15.txt | 101.txt |
2.txt | 102.txt |
3.txt | 103.txt |
4.txt | 104.txt |
5.txt | 105.txt |
Lexicographical sort can be easily implemented using String::compareToIgnoreCase
, but it is not very acceptable for end users. However for natural sort, things is actually a little more complicated.
The Definition
There is a small Wikipedia article on Natural Sort Order, which mentioned the following definition for natural sort order:
Natural sort order is an ordering of strings in alphabetical order, except that multi-digit numbers are ordered as a single character.
However, there’s still some undefined behavior about this sort order when case-insensitivity is involved alongside treating numbers as a whole. In an article discussing natural sort order in C, the author mentioned the following case:
Name |
---|
Test1 |
tesT1 |
test1 |
tEst2 |
test2 |
test3 |
These string is listed in a reasonable way, however if you simply define the natural sorting to be sorting chunks of characters and numbers individually, the Test
string have to be equal to test
so that the ordering won’t be deterministic.
And there’s another case involving leading zeros:
Name |
---|
test1 |
test01 |
test001 |
test01a |
Intuitively, things that are shorter should come before those longer ones, so equal numbers with less leading zeros should come before those with more. Furthermore, if there’s more text following the number, it should come after those without, thus the aforementioned leading zero comparison has a lower priority.
I didn’t find any standard for these corner cases, so I’m merely deciding on the most reasonable approach.
Various Implementations
There’s an article on CodingHorror and a StackOverflow question that referenced some existing implementations.
And here are some additional implementations that I found inside some battle-tested applications:
- Cabinet/AlphanumComparator: Adapted from DaveKoelle’s
AlphanumComparator
. - glib/g_utf8_collate_key_for_filename: Used by Nautilus, generates a new string that can be used for comparison.
- gnulib/filevercmp: Used by GNU coreutils’
ls -v
. - coreutils/sort: Used by GNU coreutils’
sort -V
.
But there’s some common issues in existing implementations:
- Use of regular expression: The comparator is frequently evaluated during the sort, so performance is our concern and splitting on regular expressions is not what we expect under such conditions.
- Memory allocation: Also due to the frequent invocation, we should avoid memory allocation (e.g. actually building those “chunks”), and this can be done by iteration over the two strings.
- Unicode awareness: Unicode code points are not necessarily single
char
s (think of surrogates), so those code point related methods should be used instead of simple increment to advance the index of current “character”. - Locale awareness: Instead of
c >= '0' && c <= '9'
, one should useCharacter.isDigit()
andCharacter.digit()
for getting numbers out of a character. - Correctness: Some implementations cannot handle the two aforementioned corner cases properly.
- Coding style: Some implementations are simply not so readable.
My Implementation
So I wrote my own implementation with those issues in mind, and it basically iterates over the two string and handles consecutive digits together. Most of the code is self-explanatory and the comments can also describe the algorithm:
1 | public class NaturalOrderComparator implements Comparator<String> { |
The code is also available as a GitHub gist and is licensed under the Apache 2.0 license.
Further Considerations
In fact, to account for Unicode’s compatibility decomposition, canonical composition and case folding, an ICU Normalizer
should be used with mode NFKC_Casefold
. However, since the ICU package is not available on Android until Android Oreo, and the JDK implementation didn’t account for such level of correctness, I’m currently leaving it out and you can trivially add it by applying the normalization before comparison.