Writing a Comparator for Natural Sort Order

Natural sort order is what we usually see in most file managers when browsing a list of numbered files, that it sorts the numerical part of the file name as a number instead of comparing it character by character. For instance:

Lexicographical sort	Natural sort
1.txt	1.txt
10.txt	2.txt
100.txt	3.txt
101.txt	4.txt
102.txt	5.txt
103.txt	10.txt
104.txt	11.txt
105.txt	12.txt
11.txt	13.txt
12.txt	14.txt
13.txt	15.txt
14.txt	100.txt
15.txt	101.txt
2.txt	102.txt
3.txt	103.txt
4.txt	104.txt
5.txt	105.txt

Lexicographical sort can be easily implemented using String::compareToIgnoreCase, but it is not very acceptable for end users. However for natural sort, things is actually a little more complicated.

The Definition

There is a small Wikipedia article on Natural Sort Order, which mentioned the following definition for natural sort order:

Natural sort order is an ordering of strings in alphabetical order, except that multi-digit numbers are ordered as a single character.

However, there’s still some undefined behavior about this sort order when case-insensitivity is involved alongside treating numbers as a whole. In an article discussing natural sort order in C, the author mentioned the following case:

Name
Test1
tesT1
test1
tEst2
test2
test3

These string is listed in a reasonable way, however if you simply define the natural sorting to be sorting chunks of characters and numbers individually, the Test string have to be equal to test so that the ordering won’t be deterministic.

And there’s another case involving leading zeros:

Name
test1
test01
test001
test01a

Intuitively, things that are shorter should come before those longer ones, so equal numbers with less leading zeros should come before those with more. Furthermore, if there’s more text following the number, it should come after those without, thus the aforementioned leading zero comparison has a lower priority.

I didn’t find any standard for these corner cases, so I’m merely deciding on the most reasonable approach.

Various Implementations

There’s an article on CodingHorror and a StackOverflow question that referenced some existing implementations.

And here are some additional implementations that I found inside some battle-tested applications:

Cabinet/AlphanumComparator: Adapted from DaveKoelle’s AlphanumComparator.
glib/g_utf8_collate_key_for_filename: Used by Nautilus, generates a new string that can be used for comparison.
gnulib/filevercmp: Used by GNU coreutils’ ls -v.
coreutils/sort: Used by GNU coreutils’ sort -V.

But there’s some common issues in existing implementations:

Use of regular expression: The comparator is frequently evaluated during the sort, so performance is our concern and splitting on regular expressions is not what we expect under such conditions.
Memory allocation: Also due to the frequent invocation, we should avoid memory allocation (e.g. actually building those “chunks”), and this can be done by iteration over the two strings.
Unicode awareness: Unicode code points are not necessarily single chars (think of surrogates), so those code point related methods should be used instead of simple increment to advance the index of current “character”.
Locale awareness: Instead of c >= '0' && c <= '9', one should use Character.isDigit() and Character.digit() for getting numbers out of a character.
Correctness: Some implementations cannot handle the two aforementioned corner cases properly.
Coding style: Some implementations are simply not so readable.

My Implementation

So I wrote my own implementation with those issues in mind, and it basically iterates over the two string and handles consecutive digits together. Most of the code is self-explanatory and the comments can also describe the algorithm:

public class NaturalOrderComparator implements Comparator<String> {

    private static final int DIGIT_RADIX = 10;

    @Override
    public int compare(String string1, String string2) {
        int start1 = 0;
        int start2 = 0;
        int leadingZeroCompareResult = 0;
        while (start1 < string1.length() && start2 < string2.length()) {
            int codePoint1 = string1.codePointAt(start1);
            int codePoint2 = string2.codePointAt(start2);
            // Check if both code points are digits.
            if (!Character.isDigit(codePoint1) || !Character.isDigit(codePoint2)) {
                if (!codePointEqualsIgnoreCase(codePoint1, codePoint2)) {
                    return codePointCompareToIgnoreCase(codePoint1, codePoint2);
                }
                start1 = string1.offsetByCodePoints(start1, 1);
                start2 = string2.offsetByCodePoints(start2, 1);
                continue;
            }
            // Get end of current number.
            int end1 = start1;
            do {
                end1 = string1.offsetByCodePoints(end1, 1);
            } while (end1 < string1.length() && Character.isDigit(string1.codePointAt(end1)));
            int end2 = start2;
            do {
                end2 = string2.offsetByCodePoints(end2, 1);
            } while (end2 < string2.length() && Character.isDigit(string2.codePointAt(end2)));
            // Get start of current number without leading zeros.
            int noLeadingZeroStart1 = start1;
            while (noLeadingZeroStart1 < end1 && Character.digit(string1.codePointAt(
                    noLeadingZeroStart1), DIGIT_RADIX) == 0) {
                noLeadingZeroStart1 = string1.offsetByCodePoints(noLeadingZeroStart1, 1);
            }
            int noLeadingZeroStart2 = start2;
            while (noLeadingZeroStart2 < end2 && Character.digit(string2.codePointAt(
                    noLeadingZeroStart2), DIGIT_RADIX) == 0) {
                noLeadingZeroStart2 = string2.offsetByCodePoints(noLeadingZeroStart2, 1);
            }
            // If the two lengths of numbers (without leading zeros) differ, the shorter one comes
            // first.
            int noLeadingZeroLength1 = string1.codePointCount(noLeadingZeroStart1, end1);
            int noLeadingZeroLength2 = string2.codePointCount(noLeadingZeroStart2, end2);
            if (noLeadingZeroLength1 != noLeadingZeroLength2) {
                return noLeadingZeroLength1 - noLeadingZeroLength2;
            }
            // If any two digits starting from the first non-zero ones differs, the less one comes
            // first.
            for (int digitIndex1 = noLeadingZeroStart1, digitIndex2 = noLeadingZeroStart2;
                 digitIndex1 < end1; digitIndex1 = string1.offsetByCodePoints(digitIndex1, 1),
                         digitIndex2 = string2.offsetByCodePoints(digitIndex2, 1)) {
                int digit1 = Character.digit(string1.codePointAt(digitIndex1), DIGIT_RADIX);
                int digit2 = Character.digit(string2.codePointAt(digitIndex2), DIGIT_RADIX);
                if (digit1 != digit2) {
                    return digit1 - digit2;
                }
            }
            // If the two numbers are the same, the one with less leading zeros (shorter) comes
            // first.
            int leadingZeroLength1 = string1.codePointCount(start1, noLeadingZeroStart1);
            int leadingZeroLength2 = string2.codePointCount(start2, noLeadingZeroStart2);
            if (leadingZeroLength1 != leadingZeroLength2) {
                if (leadingZeroCompareResult == 0) {
                    leadingZeroCompareResult = leadingZeroLength1 - leadingZeroLength2;
                }
            }
            start1 = end1;
            start2 = end2;
        }
        // If one of the two strings is exhausted first, it comes first.
        int remainingLength1 = string1.codePointCount(start1, string1.length());
        int remainingLength2 = string2.codePointCount(start2, string2.length());
        if (remainingLength1 != remainingLength2) {
            return remainingLength1 - remainingLength2;
        }
        // The one with less leading zeros (shorter) comes first if others are the same.
        if (leadingZeroCompareResult != 0) {
            return leadingZeroCompareResult;
        }
        // Fall back to plain comparison.
        return string1.compareTo(string2);
    }

    // @see String#regionMatches(boolean, int, String, int, int)
    private static boolean codePointEqualsIgnoreCase(int codePoint1, int codePoint2) {
        codePoint1 = Character.toUpperCase(codePoint1);
        codePoint2 = Character.toUpperCase(codePoint2);
        if (codePoint1 == codePoint2) {
            return true;
        }
        codePoint1 = Character.toLowerCase(codePoint1);
        codePoint2 = Character.toLowerCase(codePoint2);
        return codePoint1 == codePoint2;
    }

    // @see String.CaseInsensitiveComparator#compare(String, String)
    private static int codePointCompareToIgnoreCase(int codePoint1, int codePoint2) {
        if (codePoint1 != codePoint2) {
            codePoint1 = Character.toUpperCase(codePoint1);
            codePoint2 = Character.toUpperCase(codePoint2);
            if (codePoint1 != codePoint2) {
                codePoint1 = Character.toUpperCase(codePoint1);
                codePoint2 = Character.toUpperCase(codePoint2);
                if (codePoint1 != codePoint2) {
                    return codePoint1 - codePoint2;
                }
            }
        }
        return 0;
    }
}

The code is also available as a GitHub gist and is licensed under the Apache 2.0 license.

Further Considerations

In fact, to account for Unicode’s compatibility decomposition, canonical composition and case folding, an ICU Normalizer should be used with mode NFKC_Casefold. However, since the ICU package is not available on Android until Android Oreo, and the JDK implementation didn’t account for such level of correctness, I’m currently leaving it out and you can trivially add it by applying the normalization before comparison.