asveikau 10 hours ago

Sorting is language-specific even if you're restricted to languages using Latin characters. E.g., how do you sort N relative to Ñ? How do you treat the Turkish variations on the letter I?

Doing a dumb sort by character or byte values is obviously the wrong call for any diacritics, but the right call may also depend on the language.
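
To make the "dumb sort" concrete, here is a minimal sketch, assuming UTF-8 input and a POSIX shell. `byte_sort` and the sample words are made up for illustration:

```shell
# byte_sort is a hypothetical helper name: it sorts its arguments by raw byte value.
byte_sort() {
  printf '%s\n' "$@" | LC_ALL=C sort
}

# Assumes UTF-8 input: Ñ encodes as bytes 0xC3 0x91, above every ASCII letter.
byte_sort Ñandú Nube Zorro
# prints Nube, Zorro, Ñandú (one per line)
```

A Spanish collation would put Ñandú between Nube and Zorro, since Ñ follows N in the Spanish alphabet.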

  • dmurray 9 hours ago

    And that's why there are a hundred different possible values for LC_COLLATE, and it's completely normal that two popular Unix distributions picked different default values for that setting...right?

    It would have been reasonable to conclude the article a third of the way through, and say "sorting is locale-dependent; if what you value is consistent behaviour between different OSes (instead of sorting based on the user's preferences), you need to implement the sorting yourself."

    • harrall 6 hours ago

      Or set LC_ALL=C, which gives you consistent sorting behavior.

      The article does mention it, but only in passing.

  • encom 8 hours ago

    Before the Danish language adopted the letter "å" (in 1948), the vowel was written as "aa". In the Danish alphabet, "å" is the last letter. Therefore a list of three Danish city names would be correctly sorted as:

      * Albertslund
      * Odense
      * Aarhus
    
    This feels like material for another Tom Scott video.
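
    A rough sketch of how a byte-order sort handles the same list, assuming a POSIX shell (`sort_cities` is a made-up helper name):

    ```shell
    # sort_cities LOCALE: hypothetical helper; sorts the three city names under LOCALE.
    sort_cities() {
      printf '%s\n' Albertslund Odense Aarhus | LC_ALL="$1" sort
    }

    # Byte order puts "Aa" before "Al", so Aarhus comes first, not last.
    sort_cities C
    # prints Aarhus, Albertslund, Odense (one per line)
    ```

    Passing a Danish locale name such as da_DK.UTF-8 instead of C may give the order above, but only if that locale is installed, and the result depends on your libc's locale data.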

    • tpmoney 6 hours ago

      Not Tom Scott, but Dylan Beattie has done a handful of interesting talks[1] effectively on "there's no such thing as plain text" which in part covers this sort of thing. In fact, I think your Danish cities list is actually one of his examples.

      [1]: https://www.youtube.com/watch?v=gd5uJ7Nlvvo

kbd 2 hours ago

In my Zsh startup on Mac I had to worry about collation, as I expected ~ to sort last (I have a directory prefixed with ~ to load plugins that need to be loaded last). Idk why a locale of utf-8 has it sorting differently, but I needed LC_COLLATE=C to have it sort as expected:

    # source all shell config
    export LC_COLLATE=C # ensure consistent sort, ~ at end
    for file in ~/bin/shell/**/*.(z|)sh; do
      source "$file";
    done
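
For what it's worth, the reason C puts ~ last is just byte values: ~ is 0x7E, above every ASCII letter and digit. A quick sketch outside Zsh, assuming a POSIX shell (`tilde_demo` and the sample names are made up for illustration):

```shell
# tilde_demo is a hypothetical helper: it sorts three sample names in byte order.
tilde_demo() {
  printf '%s\n' '~plugins' 'zz-last' 'aliases' | LC_ALL=C sort
}

# '~' (0x7E) compares greater than 'z' (0x7A), so the ~ entry lands at the end.
tilde_demo
# prints aliases, zz-last, ~plugins (one per line)
```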

o11c 8 hours ago

Minor note: on Debian (and possibly other distros), you don't have to use `locale-gen` to dynamically build things into `$complocaledir/locale-archive` (which, incidentally, can cause random breakage for programs that happen to start during system upgrades).

The `locales-all` package works more like macOS. It's only a ~10MB download but unpacks to take ~250MB of disk space (these numbers will vary based on your libc version and packaging format).

There are a lot of sparse arrays and UTF-32 character data in compiled locales.

Incidentally, the command to dump a locale's data is:

  LC_ALL=whatever locale -ck `locale | sed 's/=.*//; /LANG\|LC_ALL/d'`

kenada 5 hours ago

When I updated the Darwin SDK and source releases in nixpkgs last year, I tried using the FreeBSD locale data. It worked in a technical sense, but it broke things that depended on the quirks in Apple’s locale data. That statement about compatibility is unfortunately true.

1a527dd5 7 hours ago

Ask anyone who did a Postgres upgrade. The words "collate" and "glibc" are enough to make me pause now. Learnt loads, never really going to use it again, but man, do I understand the pain it causes now.

bluedino 5 hours ago

Now I'm remembering all the fun we had a long time ago with PHP websites that used an AS/400 as a data source. The two didn't sort the same way, and the mom-and-pop web dev shop hired to build the site didn't understand the issue, tried to hack around it, and failed.

skopje 10 hours ago

So the ISO way is the right way, right?

  • dataflow 9 hours ago

    I wondered the same. What's the right ordering?

loeg 11 hours ago

(2020)

greesil 8 hours ago

It's not a stable sort?