The ü/ü Conundrum
179 points by firstSpeaker 1 year ago | 275 comments
- re 1 year ago> Can you spot any difference between “blöb” and “blöb”?
It's tricky to determine this because normalization can end up getting applied unexpectedly (for instance, on Mac, Firefox appears to normalize copied text as NFC while Chrome does not), but by downloading the page with cURL and checking the raw bytes I can confirm that there is no difference between those two words :) Something in the author's editing or publishing pipeline is applying normalization and not giving her the end result that she was going for.
Let's see if I can get HN to preserve the different forms:

  00009000: 0a3c 7020 6964 3d22 3066 3939 223e 4361  .<p id="0f99">Ca
  00009010: 6e20 796f 7520 7370 6f74 2061 6e79 2064  n you spot any d
  00009020: 6966 6665 7265 6e63 6520 6265 7477 6565  ifference betwee
  00009030: 6e20 e280 9c62 6cc3 b662 e280 9d20 616e  n ...bl..b... an
  00009040: 6420 e280 9c62 6cc3 b662 e280 9d3f 3c2f  d ...bl..b...?</
Composed: ü
Decomposed: ü
Edit: Looks like that worked!
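For the curious, the two forms are easy to inspect without cURL; a minimal Python sketch, standard library only:

  import unicodedata

  composed = "\u00fc"     # ü as one precomposed codepoint
  decomposed = "u\u0308"  # u followed by U+0308 COMBINING DIAERESIS

  print(composed == decomposed)  # False: same rendering, different codepoints
  print([unicodedata.name(c) for c in composed])
  # ['LATIN SMALL LETTER U WITH DIAERESIS']
  print([unicodedata.name(c) for c in decomposed])
  # ['LATIN SMALL LETTER U', 'COMBINING DIAERESIS']
  print(unicodedata.normalize("NFC", decomposed) == composed)  # True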
- mgaunard 1 year agoI believe XML and HTML both require Unicode data to be in NFC.
- fanf2 1 year agoI don’t think so?
https://www.w3.org/TR/2008/REC-xml-20081126/#charsets
XML 1.1 says documents should be normalized but they are still well-formed even if not normalized
https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-normaliza...
But you should not use XML 1.1
https://www.ibiblio.org/xml/books/effectivexml/chapters/03.h...
- mbrubeck 1 year agoHTML does not require NFC (or any other specific normalization form):
https://www.w3.org/International/questions/qa-html-css-norma...
Neither does XML (though XML 1.0 recommends that element names SHOULD be in NFC and XML 1.1 recommends that documents SHOULD be fully normalized):
https://www.w3.org/TR/2008/REC-xml-20081126/#sec-suggested-n...
- layer8 1 year agoYou believe incorrectly. Not even Canonical XML requires normalization: https://www.w3.org/TR/xml-c14n/#NoCharModelNorm
- Eisenstein 1 year agoPerhaps the author used the same character twice for effect, not suspecting someone would use curl to examine the raw bytes?
- mglz 1 year agoMy last name contains an ü and it has been consistently horrible.
* When I try to preemptively replace ü with ue many institutions and companies refuse to accept it because it does not match my passport
* Especially in France, clerks try to emulate ü with the diacritics used for the trema e, ë. This makes it virtually impossible to find me in a system again
* Sometimes I can enter my name as-is and there seems to be no problem, only for some other system to mangle it to � or a box. This often triggers errors downstream that I have no way of fixing
* Sometimes, people print a u and add the diacritics by hand on the label. This is nice, but still somehow wrong.
I wonder what the solution is. Give up and ask people to consistently use an ASCII-only name? Allow everybody 1000+ Unicode characters as a name and go off that string? Officially change my name?
- makeitdouble 1 year agoThe part I came to love about France in general is that while all of these are broken, the people dealing with it will completely agree it's broken and amply sympathize, but just accept your name is printed as G�nter.
Same for names that don't fit field lengths, addresses that require street numbers etc. It's a real pain to deal with all of it and each system will fail in its own way to make your life a mess, but people will embrace the mess and won't blink an eye when you bring paper that just don't match.
- zokier 1 year agoUnder GDPR people have the right to have their personal data to be accurate, there was a legal case exactly about this: https://news.ycombinator.com/item?id=38009963
- makeitdouble 1 year agoThat's a pretty unexpected twist, and I'm thrilled with it.
I don't see every institution come up with a fix anytime soon, but having it clear that they're breaking the law is such a huge step. That will also have a huge impact on bank system development, and I wonder how they'll do it (extend the current system to have the customer facing bits rewritten, or just redo it all from top to bottom)
There is the tale of Mizuho bank [0], botching their system upgrade project so hard they were still seeing widespread failures after a decade into it.
[0] https://www.japantimes.co.jp/news/2022/02/11/business/mizuho...
- rurban 1 year agoSo it's time to finally ditch the POSIX string libc and adopt u8 as the universal string type, which can finally do normalized find.
All the coreutils still cannot find strings, just buffers. Zero-terminated buffers are NOT strings; strings are Unicode.
https://perl11.github.io/blog/foldcase.html
This is not just convenience, it also has spoofing security implications for all names. C and C++11 are insecure since C11. https://github.com/rurban/libu8ident/blob/master/doc/c11.md Most other programming languages and OS kernels also.
- jojobas 1 year ago> Does it mean Z̸̰̈́̓a̸͖̰͗́l̸̻͊g̸͙͂͝ǒ̷̬́̐ can finally have a bank account?
I wonder if this also means one can require that a European bank have a name on file in Kanji, Thai script or some other not-so-well-known-in-Europe alphabet.
- Faaak 1 year agoAhah, I can relate to that. My driving license doesn't spell my name correctly, and somehow nobody cares. I somehow like this "nah, who cares" attitude
- zokier 1 year ago> * Especially in France, clerks try to emulate ü with the diacritics used for the trema e, ë. This makes it virtually impossible to find me in a system again
In Unicode, umlaut and diaeresis are both represented by the same codepoint, U+0308 COMBINING DIAERESIS.
- samatman 1 year agoThe only solution is going to be a lot of patience, unfortunately.
Everyone should be storing strings as UTF-8, and any time strings are being compared they should undergo some form of normalization. Doesn't matter which, as long as it's consistent. There's no reason to store string data in any other format, and any comparison code which isn't normalizing is buggy.
But thanks to institutional inertia, it will be a very long time before everything works that way.
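A minimal sketch of that rule in Python (NFC chosen arbitrarily here; NFD would work just as well, as long as both sides of every comparison get the same treatment):

  import unicodedata

  def strings_equal(a, b):
      # Normalize both sides to the same form before comparing.
      return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

  composed = "bl\u00f6b"     # "blöb" with precomposed ö
  decomposed = "blo\u0308b"  # "blöb" with o + combining diaeresis

  print(composed == decomposed)               # False: byte-wise comparison
  print(strings_equal(composed, decomposed))  # True: normalized comparison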
- lmm 1 year ago> Everyone should be storing strings as UTF-8, and any time strings are being compared they should undergo some form of normalization. Doesn't matter which, as long as it's consistent. There's no reason to store string data in any other format, and any comparison code which isn't normalizing is buggy.
This will result in misprinting Japanese names (or misprinting Chinese names depending on the rest of your system).
- earthboundkid 1 year agoCan we please talk about Unicode without the myth of Han Unification being bad somehow? The problem here is exactly the lack of unification in Roman alphabets!
- RedNifre 1 year agoHow?
- bouke 1 year agoNo system will get support for unicode by just the passing of time. Software needs to be upgraded/replaced for that to happen. Reluctant institutions will not just do that, and need external pressure.
- zokier 1 year agoGermans of course have a standard for this
> a normative subset of Unicode Latin characters, sequences of base characters and diacritic signs, and special characters for use in names of persons, legal entities, products, addresses etc
- em-bee 1 year agoand it's used in the passport too. so names with umlaut show up in both forms and it is possible to match either form
- Chilko 1 year ago> Officially change my name?
My German last name also contains an ü, so when we emigrated to an English-speaking country and obtained dual citizenship we used 'ue' for that passport, and I now use 'ue' on a day-to-day basis. This also means I have two slightly different legal surnames depending on which passport I go by.
- hodgesrm 1 year agoAt least German transliteration is 1-to-1. Slavic names among others often have multiple transliterations available. The Russian name Валерий can be rendered for example as Valery, Valeriy, or Valeri. It's very confusing for documents that require the person's name.
[0] https://en.wikipedia.org/wiki/Wikipedia:Romanization_of_Russ...
- krab 1 year agoThat's the English transliteration. Don't forget that other Slavic languages also transcribe according to their own rules.
For example in Czech, Валерий would be transliterated as Valerij because "j" is pronounced in Czech as English "y" in "you".
- bobthepanda 1 year agoAlso don't forget Chinese, which due to different romanizations or different dialects being used for the romanization, can result in different outputs depending on whether a person is from PRC, ROC, Macao, Hong Kong, or Singapore.
- koliber 1 year agoTransliteration is a two way street. Non-Russian names get transliterated into Cyrillic inconsistently as well.
- illiac786 1 year agoThere's an ISO standard for this. Can't find it, but I am 100% sure for Russian, for example.
- fsckboy 1 year agojust out of curiosity, can you port the ue back to Germany (or wherever) or will they automatically transform it to ü? (could you change your name in a German speaking country to Mueller et al?)
- wongarsu 1 year agoIn Germany, there are some names that use ue, ae or oe instead of ü, ä, ö, and you run into issues with some systems wrongly autocorrecting it to the umlaut. Usually not a big deal, but having the umlaut is less error prone than the transliteration in Germany.
- Scharkenberg 1 year agoThe most famous German poet is (probs) Goethe. Still written with oe to this day.
- lmm 1 year ago> Give up and ask people to consistenly use a ascii-only name?
> Officially change my name?
Yes. That's the only one that's going to actually work. You can go on about how these systems ought to work until the cows come home, and I'm sure plenty of people on HN will, but if you actually want to get on with your life and avoid problems, legally change your name to one that's short and ASCII-only.
- em-bee 1 year agoa friend of mine in china had a character in his name that was not in the recognized set of characters. he refused to change his name and instead submitted the character to be added to unicode (which i believe eventually happened)
in the meantime he was unable to own the company he founded (instead made his wife the owner), had a national ID card with a different character, and i am not sure if he had a bank account, but i think the bank didn't care because laws that enforced the names to match the passport/ID only came later. i don't know how the ID didn't automatically imply a name change, but the IDs were issued automatically and maybe he filed a complaint about his name being wrong.
- aden1ne 1 year agoName changes are only permitted in a very narrow set of conditions in my place of residence. And this would not be one of them. And I imagine that's the case in many nations.
- taejo 1 year agoAnd then never move to Japan (or any other country where names are expected not to have Latin letters in them)
- lmm 1 year agoOr rather, if you move countries, change your name to one that fits. It's pretty normal and really not that hard.
- enderstenders 1 year agoWe clearly need to phase out name-based identification within software. "What's your name?" should never be a question heard from workers as any means of locating one's official identity in any system.
Some form of biometrics to pull up an ID in a globally agreed-upon system is certainly the way forward. Whether or not it is close to what a final solution should be, World ID is making some effort into solving global identification problems https://worldcoin.org/world-id
- lupire 1 year ago"global identification" and "final solution" dot sit well together.
- GardenLetter27 1 year agoOr just standardise the alphabet...
- _v7gu 1 year agoCan ü be printed on a passport rather than a u? I have a ş and a ç so I have been successfully substituting s and c for them in a somewhat consistent manner.
- mschuster91 1 year agoOn the human-readable zone ("VIZ" in ICAO 9303) yes, see part 3 section 3.1 [1]. The MRZ however, not - it is limited to Latin alphanumeric only, see section 4.3. How to transliterate non-Latin characters is left to the discretion of the issuing government, and that has been a consistent source of annoyances for people who have identity cards issued by different governments (e.g. dual-nationals of Western European and Turkish, Arabic or Cyrillic-using Slavic countries).
[1] https://www.icao.int/publications/Documents/9303_p3_cons_en....
- Tabular-Iceberg 1 year agoWhat’s the difference between the ë and ü diacritics? I would assume, like the French, that the two are interchangeable.
- littlestymaar 1 year agoSee this post [1] somewhere else in the comments.
- gsich 1 year agoPassports have an entry like "corresponds to ..." for that.
- benhurmarcel 1 year agoWhen my child was born, one of the requirements I had when choosing his name was that it shouldn't have any accent (or any character that's not in the 26 universal letters, basically).
- em-bee 1 year agowho made this requirement? in which country?
- d1sxeyes 1 year agoBased on OP's comment history, he's Belgian or lives in Belgium. Seems that there's no such requirement in Belgium (https://be.brussels/en/identity-nationality/children/birth-f...) and in many countries I know that ü is explicitly allowed.
Potentially OP is talking about a set of requirements he imposed on himself?
Edit: or maybe France? Either way, it's free choice still theoretically. https://en.wikipedia.org/wiki/Naming_law#:~:text=Since%20199....
- benhurmarcel 1 year agoSorry for the confusion, it’s just a requirement I had for myself, to make my child’s life a little easier
- ale42 1 year agoIsn't it the OP him/herself? Maybe they just wanted to prevent the issue for their child...
- userbinator 1 year agoEveryone's name should just be a GUID. /s
- BuyMyBitcoins 1 year agoFalsehoods Programmers Believe About Names, #41 - People have GUIDs.
https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-...
- weinzierl 1 year agoThis article is about a failure to do normalization properly and is not really about an issue with Unicode. Regardless what some comments seem to allude to, an Umlaut-ü should always render exactly the same, no matter how it is encoded.
There is, however, a real ü/ü conundrum, regarding ü-Umlaut and ü-diaeresis. The ü's in the words Müll and aigüe should render differently: as typically rendered, the dots in the French word sit too close to the letter, whereas in printed French material this is usually not the case.
Unfortunately Unicode does not capture the nuance of the semantic difference between an Umlaut and a Tréma or Diaeresis.
The Umlaut is a letter in its own right with its own space in the alphabet. An ü-Umlaut can never be replaced by an u alone. This would be just as wrong as replacing a p by a q. Just because they look similar does not mean they are interchangeable. [1]
The Tréma on the other hand, is a modifier that helps with proper pronunciation of letter combinations. It is not a letter in its own right, just additional information. It can sometimes move over other adjacent letters (aiguë=aigüe, both are possible) too.
Some say this should be handled by the rendering system similar to Han-Unification, but I strongly disagree with this. French words are often used in German and vice versa. Currently there is no way to render a German loan word with Umlaut (e.g. führer) properly in French.
[1] The only acceptable replacement for ü-Umlaut is the combination ue.
- noodlesUK 1 year agoOne thing that is very unintuitive with normalization is that MacOS is much more aggressive with normalizing Unicode than Windows or Linux distros. Even if you copy and paste non-normalized text into a text box in safari on Mac, it will be normalized before it gets posted to the server. This leads to strange issues with string matching.
- ttepasse 1 year agoUnfun normalisation fact: You can’t have a file named "ss" and a file named "ß" in the same folder in Mac OS.
- nradov 1 year agoThere are people with the surname "Con" and it's impossible to create a file with that name in MS Windows.
https://learn.microsoft.com/en-us/windows/win32/fileio/namin...
- bawolff 1 year agoThat's less a normal form issue and more a case-insensitivity issue. You also can't have a file named "a" and one named "A" in the same folder.
- samatman 1 year agoThat would be true if the test strings were "SS" and "ß", because although "ẞ" is a valid capitalization of "ß", it's officially a newcomer. It's more of a hybrid issue: it appears that APFS uses uppercasing for case-insensitive comparison, and also uppercases "ß" to "SS", not "ẞ". This is the default casing, Unicode also defines a "tailored casing" which doesn't have this property.
So it isn't per se normalization, but it's not not normalization either. In any case (heh) it's a weird thing that probably shouldn't happen. Worth noting that APFS doesn't normalize file names, but normalization happens higher up in the toolchain, this has made some things better and others worse.
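Python's default (untailored) casing shows the same behavior, for what it's worth; a quick sketch:

  print("ß".upper())     # 'SS': default Unicode uppercasing, not 'ẞ'
  print("ß".casefold())  # 'ss'
  # So any filesystem that compares names by their uppercased form
  # will see 'ß' and 'ss' collide:
  print("ß".upper() == "ss".upper())  # True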
- FinnKuhn 1 year agoThat would only explain why "ß" and "ẞ" can't both be files in the same folder. "ß" and "ss" are different letters, just like "u" and "ue" for example.
- eropple 1 year agoThis shows up in other places, too. One of my Slacks has a textji of `groß`, because I enjoy making our German speakers' teeth grind, but you sure can just type `:gross:` to get it.
- thaumasiotes 1 year ago> a textji
This is a weird formation; "ji" means text. It's half of the half of "emoji" that means text: 絵文字, 絵 [e, "picture"] 文字 [moji, "character", from 文 "text" + 字 "character"].
- yxhuvud 1 year agoSo what happens if someone puts those two in a git repo and a Mac user checks out the folder?
- staplung 1 year ago
  git clone https://github.com/ghurley/encodingtest
  Cloning into 'encodingtest'...
  remote: Enumerating objects: 9, done.
  remote: Counting objects: 100% (9/9), done.
  remote: Compressing objects: 100% (5/5), done.
  remote: Total 9 (delta 1), reused 0 (delta 0), pack-reused 0
  Receiving objects: 100% (9/9), done.
  Resolving deltas: 100% (1/1), done.
  warning: the following paths have collided (e.g. case-sensitive paths
  on a case-insensitive filesystem) and only one from the same colliding
  group is in the working tree:
    'ss'
    'ß'
- tetromino_ 1 year agoEEXIST
- codesnik 1 year agoI was really surprised when I realized that at least in HFS+, Cyrillic is normalized too. For example, no Russian ever thinks that Й is И with some diacritics. It's a different letter in its own right. But the Mac normalizes it into two codepoints.
- asveikau 1 year agoI dislike explaining string compares to monolingual English speakers who are programmers. Similar to this phenomenon of Й/И is people who think ñ and n should compare equally, or ç and c, or that the lowercase of I is always i (or that case conversion is locale-independent).
In something like a code review, people will think you're insane for pointing out that this type of assumption might not hold. Actually, come to think of it, explaining localization bugs at all is a tough task in general.
- yxhuvud 1 year agoOr that sort order is locale independent. Swedish is a good example here: åäö are sorted at the end, and until 2006 w was sorted as v. Then it changed, and w is now considered a letter of its own.
- iforgotpassword 1 year agoWell, I do like this behavior for search though. I don't want to install a new keyboard layout just to be able to search for a Spanish word.
- koliber 1 year agoThese are different letters for people who speak the language and treating them the same in some usage seems weird.
At the same time, sometimes words containing those letters might show up in context where the user is not familiar with that language. Such users might not know how to enter those letters. They might not even have the capability to type those letters with their installed keyboard layouts. If they are searching for content that contains such letters (e.g. a first name), normalizing them to the visually-closest ASCII is a sensible choice, even if it makes no sense to the speakers of the language.
It's important to understand a situation from different perspectives.
It's not about coming up with a single correct interpretation that makes logical sense. It's about making a system work in the least surprising ways for all classes of users.
- makeitdouble 1 year agoThe general reaction I've seen until now was "meh, we have to make compromises (don't make me rewrite this for people I'll probably never meet)"
Diacritics exacerbate this so much, as they can be shared between two languages yet have different rules/handling. French typically has a decent amount of them and they're meaningful, but traditionally ignores them for comparison (in the dictionary for instance). That makes it more difficult for a dev to have an intuitive feeling of where it matters and where it doesn't.
- bawolff 1 year agoNormalization isn't based on what language the text is.
NFC just means never use combining characters if possible, and NFD means always use combining characters if possible. It has nothing to do with whether something is a "real" letter in a specific language or not.
Whether or not something is a "real" letter vs a letter with a modifier comes into play more in the Unicode collation algorithm, which is a separate thing.
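codesnik's Й example, run through Python's unicodedata, illustrates the point; a small sketch:

  import unicodedata

  short_i = "\u0419"  # Й, CYRILLIC CAPITAL LETTER SHORT I

  nfd = unicodedata.normalize("NFD", short_i)
  print([f"U+{ord(c):04X}" for c in nfd])
  # ['U+0418', 'U+0306']: И plus COMBINING BREVE, even though Russian
  # treats Й as a letter in its own right
  print(unicodedata.normalize("NFC", nfd) == short_i)  # True: it round-trips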
- anamexis 1 year agoWell, there's no expectation in unicode that something viewed as a letter in its own right should use a single codepoint.
- sorenjan 1 year agoI sometimes see texts where ä is rendered as a¨, i.e. with the dots next to the a instead of above it even though it's a completely different letter and not a version of a. I managed to track the issue down to MacOS' normalization, but it has happened on big national newspapers' websites and similar. I haven't seen it in a while, maybe Firefox on Windows renders it better or maybe various publishing tools have fixed it. It looks really unprofessional which is a bit strange since I thought Apple prides themselves on their typography.
- aidos 1 year agoI have never seen that in all my years on a Mac (though admittedly I'm not dealing in languages where I encounter it often). I'm assuming there's an issue with the GPOS table in the font you're using, so the dots aren't negatively shifted into position as they should be?
- sorenjan 1 year agoWell the point is that ä is one character, not two. It shouldn't be "a with two dots on it", it should be ä. It's its own letter with its own key on Swedish keyboards. MacOS apparently normalizes it to be two characters, and then somewhere in the publishing chain it gets mangled and end up as a¨. I have no doubt that it looked ok on the author's Mac.
It's been a while since I last saw it, but it wasn't because of the font since it was published on a Swedish newspaper's website and other texts worked fine.
- iforgotpassword 1 year agoI have that in gnome terminal. The dots always end up on the letter after, not before. At least makes it easy to spot filenames in decomposed form so I can fix them.
- ogurechny 1 year agoSome old system fonts or old character rasterization engines had problems with certain diacritics, like breve, and they were moved to the space between or after characters. Some Wikipedia articles simply mention that
> Characters may not combine well on some computers.
It was easy to detect people typing or editing text on Apple devices because “their” characters appeared broken, unlike usual single codepoints.
- dathinab 1 year agoWhile this (probably) still applies to Apple UI elements, when they switched to APFS they stopped doing Unicode normalization at the filesystem level.
So now on macOS you can have a very mixed bag, with some programs normalizing, some not (it's a bug) and many expecting normalized file names.
So it's kinda like Linux now, except a lot of devs assume normalization is happening (and in some cases it still is, when the string passes through certain APIs).
Worse, with normalization now being somewhat application/framework dependent and often going beyond basic Unicode normalization, it can lead to quite not-so-funny bugs.
But luckily most users will never run into any of these bugs, even if they use characters which might need normalization.
- yxhuvud 1 year agoOn the other hand, stuff written on Macs is a lot more likely to require normalization in the first place.
- creshal 1 year agoMacOS creates so many normalization problems in mixed environments that it's not even funny any more. No common server-side CMS etc. can deal with it, so the more Macs you add to an organization, the more problems you get with inconsistent normalization in your content. (And indeed, CMSes shouldn't have to second-guess users' intentions - diacritics and umlauts are pronounced differently and I should be able to encode that difference, e.g. to better cue TTS.)
And, of course, the Apple fanboys will just shrug and suggest you also convert the rest of the organization to Apple devices, after all, if Apple made a choice, it can't be wrong.
- fauigerzigerk 1 year agoI'm not sure I understand. On the one hand you seem to be saying that users should be able to choose which normalisation form to use (not sure why). On the other hand you're unhappy about macOS sending NFD.
If it's a user choice then CMSs have to be able to deal with all normalisation forms anyway and shouldn't care one bit whether macOS sends NFD or NFC. Mac users could of course complain about their choice not being honoured by macOS but that's of no concern to CMSs.
- creshal 1 year ago> On the other hand you're unhappy about macOS sending NFD.
Because MacOS always uses it, regardless of the user's intention, so it decomposes umlauts into diaereses (despite them having different meanings and pronunciations) and mangles cyrillic, and probably more problems I haven't yet run into.
- cozzyd 1 year agoFor maximum pain, they should start populating folders with .DS_STÖRE
- eviks 1 year agoBut store decomposed form on Tuesdays!
- zh3 1 year agoSuspect you're getting downvoted because of the last sentence. However, I do sympathise with MacOS tending to mangle standard (even plain ASCII) text in a way that adds to the workload for users of other OS's.
- creshal 1 year agoIt adds to the workload of everyone, including the Apple users. The latter ones are just in denial about it.
- jesprenj 1 year agoShould you really change the filenames of users' files and depend on the fact that they are valid UTF-8? Wouldn't it be better to keep the original filename and use that most of the time, except for searching and indexing?
Why not normalize Latin-alphabet filenames for indexing even further -- allow searching for "Führer" with queries like "Fuehrer" and "Fuhrer"?
- zeroCalories 1 year agoI generally agree that you shouldn't change the file name, but in reality I bet OP stored it as another column in a database.
For more aggressive normalization like that, I think it makes more sense to implement something like a spell checker that suggests similar files.
- josephcsible 1 year agoIMO, it was a mistake for Unicode to provide multiple ways to represent 100% identical-looking characters. After all, ASCII doesn't have separate "c"s for "hard c" and "soft c".
- renhanxue 1 year agoThe problem in the linked article barely scratches the surface of the issue. You _cannot_ compare Unicode strings for equality (or sort them) without locale information. A simple example: to a Swedish or Finnish speaker, o and ö are completely different letters, as distinct as a is from b, and ö sorts at the very end of the alphabet. A user that searches for ö will definitely not expect words with o to appear. However, an American user that searches for "cooperation", when your text data happens to include writings by people who write like in The New Yorker, would probably expect to find "coöperation".
This rabbit hole goes very, very deep. In Dutch, the digraph IJ is a single letter. In Swedish, V and W are considered the same letter for most purposes (watch out, people who are using the MySQL default utf8_swedish_ci collation). The Turkish dotless i (ı) in its lowercase form uppercases to a normal I, which then does _not_ lowercase back to a dotless i if you're just lowercasing naively without locale info. In Danish, the digraph aa is an alternate way of writing å (which sorts near the end of the alphabet). Hungarian has a whole bunch of bizarre di- and trigraphs IIRC. Try looking up the standard Unicode algorithm for doing case insensitive equality comparison by the way; it's one heck of a thing.
People somehow think that issues like these are only an issue with Han unification or something, but it's all over European languages as well. Comparing strings for equality is a deeply political issue.
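The Swedish/German difference is easy to demonstrate with Python's locale module; a sketch, assuming the sv_SE and de_DE locales are installed on the system:

  import locale

  words = ["ask", "ost", "öl"]

  # Swedish: ö is a distinct letter sorting at the very end of the alphabet.
  locale.setlocale(locale.LC_COLLATE, "sv_SE.UTF-8")
  print(sorted(words, key=locale.strxfrm))  # ['ask', 'ost', 'öl']

  # German: ö sorts alongside o.
  locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
  print(sorted(words, key=locale.strxfrm))  # ['ask', 'öl', 'ost']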
- bencelaszlo 1 year ago> Hungarian has a whole bunch of bizarre di- and trigraphs IIRC
Actually, there is only one trigraph: "dzs", almost exclusively used for representing "j" from English and other alphabets. For example, "Jennifer" is "Dzsennifer" in Hungarian, and "jam" is "dzsem" in the same way.
The trigraph and digraphs actually make sense, at least to a native, as they really mark sounds similar to what you would expect from combining the given graphs. These letters don't cause too many issues in search in my opinion, but hyphenation is a form of art (see "magyar.ldf" for LaTeX as an example).
To complicate the situation even further we have a/á, e/é, i/í, o/ó/ö/ő and u/ú/ü/ű letters, all of them considered separate ones, and you can easily type them on a Hungarian desktop keyboard. On the other hand, mobile virtual keyboards usually show a QWERTY/QWERTZ layout where you can only find "long vowels" by long-pressing their "short" counterparts, so when you are targeting mobile users you maybe want to differentiate between "o" and "ö", but not between "o" and "ó" nor between "ö" and "ő".
- rat87 1 year agoThat doesn't seem that strange. Russian and I think Ukrainian (and maybe some other languages that use Cyrillic) have Дж as the closest thing to English J. Д is d and ж is transliterated as zh. Sometimes names are transliterated with dzh instead of j.
- josephcsible 1 year ago> to an American, a user that searches for "cooperation" when your text data happens to include writings by people who write like in The New Yorker, would probably to expect to find "coöperation".
Unicode shouldn't be responsible for making such searches work, just like it's not responsible for making searches for "analyze" match text that says "analyse".
- renhanxue 1 year agoMy point was simply that the fact that there are multiple representations of characters that look the same is just a tiny part of the complexity involved in making text behave like users want. It's not that uncommon for people to think that "oh I'll just normalize the string and that'll solve my problems", but normalization is just a small part of quote-unquote "proper" Unicode handling.
The "proper" way of sorting and comparing Unicode strings is part of the standard; it's called the Unicode Collation Algorithm (https://unicode.org/reports/tr10/). It is unwieldy to say the least, but it is tuneable (see the "Tailoring" part) and can be used to implement o/ö equivalence if desired. I think it's great that this algorithm (and its accompanying Common Locale Data Repository) is in the standard and maintained by the consortium, because I definitely wouldn't want to maintain those myself.
- fhars 1 year agoUnicode was never designed for ease of use or efficiency of encoding, but for ease of adoption. And that meant that it had to support lossless round trips from any legacy format to Unicode and back to the legacy format, because otherwise no decision maker would have allowed to start a transition to Unicode for important systems.
So now we are saddled with an encoding that has to be bug compatible with any encoding ever designed before.
- striking 1 year agoIf you take a peek at an extended ASCII table (like the one at https://www.ascii-code.com/), you'll notice that 0xC5 specifies a precomposed capital A with ring above. It predates Unicode. Accepting that that's the case, and acknowledging that forward compatibility from ASCII to Unicode is a good thing (so we don't have any more encodings, we're just extending the most popular one), and understanding that you're going to have the ring-above diacritic in Unicode anyway... you kind of just end up with both representations.
- arp242 1 year agoEverything can just be pre-composed; Unicode doesn't need composing characters.
There's history here, with Unicode originally having just 65k characters, and hindsight is always 20/20, but I do wish there was a move towards deprecating all of this in favour of always using pre-composed.
Also: what you linked isn't "ASCII" and "extended ASCII" doesn't really mean anything. ASCII is a 7-bit character set with 128 characters, and there are dozens, if not hundreds, of 8-bit character sets with 256 characters. Both CP-1252 and ISO-8859-1 saw wide use for Latin alphabet text, but others saw wide use for text in other scripts. So if you give me a document and tell me "this is extended ASCII" then I still don't know how to read it and will have to trial-and-error it.
I don't think Unicode after U+007F is compatible with any specific character set? To be honest I never checked, and I don't see in what case that would be convenient. UTF-8 is only compatible with ASCII, not any specific "extended ASCII".
- adrian_b 1 year agoIn my opinion, only the reverse could be true, i.e. that Unicode does not need pre-composed characters because everything can be written with composing characters.
The pre-composed characters are necessary only for backwards compatibility.
It is completely unrealistic to expect that Unicode will ever provide all the pre-composed characters that have ever been used in the past or which will ever be desired in the future.
There are pre-composed characters that do not exist in Unicode because they have been very seldom used. Some of them may even be unused in any language right now, but they have been used in some languages in the past, e.g. in the 19th century, but then they have been replaced by orthographic reforms. Nevertheless, when you digitize and OCR some old book, you may want to keep its text as it was written originally, so you want the missing composed characters.
Another case that I have encountered where I needed composed characters not existing in Unicode was when choosing a more consistent transliteration for languages that do not use the Latin alphabet. Many such languages use quite bad transliteration systems, precisely because whoever designed them has attempted to use only whatever restricted character set was available at that time. By choosing appropriate composing characters it is possible to design improved transliterations.
- zokier 1 year agoFor roundtripping e.g. https://en.wikipedia.org/wiki/VSCII you do need both composing characters and precomposed characters.
- kps 1 year ago> I don't think Unicode after U+007F is compatible with any specific character set?
The ‘early’ Unicode alphabetic code blocks came from ISO 8859 encodings¹, e.g. the Unicode Cyrillic block follows ISO 8859-5, the Greek and Coptic block follows ISO 8859-7, etc.
- bandrami 1 year ago> Unicode doesn't need composing characters
But it does, IIRC, for both Bengali and Telugu.
- pavel_lishin 1 year agoIt might not be ludicrous to suggest that the English letter "a" and the Russian letter "а" should be a single entity, if you don't think about it very hard.
But the English letter "c" and the Russian letter "с" are completely different characters, even if at a glance they look the same - they make completely different sounds, and are different letters. It would be ludicrous to suggest that they should share a single symbol.
- josephcsible 1 year agoIf they're always supposed to look the same, then Unicode should encode them the same, even if they mean different things in different contexts.
- pavel_lishin 1 year agoTwo counterpoints:
1. Unicode isn't a method of storing pixel or graphic representations of writing systems; it's meant to store text, regardless of how similar certain characters look.
2. What do you do about screen readers & the like? If it encounters something that looks like a little half-moon glyph that's in the middle of a sentence about foreign alphabets that reads "Por ejemplo, la letra 'c'", should it pronounce it as the English "see" or as Russian "ess"?
- Joker_vD 1 year agoWhat about Latin "k" and Cyrillic "к"? Do they look the same in your font of choice? Should they?
- mzs 1 year agoC vs С is so strange to me. They look the same upper and lower case, italic, cursive, even are at the same location on keyboards. It's not like W is a different character in Slavic languages that use latin script even though the sound is completely different in English.
- rat87 1 year agoI was thinking of Russian letter г and Ukrainian letter г.
Or the whole eh/ye flip En/UK/Ru Eh/е/э Ye/є/е
г/е are unified and that's probably as it should be but there are downsides.
- bawolff 1 year agoMaybe, but then you can no longer round trip with other encodings, which seems worse to me.
- layer8 1 year agoThe more general solution is specified here: https://unicode.org/reports/tr10/#Searching
- bawolff 1 year agoCollation and normal forms are totally different things with different purposes and goals.
Edit: reread the article. My comment is silly. UCA is the correct solution to the author's problem.
- blablabla123 1 year agoAs a German macOS user with a US keyboard I run into a related issue every now and then. What's nice about macOS is that I can easily combine Umlaute, but also other common letters from European languages, without any extra configuration. But some (web) applications stumble over it while typing, because the input arrives as: 1. ¨ (Option-u) 2. ü (u pressed)
- kps 1 year agoEarly on, Netscape effectively exposed Windows keyboard events directly to Javascript, and browsers on other platforms were forced to try to emulate Windows events, which is necessarily imperfect given different underlying input systems. “These features were never formally specified and the current browser implementations vary in significant ways. The large amount of legacy content, including script libraries, that relies upon detecting the user agent and acting accordingly means that any attempt to formalize these legacy attributes and events would risk breaking as much content as it would fix or enable. Additionally, these attributes are not suitable for international usage, nor do they address accessibility concerns.”
The current method is much better designed to avoid such problems, and has been supported by all major browsers for quite a while now (the laggard Safari arriving 7 years ago this Tuesday).
- chuckadams 1 year agoClearly the author already knows this, but it highlights the importance of always normalizing your input, and consistently using the same form instead of relying on the OS defaults.
- makeitdouble 1 year agoThe larger point is probably that search and comparison are inherently hard as what humans understand as equivalent isn't the same for the machine. Next stop will be upper case and lower case. Then different transcriptions of the same words in CJK.
- mckn1ght 1 year agoAlso, never trust user input. File names are user inputs. You can execute XSS attacks via filenames on an unsecured site.
- userbinator 1 year agoits[sic] 2024, and we are still grappling with Unicode character encoding problems
More like "because it's 2024." This wouldn't be a problem before the complexity of Unicode became prevalent.
- bornfreddy 1 year agoYou mean this wouldn't be a problem if we used the myriad different encodings like we did before Unicode, because we would probably not be able to even save the files anyway? So true.
- userbinator 1 year agoBefore Unicode, most systems were effectively "byte-transparent" and encoding was only a top-level concern. Those working in one language would use the appropriate encoding (likely CP1252 for most Latin languages) and there wouldn't be confusion about different bytes for same-looking characters.
- deathanatos 1 year agoA single user system, perhaps.
I've worked on a system that … well, didn't predate Unicode, but was sort of near the leading edge of it and was multi-system.
The database columns containing text were all byte arrays. And because the client (a Windows tool, but honestly Linux isn't any better off here) just took an LPCSTR or whatever, the bytes were just in whatever locale the client was in. But that was recorded nowhere, and of course, all the rows were in different locales.
I think that would be far more common, today, if Unicode had never come along.
- bawolff 1 year agoMy understanding is way back in the day, people would use ascii backspace to combine an ascii letter with an ascii accent character.
- TheRealPomax 1 year agoSHIFT-JIS and EUC would like a word.
- n2d4 1 year agoYou make it sound like non-English languages were invented in 2024
- mschuster91 1 year ago> This wouldn't be a problem before the complexity of Unicode became prevalent.
It was a problem even before then. It worked fine as long as you had countries that were composed of one dominant ethnicity that sharted upon how minorities and immigrants lived (they were just forced to use a transliterated name, which could be one hell of a lot of fun for multi-national or adopted people) - and even that wasn't enough to prevent issues. In Germany, for example, someone had to go up to the highest public-service courts in the late 70s [1] to have his name changed from Götz to Goetz because he was pissed off that computers were unable to store the ö, and he wanted to change his name rather than keep getting mis-named, but German bureaucracy does not like name changes outside of marriage and adoption.
[1] https://www.schweizer.eu//aktuelles/urteile/7304-bverwg-vom-...
- bawolff 1 year agoCombining characters go back to the 90s. The unicode normal forms were defined in the 90s. None of this is new at this point.
- _nalply 1 year agoSometimes it makes sense to reduce to Unicode confusables.
For example the Greek letter Big Alpha looks like uppercase A. Or some characters look very similar like the slash and the fraction slash. Yes, Unicode has separate scalar values for them.
There are Open Source tools to handle confusables.
This is in addition to the search specified by Unicode.
- wanderingstan 1 year agoI wrote such a library for Python here: https://github.com/wanderingstan/Confusables
My use case was to thwart spammers in our company’s channels, but I suppose it could be used to also normalize accent encoding issues.
Basically converts a phrase into a regular expression matching confusables.
E.g. "ℍ℮1೦" would match "Hello"
- _nalply 1 year agoInteresting.
What would you think about this approach: reduce each character to a standard form which is the same for all characters in the same confusable group? Then match all search input to this standard form.
This means "ℍ℮1l೦" is converted to "Hello" before searching, for example.
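A toy Python sketch of that approach; the mapping table here is hand-made and purely hypothetical, whereas a real implementation would derive it from Unicode's confusables.txt data:

  # Hypothetical, hand-made confusable map, for illustration only.
  CONFUSABLE_MAP = {"ℍ": "H", "℮": "e", "1": "l", "೦": "o"}

  def skeleton(s):
      # Map each character to its confusable group's standard form.
      return "".join(CONFUSABLE_MAP.get(c, c) for c in s)

  # Both spellings reduce to the same standard form, so they match:
  print(skeleton("ℍ℮1l೦"))                       # 'Hello'
  print(skeleton("ℍ℮1l೦") == skeleton("Hello"))  # True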
- wanderingstan 1 year agoIt’s been a long time since I wrote this, but I think the issue with that approach is the possibility of one character being confusable with more than one letter. I.e. there may not be a single correct form to reduce to.
- wyldfire 1 year ago> For example the Greek letter Big Alpha looks like uppercase A.
If they're truly drawn the same (are they?) then why have a distinct encoding?
- schoen 1 year agoOne argument would be that you can apply functions to change their case.
For example in Python:

  >>> "Ᾰ̓ΡΕΤΉ".lower()
  'ᾰ̓ρετή'
  >>> "AWESOME".lower()
  'awesome'

The Greek Α has lowercase form α, whereas the Roman A has lowercase form a.
Another argument would be that you want a distinct encoding in order to be able to sort properly. Suppose we used the same codepoint (U+0050) for everything that looked like P. Then Greek Ρόδος would sort before Greek Δήλος because Roman P is numerically prior to Greek Δ in Unicode, even though Ρ comes later than Δ in the Greek alphabet.
- mmoskal 1 year agoApparently this works very well, except for a single letter, Turkish I. Turkish has two versions of 'i', and the Unicode folks decided to use the Latin 'i' for lowercase dotted i, and Latin 'I' for uppercase dotless I (and have two new code points for uppercase dotted I and lowercase dotless I).
Now, 'I'.lower() depends on your locale.
A cause for a number of security exploits and lots of pain in regular expression engines.
edit: Well, apparently 'I'.lower() doesn't depend on locale (so it's incorrect for Turkish languages); in JS you have to do 'I'.toLocaleLowerCase('tr-TR'). Regexps support it in neither.
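Python has the same locale-blindness, plus the İ wart; a quick sketch:

  # Python's str.lower() is locale-independent, so it is wrong for Turkish:
  print("I".lower())  # 'i', where Turkish expects dotless 'ı'

  # And uppercase dotted İ (U+0130) lowercases to TWO codepoints:
  lowered = "\u0130".lower()
  print([f"U+{ord(c):04X}" for c in lowered])
  # ['U+0069', 'U+0307']: i plus COMBINING DOT ABOVE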
- ninkendo 1 year agoTo me, it depends on what you think Unicode’s priorities should be.
Let’s consider the opposite approach, that any letters that render the same should collapse to the same code point. What about Cherokee letter “go” (Ꭺ) versus the Latin A? What if they’re not precisely the same? Should lowercase l and capital I have the same encoding? What about the Roman numeral for 1 versus the letter I? Doesn’t it depend on the font too? How exactly do you draw the line?
If Unicode sets out to say “no two letters that render the same shall ever have different encodings”, all it takes is one counterexample to break software. And I don’t think we’d ever get everyone to agree on whether certain letters should be distinct or not. Look at Han unification (and how poorly it was received) for examples of this.
To me it’s much more sane to say that some written languages have visual overlap in their glyphs, and that’s to be expected, and if you want to prevent two similar looking strings from being confused with one another, you’re going to have to deploy an algorithm to de-dupe them. (Unicode even has an official list of this called “confusables”, devoted to helping you solve this.)
- layer8 1 year agoThey can be drawn the same, but when combining fonts (one latin, one greek), they might not. Or, put differently, you don’t want to require the latin and greek glyphs to be designed by the same font designer so that “A” is consistent with both.
There are more reasons:
– As a basic principle, Unicode uses separate encodings when the lower/upper case mappings differ. (The one exception, as far as I know, being the Turkish “I”.)
– Unicode was designed for round-trip compatibility with legacy encodings (which weren't legacy yet at the time). To that effect, a given script would often be added as a whole, in a contiguous block, to simplify transcoding.
– Unifying characters in that way would cause additional complications when sorting.
- andrewaylett 1 year agoIn some cases, because they have distinct encodings in a pre-Unicode character set.
Unicode wants to be able to represent any legacy encoding in a lossless manner. ISO8859-7 encodes Α and A to different code-points, and ISO8859-5 has А at yet another code point, so Unicode needs to give them different encodings too.
And, indeed, they are different letters -- as sibling comments point out, if you want to lowercase them then you wind up with α, a, and а, and that's not going to work very well if the capitals have the same encoding.
- michaelt 1 year agoUnicode's "Han Unification" https://en.wikipedia.org/wiki/Han_unification aimed to create a unified character set for the characters which are (approximately) identical between Chinese, Japanese, Korean and Vietnamese.
It turns out this is complex and controversial enough that the wikipedia page is pretty gigantic.
- samatman 1 year agoThe basic answer here is that Unicode exists to encode characters, or really, scripts and their characters. Not typefaces or fonts.
Consider broadcasting of text in Morse code. The Morse for the Cyrillic letter В is International Morse W.
In the early years of Unicode, conversion from disparate encodings to Unicode was an urgent priority. Insofar as possible, they wanted to preserve the collation properties of those encodings, so the characters were in the same order as the original encoding whenever they could be.
But it's more that Unicode encodes scripts, which have characters, it doesn't encode shapes. With 10,000 caveats to go with that, Unicode is messy and will preserve every mistake until the end of time. But encoding Α and A and А as three different letters, that they did on purpose, because they are three different letters, because they're a part of three different scripts.
- schoen 1 year agoIt occurs to me (after mentioning collation order, in a different part of this thread, as one reason that we would want to distinguish scripts) that it might be unclear even for collation purposes when scripts are or are not distinct, especially for Cyrillic, Latin, and Arabic scripts which are used to write many different languages which have often added their own extensions.
I guess the official answer is "attempt to distinguish everything that any language is known to distinguish, and then use locales to implement different collation orders by language", or something like that?
But it's still not totally obvious how one could make a principled decision about, say, whether the encoding of Persian and Urdu writing (obviously including their extensions) should be unified with the encoding of Arabic writing. One could argue that Nastaliq is like a "font"... or not...
- adzm 1 year ago> If they're truly drawn the same (are they?) then why have a distinct encoding?
They may be drawn the same or similar in some typefaces but not all.
- crote 1 year agoBecause some characters which look the same need to be treated differently depending on context. A 'toLowercase' function would convert Α->α, but A->a. That would be impossible if both variants had the same encoding.
- mgaunard 1 year agoBecause graphemes and glyphs are different things.
- hanche 1 year agoYou may be amused to learn about these, then:
U+2012 FIGURE DASH, U+2013 EN DASH and U+2212 MINUS SIGN all look exactly the same, as far as I can tell. But they have different semantics.
- layer8 1 year agoThey don’t necessarily look the same. The distinction is typographic, and only indirectly semantic.
Figure dash is defined to have the same width as a digit (for use in tabular output). Minus sign is defined to have the same width and vertical position as the plus sign. They may all three differ for typographic reasons.
- ahazred8ta 1 year agoIn Hawaiʻi, there's a constant struggle between the proper ʻokina, left single quote, and apostrophe.
- Havoc 1 year agoFor those intrigued by this sort of thing, check out the tech talk “Plain Text” by Dylan Beattie
Absolute gem. His other talks are entertaining too
- hanche 1 year agoHe seems to have done that talk several times. I watched the 2022 one. Time well spent!
- mawise 1 year agoI ran into this building search for a family tree project. I found out that Rails provides `ActiveSupport::Inflector.transliterate()` which I could use for normalization.
- anewhnaccount2 1 year agoReminded of this classic diveintomark post http://web.archive.org/web/20080209154953/http://diveintomar...
- CoastalCoder 1 year agoIsn't ü/ü-encoding a solved problem on Unix systems?
</joke>
- philkrylov 1 year agoThe article suggests using NFC normalization as a simple solution, but fails to mention that HFS+ always does NFD normalization to file names, and APFS kinda does not but some layer above it actually does (https://eclecticlight.co/2021/05/08/explainer-unicode-normal...), and ZFS has this behavior controlled by a dataset-level option. I don't see how applying its suggestion literally (just normalize to NFC before saving) can work.
- jph 1 year agoNormalizing can help with search. For example for Ruby I maintain this gem: https://rubygems.org/gems/sixarm_ruby_unaccent
- noname120 1 year agoWow the code[1] looks horrific!
Why not just do this: string → NFD → strip diacritics → NFC? See [2] for more.
[1] https://github.com/SixArm/sixarm_ruby_unaccent/blob/eb674a78...
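That pipeline is only a few lines of Python, for comparison; a sketch (note it drops every combining mark, which, as discussed elsewhere in the thread, is not always what a given language wants):

  import unicodedata

  def unaccent(s):
      # NFD splits precomposed characters into base + combining marks;
      # drop the marks (category Mn), then recompose with NFC.
      nfd = unicodedata.normalize("NFD", s)
      stripped = "".join(c for c in nfd if unicodedata.category(c) != "Mn")
      return unicodedata.normalize("NFC", stripped)

  print(unaccent("Jürgen Müller"))  # 'Jurgen Muller'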
- jph 1 year agoSure does look horrific. :-) That's because it's the same code from 2008, long before Ruby had the Unicode handlers. In fact it's the same code as for many other programming languages, all the way back to Perl in the mid-1990s. I didn't create it; I merely ported it from Perl to Ruby.
More important, the normalization does more than just diacritics. For example, it converts superscript 2 to ASCII 2. A better naming convention probably would have been "string normalize" or "searchable string" or some such, but the naming convention in 2012 was based on Perl.
- kazinator 1 year agoOh that Mötley Ünicöde.
- lxgr 1 year agoI'm aware of the "metal umlaut" meme, but as a German native speaker, I can't not read these in my head in a way that sounds much less Metal than probably intended :)
- 082349872349872 1 year ago> "When we finally went to Germany, the crowds were chanting, ‘Mutley Cruh! Mutley Cruh!’ We couldn’t figure out why the fuck they were doing that." —VNW
- Symbiote 1 year agoYears ago, an American metalhead was added to a group chat before she came to visit.
She was called Daniela, but she'd written it "Däniëlä". When my Swedish friend met her in person, having seen her name in the group chat, he said something like "Hej, Dayne-ee-lair right? How was the flight?".
- ooterness 1 year agoThe best metal umlauts are placed on a consonant (e.g., Spın̈al Tap). This makes it completely clear when it's there for aesthetics and not pronunciation.
- ginko 1 year agoI will always pronounce the umlaut in Motörhead. Lemmy brought that on himself.
- yxhuvud 1 year agoYes, those umlauts made it sound more like a fake french accent.
- 082349872349872 1 year agoIt can encode Spın̈al Tap, so it's all good.
- chuckadams 1 year agoOh sweet summer child, i̶̯͖̩̦̯͉͈͎͛̇͗̌͆̓̉̿̇̚͜͝͠ͅt̶̥̳͙̺̀͊͐͘ ̷̧͉̲̩̩̠̥̀̍̔͝c̸̢̛̙̦͙̠̱̖̠͆̆̄̈́͋͘ą̴̩̪̻̭̐́̒n̶̡̛̛̳̗̦͚̙̖͓̝̻̓̔̎̎̅̒͊ͅ ̵̰̞̰̺̠̲̯̤̠̹̯̩͚̥̗͌̓e̴̪̯̠͙̩̝͓̎́̋̈́̂̓̏̈͗͛̓̀̾͗͘n̶͕̗̣͙̺̰̠͐́͆̀́̌͑̔̊̚ĉ̴̗͔̼̦̟̰͐̌̂̅͋̄̄͘̕̚o̵̧͙̤͔̻̞̝̯̱̰̤̻̠̝̎͐̈́̈̐͆͑̃̀̏̂͝͠͝d̸͕̼̀̐̚ế̴̢̢̡̳͇̪̤͇͉̳̟̈̈̈́̎̀̋͆͊̃̓͛̈́͘ ̷̞̞̜̖͇̱̞͔̈́͋̈́̃̎̇̈͜͝ͅs̷̢̡͚͉͚̬̙̼̾̅̀̊̈́̏̇͘͜ö̸̥̠̲̞̪̦͚̞̝̦́̃̈́́̊͐̾̏̂͂̓̋͋̚͠ ̶̞̺̯̖͓̞͇̳͈̗͖̗̫̍̌̋̈͗̉͝͠m̶̳̥͔͔͚̈́̕̕̚͘͜͠u̵͚̓͗̔̐̽̍ċ̷̨̢̡̛̭͓̪͕̗̝̟͓̩͇͒̽͒͑̃́̇͌̊͊̄̈́͘͜h̶̳̮̟̃͂͛̑̚̚ ̵̢͉̣̲͇͕̈̈̍̕͘ͅm̴̱͙̜͔̋̐̅͗̋̈̀̌͛̈͘̕͠o̷̧̡̮̜͎͙̖̞͈̘̩̙͓̿̆̀̋͜r̶͙̗̯͎̎͛̌̈́̂̓̈̑̅̓͊̒̊̑̈ę̷͕͉̲̟̽̄͒̍͑̀̿̔̒̃̅̿́͘͝ͅ.̷̡̧̻̘̝̞̹̯̞͚̱̼͓̠͇̌̅͂.̷̧̫͙̮̞̳̼̤̪̖̦̟͕̏̐͑̾̈́̀̅͌̓.̵̧̛̛̖̥͔͍̲̲͉̺̩̪̭̋́̓̌͂̽̋̃̎͋͆͝͠ͅ
- raffy 1 year agoI created a bunch of Unicode tools during development of ENSIP-15 for ENS (Ethereum Name Service)
ENSIP-15 Specification: https://docs.ens.domains/ensip/15
ENS Normalization Tool: https://adraffy.github.io/ens-normalize.js/test/resolver.htm...
Browser Tests: https://adraffy.github.io/ens-normalize.js/test/report-nf.ht...
0-dependency JS Unicode 15.1 NFC/NFD Implementation [10KB] https://github.com/adraffy/ens-normalize.js/blob/main/dist/n...
Unicode Character Browser: https://adraffy.github.io/ens-normalize.js/test/chars.html
Unicode Emoji Browser: https://adraffy.github.io/ens-normalize.js/test/emoji.html
Unicode Confusables: https://adraffy.github.io/ens-normalize.js/test/confused.htm...
- WalterBright 1 year ago> Can you spot any difference between “blöb” and “blöb”?
That's where Unicode lost its way and went into a ditch. Identical glyphs should always have the same code point (or sequence of code points).
Imagine all the coding time spent trying to deal with this nonsense.
- euroderf 1 year agoA fine sentiment, but (FWIW) it goes into a ditch when dealing with CJK.
- WalterBright 1 year agoOne unique sequence per unique glyph takes care of all that.
- euroderf 1 year agoAh, but define "unique" after centuries of borrowing.
- ulrischa 1 year agoIt is really so awful that we have to deal with encoding issues in 2024.
- ComputerGuru 1 year agoZFS can be configured to force the use of a particular normalized Unicode form for all filenames. Amazing filesystem.
- NotYourLawyer 1 year agoASCII should be enough for anyone.
- zzo38computer 1 year agoASCII is good for a lot of stuff, but not for everything. Sometimes, other character sets/encodings will be better, but which one is better depends on the circumstances. (Unicode does have many problems, though. My opinion is that Unicode is no good.)
- hanche 1 year agoAnd who needs more than 640 kilobytes of memory anyhow?
- mckn1ght 1 year agoDon’t forget butterflies in case you need to edit some text.
- euroderf 1 year agoFilling the upper 128 characters with box-drawing characters was all well & fine, but you'd think IBM might've given some thought instead to defining a character set that would have maximum applicability for the set of all (Roman alphabet -descended) Western languages. (Plus pinyin.)
- earthboundkid 1 year agoThis isn’t an encoding problem. It’s a search problem.
- juujian 1 year agoI ran into encoding problems so many times, I just use ASCII aggressively now. There is still kanji, Hanzi, etc. but at least for Western alphabets, not worth the hassle.
- zzo38computer 1 year agoI also just use ASCII when possible; it is the most likely to work and to be portable. For some purposes, other character sets/encodings are better, but which ones are better depends on the specific case (not only what language of text but also the use of the text in the computer, etc).
- arp242 1 year agoThis works fine as a personal choice, but doesn't really work if you're writing something other random people interact with.
Even for just English it doesn't work all that well because it lacks things like the Euro which is fairly common (certainly in Europe), there are names with diacritics (including "native" names, e.g. in Ireland it's common), there are too many loanwords with diacritics, and ASCII has a somewhat limited set of punctuation.
There are some languages where this can sort of work (e.g. Indonesian can be fairly reliably written in just ASCII), although even there you will run into some of these issues. It certainly doesn't work for English, and even less for other Latin-based European languages.
- layer8 1 year agoThe article isn’t about non-Unicode encodings.
- juujian 1 year agoMeant to write ASCII
- keybored 1 year agoI try to avoid Unicode in filenames (I’m on Linux). It seems that a lot of normal users might have the same intuition as well? I get the sense that a lot will instinctually transcode to ASCII, like they do for URLs.
- zzo38computer 1 year agoI also try to avoid non-ASCII characters in file names (and I am also on Linux). I also like to avoid spaces and most punctuations in file names (if I need word separation I can use underscores or hyphens).
- skissane 1 year agoSometimes I wish they had disallowed spaces in file names.
Historically, many systems were very restrictive in what characters are allowed in file names. In part in reaction to that, Unix went to the other extreme, allowing any byte except NUL and slash.
I think that was a mistake - allowing C0 control characters in file names (bytes 0x01 thru 0x1F) serves no useful use case, it just creates the potential for bugs and security vulnerabilities. I wish they’d blocked them.
POSIX debated banning C0 controls, although appears to have settled on just a recommendation (not a mandate) that implementations disallow newline: https://www.austingroupbugs.net/view.php?id=251
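The check skissane wishes the filesystem enforced is at least trivial to do at the application layer; a sketch:

  def has_c0_controls(name):
      # C0 controls are U+0001..U+001F (NUL can't appear in a Unix name anyway).
      return any(0x01 <= ord(c) <= 0x1F for c in name)

  print(has_c0_controls("report.txt"))    # False
  print(has_c0_controls("report\n.txt"))  # True: a newline smuggled into a name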
- samatman 1 year agoI firmly agree that control characters, including tab and newline, should have been shown the door decades ago. All they do is make problems.
But spaces in filenames are really just an inconvenience at most for heavy terminal users, and are a natural thing to use for basically everyone else. All my markdown files are word-word-word.md, but all my WYSIWIG documents are "Word word word.doc".
The hassle of constantly explaining to angry civilians "why won't it let me write this file" would be worse than the hassle of having to quote or backslash-escape the occasional path in the shell.
- keybored 1 year agoI argue for using more Unicode instead of ASCII, and people disagree. I say that I use ASCII only in filenames (because filenames suck between platforms, and in general) and people downvote. :)