Understanding Surrogate Pairs: Why Some Windows Filenames Can't Be Read

85 points by feldrim 4 months ago | 62 comments
  • dwdz 4 months ago
    The script works just fine on real Linux: it creates 2048 files, and the ls command lists them all with distinct names.

        ls -l win32/
        total 0
        -rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\277\237''.exe'
        -rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\267\213''.exe'
        -rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\240\220''.exe'
        -rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\274\273''.exe'
        -rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\251\205''.exe'
        -rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\255\223''.exe'
        -rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\272\257''.exe'
        -rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\264\207''.exe'
        -rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\261\246''.exe'
        -rw-r--r-- 1 dawid dawid 0 Feb 26 12:13 ''$'\355\254\266''.exe'
        ...
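    A minimal sketch that reproduces this on Linux (mine, not the article's script; assuming Python 3): the `surrogatepass` error handler emits the ill-formed 3-byte UTF-8-style sequences for the 2048 lone surrogates, and the kernel stores them as raw filename bytes.

```python
import os
import tempfile

target = tempfile.mkdtemp()

# U+D800..U+DFFF are the 2048 surrogate code points. Strict UTF-8
# refuses to encode them; "surrogatepass" emits the ill-formed
# 3-byte sequences (0xED 0xA0..0xBF 0x80..0xBF) seen in the ls
# output above.
for cp in range(0xD800, 0xE000):
    name = chr(cp).encode("utf-8", errors="surrogatepass") + b".exe"
    with open(os.path.join(target.encode(), name), "wb"):
        pass

print(len(os.listdir(target)))  # 2048
```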
    • 4 months ago
      • 4 months ago
        • feldrim 4 months ago
          Oh, great. Can you also share the locale? I'll write another Postscriptum section then.
          • dwdz 4 months ago

                LANG=en_IE.UTF-8
                LANGUAGE=en_IE:en
                LC_CTYPE="en_IE.UTF-8"
                LC_NUMERIC="en_IE.UTF-8"
                LC_TIME="en_IE.UTF-8"
                LC_COLLATE="en_IE.UTF-8"
                LC_MONETARY="en_IE.UTF-8"
                LC_MESSAGES="en_IE.UTF-8"
                LC_PAPER="en_IE.UTF-8"
                LC_NAME="en_IE.UTF-8"
                LC_ADDRESS="en_IE.UTF-8"
                LC_TELEPHONE="en_IE.UTF-8"
                LC_MEASUREMENT="en_IE.UTF-8"
                LC_IDENTIFICATION="en_IE.UTF-8"
                LC_ALL=
            • rstuart4133 4 months ago
              Not the OP, but have you tried LANG=C and getting rid of the rest?
        • account42 4 months ago
          Falsehoods programmers believe about filenames #1: Filenames are text and can be represented in common text encodings.

          > Windows was an early adopter of Unicode, and its file APIs use UTF‑16 internally since Windows 2000

          Wrong. Windows uses WTF-16 [0] despite what the documentation says.

          [0] https://simonsapin.github.io/wtf-8/#ill-formed-utf-16
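          The practical difference is whether unpaired surrogates are allowed through. A quick illustration (a Python sketch, not Windows-specific): strict UTF-16 rejects a lone surrogate, while the `surrogatepass` handler keeps the raw 16-bit code unit, which is effectively the WTF-16 behavior.

```python
lone = "\ud800"  # an unpaired high surrogate

# Well-formed UTF-16 refuses to encode it...
try:
    lone.encode("utf-16-le")
except UnicodeEncodeError:
    print("strict UTF-16: rejected")

# ...while "surrogatepass" keeps the raw 16-bit code unit,
# which is effectively what WTF-16 permits.
units = lone.encode("utf-16-le", errors="surrogatepass")
print(units)  # b'\x00\xd8'
```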

          • ripe 4 months ago
            Thank you for posting the WTF-16 document. Very relevant for OP.

            It's an old problem that people on different OSes need to access the same filesystem, particularly Windows clients versus UNIXy clients.

            While UNIX filesystems traditionally accept any sequence of bytes except slash and NUL, and treat "." and ".." specially, Windows filesystems have had many additional restrictions on valid filenames, e.g., a list of "reserved names" that should not be used, short and long names, and case-insensitive names [1].

            The NetApp filer, a NAS storage appliance, runs a specialized OS called Data ONTAP, which maintains a special on-disk representation called WAFL to enable this. WAFL is famous for its copy-on-write representation for file contents. [3] But in practice, most people in the real world are affected by its treatment of filenames. It's useful to take a look at how it solves these problems.

            The NAS presents the same set of files both as an NFS volume for UNIXy clients, as well as a CIFS volume for Windows clients. It does this by enhancing the directory entries with additional information and configuration features to satisfy both requirements. For typical problems and what features they offer to solve them, see their documentation page about naming files and paths [2].

            [1] https://learn.microsoft.com/en-us/windows/win32/fileio/namin...

            [2] https://docs.netapp.com/us-en/ontap/nfs-admin/multi-byte-fil...

            [3] Dave Hitz, "File System Design for an NFS File Server Appliance", https://www.cs.princeton.edu/courses/archive/fall17/cos318/r...

            • sumtechguy 4 months ago
              If you want to subtly break a Windows install you can use setCaseSensitiveInfo with fsutil. It turns case sensitivity on or off for a directory. There is also a similar set of options for Samba shares, which comes with interesting tradeoffs for the speed of reading a directory listing.
              • p_ing 4 months ago
                NTFS has the same two restrictions that many UNIX file systems have, NUL and slash.

                The APIs, Win32 in the case of [1], have further restrictions. If you want you can use a different API/personality and write whatever value you'd like (sans NUL and /) provided said personality has no limits -- NTFS does no validation itself.

                In practice, Win32 being the default personality makes said plethora of restrictions true, but it isn't a "filesystem" limitation, rather an API restriction. A nuanced if unimportant difference.

              • zombot 4 months ago
                Microsoft never implements a standard, they only ever implement their own shit. Sometimes it's a close enough parody of a standard to fool superficial onlookers, but that's as close as you'll ever get.
                • formerly_proven 4 months ago
                  Java, NT, .NET, "wide" C and C++ and a few others from the same time frame ended up with WTF-16 because surrogate pairs didn't exist when they were designed. They were designed with UCS-2, which is a fixed-length encoding. Unicode 2.0 then extended that to be variable length (16/32-bit) using surrogate pairs and that's where all the systems come from which don't validate surrogate pairs.
                • nitwit005 4 months ago
                  In this case, they correctly met a standard, and the standard changed.

                  If you look at OS-X there are similar issues. The Apple File System is case insensitive for a particular Unicode version.

                  • FirmwareBurner 4 months ago
                    >Microsoft never implements a standard

                    Win32 ?

                    • p_ing 4 months ago
                      Op likely meant others' standards, not ones they create themselves. Sort of like AD, even though AD is objectively better than the mishmash of Kerb + LDAP + policy implementations out there.
                      • account42 4 months ago
                        They don't even implement that "standard" correctly according to their own documentation.
                    • chrismorgan 4 months ago
                      Nothing uses well-formed UTF-16. I don't think I know of a single piece of software or library that uses 16-bit code units and validates them.

                      In practice, “UTF-16” means “potentially-ill-formed UTF-16”. It’s that simple.

                      • layer8 4 months ago
                        Historically, this is because Windows NT used UCS-2 [0]. Unicode only moved to beyond 65536 characters, and introduced the concept of surrogate pairs, with Unicode 2.0 in 1996.

                        [0] https://www.unicode.org/faq/utf_bom.html#utf16-11

                      • feldrim 4 months ago
                        Hi all. OP here. I added a Postscriptum about the surrogate pairs and their status in Linux. I used WSL to access those files under Windows, and generated the same ones on Linux. You can see that behavior differs for the same file names:

                        1. On Windows, accessed by WSL

                        2. On Linux (WSL), using UTF-8 locale

                        3. On Linux (WSL), using POSIX locale

                        The difference is odd to me as a user. I'd like to know about the decisions behind these behaviors. If anyone has information, please let me know.

                        • formerly_proven 4 months ago
                          The Linux section just seems to be artifacts of the WSL hacks; it has nothing to do with how Linux filenames function. Those are simply bags of bytes: the encoding only matters for displaying them and isn't interpreted internally. ls failing to access the .exe is clearly a WSL filesystem issue and not a Linux / ls issue. You also can't set a UTF-16 locale, because that's not what a locale is. UTF-16/32 vs SBCS and UTF-8 is the wide/narrow character distinction, which is a whole separate thing: different ABIs, different APIs.
                          • feldrim 4 months ago
                            WSL-to-Windows, yes, it is due to translations. But within the WSL, I'm not sure. I'll try to replicate them on an Ubuntu VM for comparison.
                          • p_ing 4 months ago
                            Your footnote 2 applies to NTFS. The only characters NTFS does not allow are NUL and "/".

                            Beyond that, it is up to the API you're choosing to use to read the volume. Win32 has of course many more restrictions than POSIX would, but since Windows NT supports multiple personalities, you could still RW illegal Win32 characters under NT, e.g. with SFU.

                            • zombot 4 months ago
                              WSL is not Linux, despite whatever Microsoft says.
                              • p_ing 4 months ago
                                WSL is Linux -- it's an automatically managed VM with some special sauce for connectivity between the parent partition and guest.
                                • anonymfus 4 months ago
                                  WSL2 is what you describe, WSL1 is not.
                            • kzrdude 4 months ago
                              I remember that in Mac OS X times, sometime between OS X v10.1 and 10.4, a system upgrade caused a bunch of unicode named files to become inaccessible/untouchable (but still present with a directory listing). At the time I didn't have the skills to figure out what had happened. I'm still curious to know if it was an intended breaking change.
                              • kps 4 months ago
                                OS X (at the POSIX level) assumes UTF-8 and normalizes file names to decomposed form (NFD). If for example you `date >$'\xC3\xBC'` (i.e. ‘ü’), then the actual stored file name is `$'\x75\xCC\x88'` (i.e. ‘ü’ — assuming HN or my browser don't normalize!), and both `cat $'\xC3\xBC'` and `cat $'\x75\xCC\x88'` (or ‘ü’ or ‘ü’) work.
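                                The byte sequences above can be checked in a few lines (a sketch using Python's unicodedata):

```python
import unicodedata

nfc = "\u00fc"  # precomposed 'ü', one code point
nfd = unicodedata.normalize("NFD", nfc)  # 'u' + U+0308 COMBINING DIAERESIS

print(nfc.encode("utf-8"))  # b'\xc3\xbc'
print(nfd.encode("utf-8"))  # b'u\xcc\x88'
print(nfc == nfd)           # False: different code points, same rendering
```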
                                • kzrdude 4 months ago
                                  Unrelated to my question above, I think
                              • mofeien 4 months ago
                                Hi, thanks for the interesting submission!

                                I was a bit confused by the detour via utf-8 to arrive at the code points and had to look up UTF-8 encoding first to understand how they relate. Then I tried out the following

                                  candidate = chr(0xD800)
                                  candidate2 = bytes([0xED, 0xA0, 0x80]).decode('utf-8', errors='surrogatepass')
                                  print(candidate == candidate2) # True
                                
                                and it seems that you could just iterate over code points directly with the `chr()` function.
                                • feldrim 4 months ago
                                  If I remember correctly, I tried that, but to cover the exact range I needed (the high and low surrogates) I picked this way out of practicality. It was just easier.
                                • qingcharles 4 months ago
                                  Aha! Found the name of my next album. Try downloading me on Napster now!
                                  • somewhereoutth 4 months ago
                                    Ah, got caught by surrogate pairs recently: JavaScript sees them as 2 chars when e.g. slicing strings, so it is possible to end up with invalid strings if you chop between a pair.
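                                    The same footgun can be reproduced from Python by dropping to the UTF-16 code-unit level (a sketch; in JavaScript this is just slicing a string that holds one astral character):

```python
s = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP
units = s.encode("utf-16-le")
print(len(units) // 2)  # 2 code units: a surrogate pair

# Chopping between the pair leaves an unpaired high surrogate,
# which no longer decodes as valid UTF-16.
half = units[:2]
try:
    half.decode("utf-16-le")
except UnicodeDecodeError:
    print("invalid: lone high surrogate")
```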
                                    • n_plus_1_acc 4 months ago
                                      I think it's hilarious that the event viewer XML gets borked.
                                      • feldrim 4 months ago
                                        I am not 100% sure, but mmc.exe has not been updated for years and it must be relying on the WebBrowser control of Internet Explorer. Yes, IE is still alive in Windows.

                                        https://learn.microsoft.com/en-us/previous-versions/windows/...

                                        • account42 4 months ago
                                          And we should all be thankful for that. Just imagine if all those system tools were as "useful" as the modernized windows settings.
                                      • ooterness 4 months ago
                                        Why does the Windows filesystem allow filenames with invalid strings?

                                        It seems obvious that attempts to create files with such filenames ought to be blocked.

                                        • feldrim 4 months ago
                                          It's mentioned in a comment here that the existing restrictions come from the Windows APIs, and that NTFS itself does not check file names restrictively. Whether devs want to filter these out in the API is another story.
                                        • ge96 4 months ago
                                          I remember that a long time ago I accidentally put some symbol like ? in a folder name in Windows and had problems.
                                          • rob74 4 months ago
                                            Or, otherwise said: surrogate pairs are used in UTF-16 (which uses 16-bit code units, so a single unit can only represent 65,536 values) to encode Unicode characters whose code points can't fit in a single unit.
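                                            The split can be sketched as follows (per the UTF-16 encoding rule; the helper name is mine):

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    # Split a supplementary code point (>= 0x10000) into a
    # high/low surrogate pair, per the UTF-16 encoding rule.
    v = cp - 0x10000
    high = 0xD800 + (v >> 10)   # top 10 bits
    low = 0xDC00 + (v & 0x3FF)  # bottom 10 bits
    return high, low

# U+1F600 GRINNING FACE
print([hex(u) for u in to_surrogates(0x1F600)])  # ['0xd83d', '0xde00']
```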
                                            • feldrim 4 months ago
                                              Yep. The quirk here is that the surrogates, which are merely enablers for other characters, can be paired with each other in invalid ways. In the absence of the actual characters they should encode, they aren't enabling anything. One assumes there is validation, but it doesn't exist here.
                                              • layer8 4 months ago
                                                There is no validation on the file system level because file names in NTFS are sequences of arbitrary 16-bit values, similar to how on Unix file systems, file names are sequences of arbitrary 8-bit values. Arguably the situation on Unix is worse, because there the interpretation and validity depends on the current locale.
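                                                The "arbitrary 8-bit values" point is easy to demonstrate (a sketch; the str output shown assumes a UTF-8 filesystem encoding on Linux):

```python
import os
import tempfile

d = tempfile.mkdtemp()
raw = b"\xff\xfe.txt"  # not valid UTF-8 under any circumstances

with open(os.path.join(d.encode(), raw), "wb"):
    pass

# Listing with a bytes path returns the raw bytes untouched.
print(os.listdir(d.encode()))  # [b'\xff\xfe.txt']
# Listing with a str path smuggles the bytes through as lone
# surrogates via the "surrogateescape" error handler.
print(os.listdir(d))
```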
                                                • feldrim 4 months ago
                                                  Totally. These are design choices made by development teams. But as users, we "assume" all are readable until one day we learn that it does not work that way. Until I came across this issue, I assumed them to be all valid, renderable characters.
                                            • Devasta 4 months ago
                                              Stuff like this is why UTF and any attempt at trying to encode all characters is a mistake.

                                              The real solution is to force the entire world population to use the Rotokas language of Papua New Guinea.

                                              • rurban 4 months ago
                                                No, the real solution is to follow the Unicode security guidelines for identifiers. Esp. on Linux, where the silly garbage-in, garbage-out mantra doesn't fly with identifiers, because identifiers need to stay identifiable.

                                                Apple HFS+ did some things right. It did at least NFD. But Linux insanities brought them back to homoglyph attacks.

                                                • wheybags 4 months ago
                                                  I still think we should have forced everything into a 32-bit char, with no distinction between codepoints and grapheme clusters. One press on backspace removes one char. Address of char 7 is base+7x4. String length is byte length / 4. cat /dev/urandom is a valid string; it's the font's job to deal with unknown values, and if you just want to process the text you don't need to care. Everything about text processing becomes super easy, like in the old ASCII-only K&R C example code. I'm not 100% certain, but I don't think there's a widely used language that couldn't be represented by that.

                                                  Of course, you lose round trip ability with legacy encodings, which is why we have the mess that is unicode. Oh and silly things like unicode flag emojis wouldn't work, but honestly maybe that would be for the best. Oh well, it's too late now so I guess we just accept it.

                                                  • ianburrell 4 months ago
                                                    Grapheme clusters are locale dependent. Also, if you aren't allowing combining characters, then you are going to need lots of extra codepoints. In some languages, like Indian ones, vowels are combining characters. And there are languages where multiple code points produce a grapheme cluster, like Hangul syllables. You are going to need a lot more code points to represent all possible strings. Text processing is going to be much harder because there are a thousand different representations of a Hangul character.

                                                    Also, backspace is locale dependent. In some languages backspace removes the accent, which makes sense with combining characters, and in others it removes the whole character. Which is going to be fun when a whole syllable is a code point.

                                                    Languages are hard, there is no way to make them simple.
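                                                    The Hangul point is visible with Python's unicodedata (a sketch): one precomposed syllable decomposes into multiple jamo code points, and both forms denote the same text.

```python
import unicodedata

syllable = "\ud55c"  # '한' (U+D55C), a single precomposed code point
jamo = unicodedata.normalize("NFD", syllable)

print(len(syllable), len(jamo))     # 1 3
print([hex(ord(c)) for c in jamo])  # ['0x1112', '0x1161', '0x11ab']
```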

                                                    • jerf 4 months ago
                                                      For better or worse, thanks to emoji Zero Width Joiner support [1], we're well on our way to there being more than 4 billion potential Unicode "characters". 4 billion is only 32 bits and you start spending a few bits here on hair style and a few bits there on skin color and a few bits on "misc" and then allow arbitrary combinations of them into composite families [2] and you can burn through 32-bits fairly quickly.

                                                      I don't think we're there yet. I think if someone did make a complete list of "valid" emoji right now, which for the sake of argument I'll call "formally defined in the Unicode standard", it would even on an absolute scale look like we're a long ways away from a full 32-bits of valid combinations. But you have to think of this on the log scale because this is about "bits" and those four-person families are already quite a long ways along to a full 32 bits. It wouldn't take much more customization, or the formal addition of more people in a group, to get there.

                                                      And someone who knows more about Unicode than I do may be able to establish that there are already in the standard ways to get to more than 32 bits' worth of data in a single standardized glyph; I certainly wouldn't bet much against that already being true.

                                                      (Personally, I'll go with "worse". In hindsight, we should probably have frozen Unicode into the original Docomo (and the other phone company that had them) emoji necessary for interoperability, and then created the emoji as an extension into Unicode. It seems like it would be useful to "support Unicode" without having to come with the complete understanding of what is increasingly the most complicated "language" in Unicode; forget doing good Arabic rendering or trying to understand an ideographic language, the emojis blow all that complexity away now. But here we are.)

                                                      [1]: https://unicode.org/emoji/charts/emoji-zwj-sequences.html

                                                      [2]: https://www.unicode.org/reports/tr51/#Multi_Person_Groupings
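                                                      The ZWJ mechanics are visible in a single family emoji (a sketch): what renders as one glyph is several code points stitched together with U+200D.

```python
# man + ZWJ + woman + ZWJ + girl + ZWJ + boy: rendered as one glyph
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

print(len(family))             # 7 code points
print(family.count("\u200D"))  # 3 zero-width joiners
```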

                                                      • jerf 4 months ago
                                                        I forgot about "Zalgo". You can definitely get more than 4 billion glyphs, indeed a great deal more, by stacking on modifiers.

                                                        I can see an argument that that's not really a "valid" use case that we need to worry about too much, though. Emoticons are well on their way to having more possible fully legal, fully intended outputs than 32 bits could specify.

                                                      • extraduder_ire 4 months ago
                                                        >Oh and silly things like unicode flag emojis wouldn't work, but honestly maybe that would be for the best.

                                                        Why not? They're just two (or more) characters from a special set next to each other that a font may combine. (and some ad-hoc ZWJ sequences) I don't think windows even ships a font that does that by default.

                                                    • theiebrjfb 4 months ago
                                                      Yet another reason to use Linux everywhere. It is 2025 and Windows (and probably Mac) users have to deal with weird Unicode filesystem issues. Good luck putting Chinese characters or emoticons into filenames.

                                                      An ext4 filename has a maximum length of 255 bytes. That is the only legacy limit you have to deal with as a Linux user. And even that can be avoided by using more modern filesystems.

                                                      And we get filesystem level snapshots etc...

                                                      • layer8 4 months ago
                                                        You have the same, if not worse, issue on Linux with filenames that aren’t valid UTF-8 sequences. Not to mention that on Linux switching the locale may change the interpretation of filenames as characters, which isn’t the case with NTFS.
                                                        • kwertzzz 4 months ago
                                                          > Not to mention that on Linux switching the locale may change the interpretation of filenames as characters, which isn’t the case with NTFS.

                                                          If you change the locale to an uninstalled one, then yes. But if the locale is installed, then I don't see a problem.

                                                              echo $LANG
                                                              # output: en_US.UTF-8

                                                              touch fusée.txt
                                                              LANG=fr_FR.UTF-8 ls
                                                              # output: 'fus'$'\303\251''e.txt'

                                                              sudo locale-gen fr_FR.UTF-8
                                                              sudo update-locale
                                                              LANG=fr_FR.UTF-8 ls
                                                              # output: fusée.txt

                                                          Are you maybe using a non-UTF-8 locale?

                                                          • layer8 4 months ago
                                                            Yes, I mean locales like fr_FR.ISO-8859-15, ja_JP.SJIS or zh_CN.GBK.

                                                            While these probably aren’t used much anymore, it still means that your filenames can break just by setting an environment variable. Or issues like here: https://news.ycombinator.com/item?id=16992546

                                                        • feldrim 4 months ago
                                                          I see two points here. First, you did not read the article and did not see the footnote that these are valid in Linux as well.

                                                          Second, your comment shows you are lacking knowledge of Linux as well. Linux, as I have written in the footnote, accepts anything but 0x00 (null) and 0x2F (“/”) in file names. Other than that, all characters are valid in paths. If you consider these a problem, I'd like to remind you that the 2048 surrogates are a really small subset of the unrenderable combinations allowed in Linux.

                                                          Everyone is free to have their opinions, but before making bold claims, please do your due diligence.

                                                          • skissane 4 months ago
                                                            > In Linux, as I have written in the foot note, accepts anything but 0x00 (null) and 0x2F (“/”)

                                                            POSIX 2024 encourages (but doesn’t require) implementations to disallow newline in file names, returning EILSEQ if you try to create a new file or directory with a name containing a newline. Thus far Linux hasn’t adopted that recommendation, but I personally hope it does some day.

                                                            For backward compatibility, it would have to be a mount option. It could be done at VFS level so it applies to all filesystems.

                                                            Personally I would go even further and introduce a “require_sane_filenames” mount option, which would block you (at the VFS layer) from creating any file name containing invalid UTF-8 (including overlong sequences and UTF-8 encoded surrogates), C0 controls or (UTF-8 encoded) C1 controls.

                                                            Also I think it would be great if filesystems had a superblock bit that declared they only supported “sane filenames”. Then even accessing such a file would error because it would be a sign of filesystem corruption.
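                                                            A userspace sketch of that check (the function name and exact scope are mine; the real thing would live in the VFS):

```python
def is_sane_filename(raw: bytes) -> bool:
    # Reject names that aren't strict UTF-8 -- this also catches
    # overlong sequences and UTF-8-encoded surrogates -- or that
    # contain C0 controls, DEL, or C1 controls.
    try:
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return not any(ord(c) <= 0x1F or 0x7F <= ord(c) <= 0x9F for c in text)

print(is_sane_filename(b"report.txt"))        # True
print(is_sane_filename(b"\xed\xa0\x80.exe"))  # False: encoded lone surrogate
print(is_sane_filename(b"line\nbreak"))       # False: C0 control (newline)
```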

                                                            • feldrim 4 months ago
                                                              This I did not know. I know that ZFS has a "utf8only" option, but I'm not sure about others.