Colliding with the SHA prefix of Linux's initial Git commit
227 points by 2bluesc 6 months ago | 64 comments
- hn_throwaway_99 6 months agoGreat example of Hyrum's Law, https://www.hyrumslaw.com/.
Comments about SHA256 are irrelevant - you can misuse the prefix of a SHA256 hash just as easily. The issue is that people got used to human-readable hash prefixes of 10-12 characters as "unique" for all intents and purposes, despite the fact that there were never any uniqueness guarantees for prefixes and git has always handled collisions with short object IDs as ambiguous - it's just that it's so rare to happen in the real world that lots of script writers treated that "mostly unique" as a guarantee.
IMO support for short object IDs is a mistake, as is any behavior that "works this way 99.999% of the time, but hey developer don't forget you need to also code for that .001% edge case". I'm always just copying and pasting things around anyway, so it really doesn't make much difference to me if I'm copying 12 chars or 64.
- poincaredisk 6 months ago>so it really doesn't make much difference to me if I'm copying 12 chars or 64.
It doesn't have to be 12 characters. I often type short git hashes - maybe 5 characters - when jumping between commits.
- throw0101d 6 months agoFirst twelve and last twelve characters are the same:
* Via: https://news.ycombinator.com/item?id=38668893
    $ echo -n retr0id_662d970782071aa7a038dce6 | sha256sum
    307e0e71a409d2bf67e76c676d81bd0ff87ee228cd8f991714589d0564e6ea9a -
    $ echo -n retr0id_430d19a6c51814d895666635 | sha256sum
    307e0e71a4098e7fb7d72c86cd041a006181c6d8e29882b581d69d0564e6ea9a -
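A small Python sketch for checking the claim above without eyeballing the hex: it counts how many leading and trailing characters two digests share. The two digest strings are copied from the comment; the function names are made up for illustration.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Hex SHA-256 of a string, equivalent to `echo -n text | sha256sum`."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shared_affix_lengths(a: str, b: str) -> tuple[int, int]:
    """Return (common prefix length, common suffix length) of two strings."""
    prefix = 0
    for x, y in zip(a, b):
        if x != y:
            break
        prefix += 1
    suffix = 0
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        suffix += 1
    return prefix, suffix


# Digests as quoted in the comment above.
DIGEST_A = "307e0e71a409d2bf67e76c676d81bd0ff87ee228cd8f991714589d0564e6ea9a"
DIGEST_B = "307e0e71a4098e7fb7d72c86cd041a006181c6d8e29882b581d69d0564e6ea9a"

print(shared_affix_lengths(DIGEST_A, DIGEST_B))
```

To re-derive the digests rather than trust the pasted values, `sha256_hex("retr0id_662d970782071aa7a038dce6")` can be compared against `DIGEST_A`.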
- Retr0id 6 months agoOh hey, it's me. Shortly after then I did a write-up on how I crafted them, which was discussed here: https://news.ycombinator.com/item?id=38718314
- susam 6 months agoExcellent article. Thanks for resharing it here.
- Terr_ 6 months agoThere were some plans to migrate to SHA256, but somehow it still hasn't happened.
The practical upshot is a git commit hash is not enough to know you are distributing/executing the legitimate code, as opposed to a malicious doppelganger. This is particularly important for tools that rely on it for dependency management, local caches, etc.
- usr1106 6 months agoTFA has nothing to do with SHA-1 or SHA-1 collisions. It's about abbreviated hash values introduced for readability by humans. Now these values are used by auxiliary scripts. Which again has little to do with git proper. It's just what the kernel community writes into commit messages and what scripts they use to parse those messages.
- TheDong 6 months ago> The practical upshot is a git commit hash is not enough to know you are distributing/executing the legitimate code, as opposed to a malicious doppelganger
Really now? Mind if I challenge you?
I have on my machine a git repo with commit '75eb4e3b1369706a4dcd61cc80e49660ac341ea4'.
If you can give me a second git repo with such a commit containing different contents, I'll happily send you $10k USD, or donate it to a charity of your choice.
- sgjohnson 6 months ago> If you can give me a second git repo with such a commit containing different contents, I'll happily send you $10k USD, or donate it to a charity of your choice.
Calculating that SHA1 collision is going to be a bit more expensive than $10k, by a couple of orders of magnitude.
Finding it in the wild is improbable, but calculating it is definitely possible, and has been done before. http://shattered.io/
- chippiewill 6 months agoShattered didn't produce a collision for an arbitrary hash, it produced two documents with the same hash (which is a slightly easier problem, about 100,000x faster).
SHA1 is certainly insecure at this point, but not even close to trivially so.
- onedognight 6 months agoThis is not a full git commit hash collision. It has to do with a git note which only needs to match a 12-character prefix of the git commit.
- usr1106 6 months agoWhile you corrected one mistake you added a new one:)
Those are git trailers, see git-interpret-trailers(1).
git-notes(1) is something completely different and not used by the kernel.
- theamk 6 months agopeople don't really care, because current collision methods are mitigated.
Git does not actually use "sha1", despite what all the docs say, it uses "sha1dc", which is just like sha1 except for inputs which can cause collisions, in which case it either fails with clear error message or returns completely different value.
https://news.ycombinator.com/item?id=17825441
so don't worry, git hashes _are_ enough to know you are distributing/executing the legitimate code.
(not to mention you need a preimage attack to replace known commit, and this is not yet possible with sha1)
- akoboldfrying 6 months agoFrom your link:
>In this case "hash" will be the same as SHA1(input) in all cases, except those where the input is detected to be malicious (as in the SHAttered attack)
I don't see how this can be more than a fundamentally forward-incompatible sticking plaster over the problem. The problem isn't merely that "detecting maliciousness" seems fraught in itself (how does one infer intent reliably?) -- it's that today's SHA1DC() implementation can only detect and optionally correct today's known attacks, so each new attack necessitates a new, incompatible version of SHA1DC().
- theamk 6 months agoEach new _unrelated_ SHA1 attack will need an update in SHA1DC. But it has to be truly unrelated, as the collision detection method is fairly robust. I recommend reading the original "Counter-cryptanalysis" paper [0] for details on how attacks work and how they can be mitigated (there is certain internal state in SHA1 that is used in all known attacks). BTW, this paper has an interesting anecdote: apparently Flame malware had exploited MD5 collisions using novel unpublished attack method... and yet it was detected by collision detector (section 3.2). Another example is that SHA-mbles attack, published 3 years after "Counter-cryptanalysis", was detected as well, with no required code changes.
No, there is nothing "fundamentally incompatible" in the new SHA1DC method. After all, git came out in 2005, 11 years before SHA1 attacks were known, so it used regular SHA1. The collision detector was added in 2017 and nothing broke, because false positive chance is 2^-90 [1].
I have not heard of any new SHA1 collision results, but if they are based on no-difference differential paths, git has nothing to worry about. And if they are not, it may be possible to extend DC detector to seamlessly detect and prevent those attacks, and then only upgrade git clients, keeping backward and forward compatibility for data.
Of course there is always a chance that someone will come out with all-new SHA1 preimage attack that cannot be detected without high rate of false positive, so it's prudent to switch git to sha256. There is a lot of work being done: git's sha256 mode went out of beta in 2.42 (2023), but neither github nor gitlab support it.
But since the current state of git's sha is that there is nothing broken, and a git commit hash _is_ "enough to know you are distributing/executing the legitimate code, as opposed to a malicious doppelganger", there is no real pressure.
[0] https://marc-stevens.nl/research/papers/C13-S.pdf
[1] https://github.com/cr-marcstevens/sha1collisiondetection
- bawolff 6 months ago> The problem isn't merely that "detecting maliciousness" seems fraught in itself (how does one infer intent reliably?)
It's not detecting "intent"; it is detecting that the input is one that exploits the attack, which is extremely unlikely to happen by accident, so if you see it you can assume malice.
It might be a band-aid, and sha256 is certainly a much better solution, but it's more robust than it sounds at first glance (since it sounds crazy at first glance).
- Dylan16807 6 months agoSure it's plaster, but plaster can last a good while. Sufficiently new attacks don't come around all that often, and every hash has a risk of new attacks showing up.
- oefrha 6 months agoSwitching to SHA-256 and switching to longer substrings of hashes for identification are basically orthogonal problems. The former is hardly going to help with the latter, except in the "we already broke everything, so why not take the chance to break some more" sense.
- shakna 6 months agoCompatibility between remotes using one or the other hasn't arrived yet, and git doesn't want to break compatibility. But you can create SHA256 ones today. [0]
- bandrami 6 months agoThe hash space is atoms-in-the-universe range; this is a collision in a much, much smaller subset of that space
- dec0dedab0de 6 months agoI think multiple hashes is the way to go to avoid collisions. it can even be something simple like md5. the chances of finding a collision that matches two or more algorithms is near impossible. Obviously that doesn’t work for passwords, but for verifying that data hasn’t been tampered with, it works.
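A rough Python sketch of the "multiple hashes" idea above: record two digests of the same data and accept the data only if both still match. The pairing of SHA-256 with MD5 and the function names are illustrative assumptions, not something git or the kernel tooling does.

```python
import hashlib
import hmac


def fingerprint(data: bytes) -> tuple[str, str]:
    """Return (sha256, md5) hex digests of the same bytes."""
    return hashlib.sha256(data).hexdigest(), hashlib.md5(data).hexdigest()


def matches(data: bytes, expected: tuple[str, str]) -> bool:
    """True only if every recorded digest matches the data."""
    actual = fingerprint(data)
    return all(hmac.compare_digest(a, e) for a, e in zip(actual, expected))


original = b"some release tarball contents"
recorded = fingerprint(original)

print(matches(original, recorded))
print(matches(original + b"!", recorded))
```

An attacker would have to produce one input that collides under both functions at once, which is the property the comment is relying on.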
- philips 6 months agoRelatedly: Kees's keynote on Linux security from a month ago was great: https://www.youtube.com/watch?v=orO8czP5Bxw
- chrishill89 6 months agoPresumably the problem is that these tools only take the abbreviated hash into account. Not also the subject:
    <abbrev. hash> ("<subject>")
You also have another data point. You only need to search in the history from the commit that you are reading. Assuming that the "Fixes" commit is an ancestor of the commit whose commit footer you are reading.
I always just assumed that tools would take all the data into account. Which means that you both need to collide with the abbreviated hash as well as the subject. Now I don't do that since I just copy-paste the hash, but I would quickly notice in case the subject is different (and likely the commit message and the diff just look irrelevant).
I don't understand why the Linux Kernel has this hard-coded rule[2] -- again, you were going to get collisions eventually, so the tools should have just taken all the data into account (at least the subject) from the start. The recommendation in the Git project is to use `git show -s --pretty=reference`, without any fiddling with the abbreviation:
    <abbrev. hash> (subject, ISO date)
Although the Git maintainer uses `--abbrev=8` since git-show will just use a longer abbreviation in case the output would be ambiguous[1].
They could have used this instead if they wanted simpler, future-proof tooling:
    Fixes: <full hash>
Just like tools like git-revert and git-cherry-pick do.
[1]: https://lore.kernel.org/git/xmqq34j5h7v9.fsf@gitster.g/
[2]: Edit: hard-coded as opposed to Git just figuring out how long the abbreviation should be based on how many objects there are.
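For reference, a minimal Python sketch of pulling both data points (abbreviated hash and subject) out of a `Fixes:` trailer of the shape discussed above. The regular expression and the names `FixesTag` and `parse_fixes` are assumptions for illustration; the real kernel scripts may parse the line differently.

```python
import re
from typing import NamedTuple, Optional


class FixesTag(NamedTuple):
    abbrev: str   # abbreviated (or full) hex object ID
    subject: str  # commit subject quoted in the trailer


# Fixes: <hex id> ("<subject>")
FIXES_RE = re.compile(r'^Fixes:\s+([0-9a-f]{4,40})\s+\("(.*)"\)\s*$')


def parse_fixes(line: str) -> Optional[FixesTag]:
    """Return the parsed trailer, or None if the line is not a Fixes tag."""
    m = FIXES_RE.match(line)
    if m is None:
        return None
    return FixesTag(abbrev=m.group(1), subject=m.group(2))


# Hypothetical trailer, not taken from a real commit.
print(parse_fixes('Fixes: 0123456789ab ("example: fix a thing")'))
```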
- chrishill89 6 months ago> Presumably the problem is that these tools only take the abbreviated hash into account. Not also the subject:
Well the first mentioned script:
> > Tools like linux-next's “Fixes tag checker”,
has `get_full_hash`[1] which uses the subject to search through the abbreviated matches.
Edit: And that check was added two weeks ago by Kees [2].
[1]: https://github.com/kees/kernel-tools/blob/trunk/helpers/chec...
[2]: https://github.com/kees/kernel-tools/commit/5bf6a1e71df59a23...
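The idea of using the subject to pick among several abbreviated matches can be sketched roughly as follows. This is not the code of `get_full_hash`; it is a simplified Python illustration over an in-memory mapping, and the hashes and subjects are invented.

```python
def resolve_fixes(abbrev: str, subject: str, commits: dict[str, str]) -> str:
    """Resolve an abbreviated ID to a full one.

    `commits` maps full hex IDs to subjects. If the prefix alone is
    ambiguous, the subject is used as a tie-breaker.
    """
    candidates = [full for full in commits if full.startswith(abbrev)]
    if len(candidates) > 1:
        candidates = [full for full in candidates if commits[full] == subject]
    if len(candidates) != 1:
        raise LookupError(f"{abbrev!r} resolves to {len(candidates)} commits")
    return candidates[0]


# Two hypothetical commits sharing a 12-character prefix.
history = {
    "aaaaaaaaaaaa" + "1" * 28: "initial import",
    "aaaaaaaaaaaa" + "2" * 28: "unrelated change",
    "bbbbbbbbbbbb" + "3" * 28: "fix a typo",
}

print(resolve_fixes("aaaaaaaaaaaa", "initial import", history))
```

Narrowing by subject only when the prefix is ambiguous keeps old tags with slightly edited subjects working, at the cost of not catching a wrong subject on an unambiguous prefix; a stricter tool could check the subject every time.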
- weinzierl 6 months agogit itself does not use a fixed size abbreviation but determines the length necessary with the birthday paradox formula. It just happens to be seven characters most of the time, because most repos are small.
This is just for the abbreviated hash which git uses only for specific cases like display in the git log and similar, where it is a user experience improvement and relatively safe.
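To put numbers on the birthday-paradox point, here is a small Python sketch using the standard approximation p ≈ 1 - exp(-n(n-1) / (2·16^k)) for n objects and k hex characters. It is a textbook estimate under the assumption of uniformly random IDs, not git's own code.

```python
import math


def collision_probability(num_objects: int, hex_chars: int) -> float:
    """Approximate chance that at least two of `num_objects` random IDs
    share the same `hex_chars`-character prefix."""
    space = 16 ** hex_chars
    pairs = num_objects * (num_objects - 1) / 2
    # -expm1(-x) is a numerically stable 1 - exp(-x)
    return -math.expm1(-pairs / space)


for n in (1_000, 10_000, 100_000):
    print(n, collision_probability(n, 7), collision_probability(n, 12))
```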
- chrishill89 6 months agoI know. I don’t understand why the Linux Kernel uses a hard-coded abbreviation instead of just letting Git figure out how long it should be (edited my comment now).
- Too 6 months agoThe length of the required abbreviation grows over time, while content you put in commit messages sticks around. Meaning a commit done in year 2005 may have gotten away with "Fixes: abcd" and was unambiguous at that time, whereas after 20 years of growth, there could now be multiple other commits with that prefix.
By that logic, any length of abbreviation will eventually fail if stored for a long time; those abbreviations should be seen as temporary, for user interaction only. For the "Fixes:" footer they should have gone with the full hash. It's hidden away in the footer so it doesn't disrupt anyone. Anyone interacting with it will double click to copy it regardless of whether it is 8 or 40 characters long. Abbreviating it simply adds no value.
- dotancohen 6 months ago> with the birthday paradox formula.
At what probability? Can that probability be configured as a global Git option?
- Dylan16807 6 months agoSo as noted you can configure the length, but you can't actually adjust the probability used by the auto-length formula.
The formula calculates how many bits are needed to store the approximate number of objects, then uses twice that many bits, rounding up.
For example, at 16000 objects git is still using the minimum length of 7 characters. But at 17000 objects it now takes 15 bits to store the object count, so it wants 30 bits of hash, which means 8 characters.
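The rule as described in this comment can be written out in a few lines of Python. This is a sketch of the description above (bits needed for the object count, doubled, rounded up to whole hex characters, with a floor of 7), not a copy of git's source; the names are made up.

```python
MIN_ABBREV = 7  # minimum length mentioned in the comment


def auto_abbrev_length(object_count: int) -> int:
    """Abbreviation length in hex characters for a repo of this size."""
    bits_for_count = max(object_count, 1).bit_length()
    hash_bits = 2 * bits_for_count
    hex_chars = -(-hash_bits // 4)  # ceiling division: 4 bits per hex char
    return max(MIN_ABBREV, hex_chars)


for n in (16_000, 17_000, 1_000_000, 10_000_000):
    print(n, auto_abbrev_length(n))
```

With 16000 objects the count fits in 14 bits, so 28 bits of hash is exactly 7 characters; 17000 needs 15 bits, so 30 bits of hash rounds up to 8 characters, matching the example in the comment.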
- cesnja 6 months agoIt can be configured with `core.abbrev`.
https://git-scm.com/docs/git-config#Documentation/git-config...
- loeg 6 months agoCute. I've seen CPU commit prefix brute-forcers but not GPU ones. On the CPU, you can generate chosen 7-8 digit prefixes pretty easily (within a few minutes, I forget exactly). 12 digits is, of course, a massively bigger (factor of 2^16) space.
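A toy CPU version of such a brute-forcer, in Python, to show the mechanics: keep changing a throwaway field until the object ID starts with the chosen prefix. The object-ID helper follows git's `<type> <size>\0<content>` SHA-1 scheme; the `nonce:` line and the body are placeholders, not a valid commit object, and a real tool would vary something like a header or timestamp instead.

```python
import hashlib
from itertools import count


def git_object_id(kind: str, body: bytes) -> str:
    """SHA-1 object ID as git computes it: header, NUL byte, then content."""
    header = f"{kind} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


def find_nonce(body: bytes, prefix: str) -> tuple[int, str]:
    """Append an incrementing nonce line until the ID starts with `prefix`."""
    for nonce in count():
        candidate = body + f"\nnonce: {nonce}\n".encode()
        oid = git_object_id("commit", candidate)
        if oid.startswith(prefix):
            return nonce, oid
    raise AssertionError("unreachable: count() never ends")


# Two hex digits need about 16**2 = 256 tries on average.
nonce, oid = find_nonce(b"placeholder commit body", "ab")
print(nonce, oid)
```

Each extra hex digit multiplies the expected work by 16, which is where the 2^16 gap between 8 and 12 digits comes from.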
- timewizard 6 months agoWould tooling break if the radix was changed from 16 to something higher? You could double your resistance by just changing to base 32 representation in the same length
- Dylan16807 6 months agoSo much would break if you changed the character set. And there would be pretty much no benefit. Making it a hundred times harder* is not nearly enough to prevent short collisions.
* The exact amount would depend on repo and method. Using base 32 and changing nothing else would slow collisions by 2^length.
- timewizard 6 months ago> would slow collisions by 2^length.
It would add a bit to each position, going from the current 48 bits to 60 bits. Or 4,096 times more resistant to _accidental_ collision. You can have as many bits as you want; you just make it more time consuming to create one, but if your entire infrastructure blithely relies on one never _intentionally_ being submitted, you've solved nothing really.
Moreover, if you find a commit with a collision, it's very easy to slightly alter its contents to alleviate the problem, presuming we're only having to contend with unintentional conflicts.
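The arithmetic behind these figures, as a short Python sketch: a 12-character identifier carries 12·log2(radix) bits, so moving from base 16 to base 32 adds one bit per character. The function name is illustrative.

```python
import math


def prefix_bits(length: int, radix: int) -> float:
    """Bits of hash carried by `length` characters in the given radix."""
    return length * math.log2(radix)


hex_bits = prefix_bits(12, 16)     # 12 hex characters
base32_bits = prefix_bits(12, 32)  # same length in base 32
base64_bits = prefix_bits(12, 64)  # same length in base 64

print(hex_bits, base32_bits, base64_bits)
print(2 ** (base32_bits - hex_bits))  # size ratio of the two spaces
```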
- Dylan16807 6 months agoThe length of the short hash automatically adjusts to keep the chance of accidental collisions low. There's no need to lower the risk of accidental collisions.
So changing the character set doesn't notably improve the accidental situation, and it doesn't notably improve the deliberate situation. And it would be a huge pain to do.
- quotemstr 6 months agoWe could just encode hashes in base64 instead of hex and get 50% more bits of the hash (6 per character instead of 4) in the same size text.
- userbinator 6 months agoIn other words, 6 out of 20 bytes match. Good luck colliding the other 112 bits.
- bandrami 6 months agoThe problem is some (broken and wrong but in the wild) tooling relies on substrings like that
- Dilettante_ 6 months agoSomebody ought to tell them not to do that! [Spoken in the tone of 'You can't park there sir!']
- cwillu 6 months agohttps://people.kernel.org/kees/colliding-with-the-sha-prefix... is the actual link; the lwn article is just a link to it with a one sentence summary and a short quote.
- dang 6 months agoChanged. Thanks!
(Submitted URL was https://lwn.net/Articles/1003797/)