A decade of major cache incidents at Twitter
138 points by Smerity 3 years ago | 26 comments

- jsty 3 years ago
Major incidents aside, I always think that cache-related bugs are some of the most likely to go undetected since if you don't test for them end-to-end, they're really not that easy to spot & diagnose.
An article sticking around too long on the home page. Semi-stale data creeping into your pipeline. Someone's security token being accepted post-revocation. All really hard to spot unless (1) you're explicitly looking, or (2) manure hits the fan.
- daper 3 years ago
I categorize these as bugs caused by data inconsistency due to data duplication. That includes:
- Using asynchronous database replication and reading data from database slaves
- Duplicating the same data over multiple database tables (possibly for performance reasons)
- Having an additional system that duplicates some data, for example in the middle of rewriting a legacy system, where a process split into phases means functionality between the new and old systems overlaps for some period of time.
Based on my experience I always assume that inconsistency is unavoidable when the same information is stored in more than one place.
- rightbyte 3 years ago
Microsoft has some serious problems with token caching. I changed jobs last month, and for two or three weeks I could log into my old work account for a split second before being thrown out (I visited the page out of habit). I could see the news feed and mails, but not long enough to tell whether they were stale.
- terom 3 years ago
Required reading for all of the "I could code up Twitter in a weekend" types.
The long listen queue -> multiple queued-up retries feedback loop is a classic: https://datatracker.ietf.org/doc/html/rfc896 (TCP/IP "congestion collapse") and the 1986 Internet meltdown [various sources].
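(Not in the original comment: a minimal sketch of that feedback loop, with made-up numbers. The assumptions are a server with a fixed service rate, clients that retry once their request has been queued longer than their timeout, and retries that simply join the same queue. Once queueing delay crosses the client timeout, offered load jumps while useful throughput stays flat.)

    # Hypothetical back-of-envelope model of the listen-queue/retry feedback loop.
    SERVICE_RATE = 1000      # requests the server can complete per second
    ARRIVAL_RATE = 1200      # new (non-retry) requests per second -- already overloaded
    CLIENT_TIMEOUT_S = 1.0   # clients retry after waiting this long
    SIM_SECONDS = 10

    queue_depth = 0
    for second in range(SIM_SECONDS):
        # Requests queued longer than the client timeout get retried; the duplicate
        # joins the queue while the original is still waiting to be served.
        queue_delay_s = queue_depth / SERVICE_RATE
        retries = ARRIVAL_RATE if queue_delay_s > CLIENT_TIMEOUT_S else 0

        offered = ARRIVAL_RATE + retries
        queue_depth = max(0, queue_depth + offered - SERVICE_RATE)

        print(f"t={second:2d}s  offered={offered:5d}/s  queue={queue_depth:6d}  "
              f"delay={queue_depth / SERVICE_RATE:5.1f}s")

Bounding the queue (shedding load early) and retrying with exponential backoff plus jitter are the standard ways to break this loop.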
- YEwSdObPQT 3 years ago
It ultimately depends on the scale. To be fair, I think most people are talking about building a clone of what was present back in the mid 2000s. You could build a Twitter clone that could handle a few hundred users in a few weeks with modern tech stacks.
The fact that there are at least three Twitter clones that are less well put together, yet are handling a decent number of users, proves that it is possible.
- acdha 3 years ago
“A few weeks” sounds a lot longer than a weekend, and I’d also consider the history: Twitter itself was built quickly using a modern stack. Rails was highly productive; the problem is that the concept of the service makes scaling non-trivial. We have more RAM and SSDs now, so you could get further, but those aren’t magic.
- tluyben2 3 years ago
Twitter was down all the time, for hours, after launch. I don’t think I’d have an issue coding something in a weekend that has the functionality Twitter had at launch and goes down when it overloads. Most work on that kind of project goes into interpreting the specs and your business colleagues and fixing the mishaps; here you don’t have that.
- YEwSdObPQT 3 years ago
I agree. I was playing devil's advocate to an extent. At the time Rails was the first modern MVC framework as we understand it.
> Rails was highly productive; the problem is that the concept of the service makes scaling non-trivial.
Didn't they rewrite everything in PHP during the late 2000s because Rails at the time just wasn't able to scale?
- tluyben2 3 years ago
Those remarks are always made at launch, not later on. Dropbox and Twitter, both of which people said this about, were rather trivial at launch, especially with modern tooling. They also, and especially Twitter, had growing pains. Twitter definitely prioritised move fast and break things.
Obviously you cannot copy decades of improvements and scaling lessons unless someone has made a product of those parts that you can use.
- Smerity 3 years ago
What I find most interesting in this is the pseudo-detective story of hunting down disappearing post-mortem and "lessons learned" documentation. Optimistically we'd hope that the older systems no longer reflect the existing systems in any meaningful way (possibly as the org structures and/or software stacks shift and change) and so the documents are no longer relevant.
I'd imagine most lost knowledge is not the result of an explicit decision, however, which means such historical scenarios, documentation, and so on are just lost in the course of business. Lost knowledge is the default for companies.
Twitter is likely better than most given their documentation is all digital and there exist explicit processes to catalogue such incidents. I'd also be curious to see how much of this knowledge has been implicitly exported to their open source codebases.
- jka 3 years ago
What you've said is, in my opinion, likely to be a difference between the technology companies that become tomorrow's infrastructure and the ones that disappear (even if it takes decades).
As you say, the default tendency in many companies when failures occur is information loss. That can be attributed to using too many communication tools, cultural expectations that problems should be hidden, siloed or disparate documentation stores, or lack of process.
Intentional, open, thorough and replicated note-taking with cross-references before, during and after incidents can create radically different environments which allow for querying, recovery and improvement regardless of failure mode(s). Kudos to Dan for moving in that direction with these writeups (and to you for raising the subtext).
- plasma 3 years ago
I remember reading that Facebook's caches had a dedicated standby set of "gutter" servers (otherwise inactive and unused) that would quickly take over from a failed cache server; that was an interesting mitigation for some failure scenarios.
- Jach 3 years ago
These big incidents involving 'big cache' are fun to read about. Years ago I had to deal with a bunch of cache issues over a short time, but they were all minor incidents with minor uses of cache (simple memoization, storing stuff in maps on attributes of Java singletons, browser local storage). Still, I made a checklist of questions to ask thenceforth on any proposal or implementation of a cache in a doc or code review. A bunch of them are just focused on actually paying attention to what your keys are made of and how invalidation works (or if you even can invalidate, or if it's even needed). I think for 'big cache' questions I should just refer to this blog post and ask "what's the risk of these issues?"
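(Not part of the original comment: a minimal sketch, in Python with hypothetical names, of the "what are your keys made of" class of bug: a cached lookup whose key omits an input that affects the result, plus explicit invalidation and a TTL as a backstop.)

    import time

    _cache = {}        # key tuple -> (stored_at, value)
    TTL_SECONDS = 60

    def timeline_key(user_id, include_replies):
        # Buggy version: return (user_id,) -- it silently conflates the two variants,
        # so callers with different include_replies flags poison each other's entries.
        return (user_id, include_replies)   # every input that affects the output goes in the key

    def get_timeline(user_id, include_replies, loader):
        key = timeline_key(user_id, include_replies)
        entry = _cache.get(key)
        if entry is not None:
            stored_at, value = entry
            if time.monotonic() - stored_at < TTL_SECONDS:   # TTL as a fallback bound on staleness
                return value
        value = loader(user_id, include_replies)
        _cache[key] = (time.monotonic(), value)
        return value

    def invalidate_user(user_id):
        # Explicit invalidation: drop every cached variant for this user, e.g. after a new post.
        for key in [k for k in _cache if k[0] == user_id]:
            del _cache[key]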
- wizwit999 3 years ago
Yeah, see also: Marc Brooker has a good article on why the bimodal behavior of caches can cause a lot of headaches: https://brooker.co.za/blog/2021/08/27/caches.html
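(Not from the linked post: a back-of-envelope sketch with made-up numbers of why that bimodality hurts. A backend sized against the warm-cache miss rate sees an order of magnitude more traffic the moment the cache is cold or down.)

    # Hypothetical numbers illustrating the two operating modes of a cache-fronted backend.
    REQUESTS_PER_SECOND = 50_000
    WARM_HIT_RATE = 0.98       # steady state
    BACKEND_CAPACITY = 5_000   # what the service behind the cache can actually handle

    warm_backend_load = REQUESTS_PER_SECOND * (1 - WARM_HIT_RATE)   # 1,000 req/s -- comfortable
    cold_backend_load = REQUESTS_PER_SECOND                         # cache empty or down: 50,000 req/s

    print(f"warm: {warm_backend_load:,.0f} req/s vs capacity {BACKEND_CAPACITY:,}")
    print(f"cold: {cold_backend_load:,.0f} req/s vs capacity {BACKEND_CAPACITY:,}")
    # A backend that looks 5x over-provisioned in the warm mode is 10x under-provisioned
    # in the cold mode, which is why cache failures tend to cascade.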
- mprovost 3 years ago
"There are only two hard things in Computer Science: cache invalidation and naming things." -- Phil Karlton
- dspillett 3 years ago
There are only two hard things in Computer Science: cache invalidation, naming things, and off-by-one errors.
- Kavelach 3 years ago
There are only two hard problems in Computer Science: there's one joke, and it's not even funny
- capableweb 3 years ago
I prefer the less obvious version of this one:
> There are only three hard things in Computer Science: cache invalidation and naming things.
- giantrobot 3 years ago
That's the one I use all the time.
- spoonjim 3 years ago
“On Nov 8, a user changed their name from tigertwo to Woflstar_Bachi.”
Horrifically inappropriate inclusion of PII in this post. Didn’t someone at legal go through this?
- formerly_proven 3 years ago
It's a current, public profile:
> Wolfstar_Bachi @tigertwo
> Wolfstar is an online and social media PR agency that specialises in helping some of the world’s best companies to communicate more effectively.