DoorDash manages high-availability CockroachDB clusters at scale

52 points by orangechairs 1 year ago | 61 comments
  • gizmo 1 year ago
    DoorDash has about 35 million users, and there is zero interaction between users. The median user uses DoorDash maybe once a week. So 5 million sessions a day, all happening in the same 3-hour window. That's roughly 1.7 million sessions per hour at peak times.

    How does DoorDash get to 1.2 million queries per second? 1.2M qps * ~10,000 seconds in 3 hours = 12 billion queries to process 5 million orders? That's wild. Is it all analytics? This is highly suspect. 35m users isn't nothing, but it isn't exactly Facebook scale either.

    • BrentOzar 1 year ago
      I’m not excusing the wild number, but just tossing out some additional load:

      * Drivers checking in for work, especially if the apps poll automatically
      * Drivers phoning home with live location updates
      * Restaurants sending automated updates on order status
      * Push notifications to users with status changes on their orders
      * Users with multiple devices (like I have at least 5 devices with the UberEats app)
      • jakjak123 1 year ago
        Yes, our server had 120k queries/sec, but 80% of that traffic was driver heartbeats or connection verification. We halved it by disabling the connection verification query.
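
        For context on what that verification can look like: many connection pools issue a liveness query on every checkout. A sketch with SQLAlchemy, which calls this pool_pre_ping (the DSN is hypothetical, and this is just one common form of the pattern, not necessarily the stack above):

            from sqlalchemy import create_engine

            # pool_pre_ping makes the pool issue a liveness round trip
            # (roughly a "SELECT 1") each time a connection is checked
            # out. At high checkout rates those pings become a large
            # share of total query volume.
            engine = create_engine(
                "postgresql://app@db:5432/app",  # hypothetical DSN
                pool_pre_ping=False,  # drop the per-checkout check
            )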
        • javawizard 1 year ago
          Hold up, what do you mean by "our server"? Do you work for DoorDash?
        • cdchn 1 year ago
          Even just searching and browsing restaurants and menus is probably dozens of queries for every interaction.
          • gizmo 1 year ago
            Serializing json is pretty expensive at scale. I would be shocked if restaurant and menu json doesn't get cached aggressively.
          • scottlamb 1 year ago
            > 12 billion queries to process 5 million orders?

            2,400 queries per order? That's not that crazy IMHO. There might be significant database fan-out on each click (depending on how they do geographic lookups, search ranking / synonyms / sponsored stuff, the repeat-your-last-order feature, whether the ranked search returns the full object or a reference that then has to be individually queried, etc.). There might be many clicks per order because people browse a lot (first to find a restaurant, then to find dishes within the restaurant), leave reviews, poll for delivery status updates, etc.

            • gizmo 1 year ago
              That's fair, but it also suggests most actions hit the main database directly instead of caching layers. Possible, but somewhat unusual at this scale.
              • scottlamb 1 year ago
                In quorum systems like CockroachDB, non-leaders provide tons of extra capacity for eventually consistent reads. [edit: maybe a bit less so in a big database, because at any instant each machine will be a leader for some shards and a non-leader replica for others.] It's not always worth the complexity of having a high-hit-rate cache in front of that. Maybe no cache is needed, or just one to mitigate the worst of the hot spots.
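
                (For reference, CockroachDB exposes those cheaper non-leader reads as "follower reads" via AS OF SYSTEM TIME. A minimal sketch with psycopg2; the table and connection string are hypothetical:)

                    import psycopg2

                    conn = psycopg2.connect(
                        "postgresql://root@localhost:26257/app")
                    with conn.cursor() as cur:
                        # follower_read_timestamp() picks a slightly
                        # stale timestamp that any replica can serve
                        # without routing through the leaseholder,
                        # trading freshness for read capacity.
                        cur.execute(
                            "SELECT * FROM menus"
                            " AS OF SYSTEM TIME follower_read_timestamp()"
                            " WHERE restaurant_id = %s",
                            (42,),
                        )
                        rows = cur.fetchall()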
              • beembeem 1 year ago
                > 2,400 queries per order? That's not that crazy IMHO.

                Isn't that off by at least an order of magnitude though? It forces them to operate a much larger cluster than should be necessary.

                • scottlamb 1 year ago
                  > Isn't that off by at least an order of magnitude though?

                  No, for all the reasons I just said?

                  > It forces them to operate a much larger cluster than should be necessary.

                  How much machine cost and operational effort do you imagine they would save if they reduced the qps by a factor of 10 without changing the number of regions, number of tables, or size of the data? How much SWE time do you imagine that'd take to do and maintain?

                  I've run a global Paxos-based database that received two orders of magnitude more qps than this. It cost less than you're probably imagining. I sometimes hunted down silly queries (mostly leader ops, and mostly to mitigate hot spots or as a quixotic latency reduction effort)... overall, this was the cheapest layer of the system.

                  A query to a well-implemented OLTP database is not like a request to some Python/PHP/Ruby app.

              • rbranson 1 year ago
                Given how these blog posts are typically written, it is very likely the 1.2 million QPS figure is an all-time peak, not anything like an average.
                • orangechairs 1 year ago
                  According to their slides/video (bottom of the blog post), the 1.2 million QPS is their daily peak number, not the average.
                • beoberha 1 year ago
                  It’s naive to think that database access is only happening when a customer makes an order. Each driver has workflows they exercise and data that needs to be stored. Same with vendors. There could be operational data for their infrastructure.

                  That said 1.2 million queries per second is wild. Would be interesting to see the breakdown.

                  • jen20 1 year ago
                    > there is zero interaction between users

                    A curious description for a platform which acts as a broker for transactions between users!

                    • indymike 1 year ago
                      Massive amounts of user tracking.
                      • Thaxll 1 year ago
                        Your numbers are off by a large factor.
                        • milkglass 1 year ago
                          It's all the new reminders telling users to tip.
                        • xyst 1 year ago
                          > About 1.2 million queries per second at daily peak hours.

                          > About 2,300 total nodes spread across 300+ clusters.

                          > About 1.9 petabytes of data on disk.

                          > Close to 900 changefeeds.

                          > Largest cluster is currently 280 TB in size (but has peaked above 600 TB), with a single table that is 122 TB.

                          all of this yet my food still arrives cold af

                          kidding aside, I wonder if DD has the same problems as Uber or Lyft except with food delivery. Each new "change feed" is a specific region, county/municipality, or city. Federal, state, and local laws all handled delicately.

                          • orangechairs 1 year ago
                            DoorDash's engineering blog has a much more in-depth look at their architecture: https://doordash.engineering/2023/02/07/how-we-scaled-new-ve...
                            • beembeem 1 year ago
                              > my food still arrives cold af

                              Ha.

                              The first thing I noticed, and you almost got to it in your summary: at 1.2M/2,300 ≈ 520 qps per node, this isn't a wild setup. I'm still wrapping my head around how they're generating that amount of load; per node it seems like an easy task for any database to handle.

                            • rickreynoldssf 1 year ago
                              I'm not really seeing why DoorDash needs all their operational data in one monster clustered database. I would think it's so much simpler to shard the data by region for operational queries and aggregate in the background for long-term storage.
                              • jordanthoms 1 year ago
                                Sharding is anything but simple. A single shard per region wouldn't have enough write capacity, so they'd likely have to manage 100+ shards in each region. You'd have to build a lot of infrastructure to automate setting those up and rebalancing traffic to avoid hot spots and underutilized shards, all kept in sync with schema migrations, etc.

                                Even after that, your applications using the DB now have to be aware of the sharding; interactions between users housed on different shards could require a lot of work at the application layer. If your customers can easily be split into tenants that never interact with each other, this isn't so bad, but a consumer app like DoorDash has no clear tenant boundaries.
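
                                A sketch of just the routing piece of that infrastructure (names and counts hypothetical), before any rebalancing, resharding, or cross-shard work:

                                    import hashlib

                                    SHARDS_PER_REGION = 128  # hypothetical

                                    def shard_for(region: str, user_id: int) -> str:
                                        # Stable hash-based routing; simple, but
                                        # changing the shard count later means
                                        # moving data and coordinating a cutover.
                                        h = int(hashlib.md5(
                                            str(user_id).encode()).hexdigest(), 16)
                                        return f"{region}-shard-{h % SHARDS_PER_REGION}"

                                    # Every query path now starts with a lookup like
                                    # shard_for("us-west", 12345) -> "us-west-shard-57",
                                    # and schema migrations have to visit every shard.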

                                We looked at all this for Kami and realised that it would be much easier for us to move from PostgreSQL to CockroachDB (we had exceeded the write capacity of a single PostgreSQL primary) than to shard Postgres, and it'd make future development much faster. We could have made sharding work if we had to... but it's not 2013 any more and we have distributed SQL databases, why not use them?

                                • cellularmitosis 1 year ago
                                  That’s surprising — the education market seems like an even better fit for sharding: students and teachers generally stay within the context of a single school.
                                  • jordanthoms 1 year ago
                                    It does seem like there would be a clean boundary around each school district, but actually there's plenty of sharing and collaboration on Kami between users in different districts: teachers and students move schools, parents can have children in different school districts, etc. Even a single classroom assignment can cross those boundaries, e.g. when someone external comes in to do a training session.

                                    We could have modified our application layer to handle those cases, but it's a lot of extra complexity and room for error, and we'd have had to consider and solve for all of these cross-tenant situations as we add new functionality, so I was really keen to avoid that.

                                    Also, there are some really big districts - NYCDOE has >1.1 million students and 1,800 schools. Even with them on a dedicated shard, it's quite possible that it'd get overloaded and we'd be spending more dev effort figuring out how to safely split them onto multiple shards.

                                    When we looked at using a distributed SQL database instead it was a clear win - from the application's perspective, it just looks like a really, really big PostgreSQL box, so we didn't need to change much (the SQL support is very close to PG; the most annoying thing for us was the lack of trigram indexes, and Cockroach has now added those). And in terms of the operational side, upgrading and maintaining CRDB has actually been easier than PG - version upgrades are easier to do without downtime, and schema migrations don't lock tables.

                                • JohnBooty 1 year ago
                                  I've never done geographic sharding but it seems kind of hard. How do you pick shard boundaries? How do you deal with entities who are near the boundaries and whose current operational data therefore spans >1 shard? (Imagine somebody near the geographic intersection of, like, five shards looking for pizza in a 10-mile radius or w/e.)

                                  Also the majority of entities they're tracking (users, drivers) do not have fixed locations.

                                  Maybe it's not as hard as I'm thinking. I guess you just have to accept that any query can span an arbitrary number of shards and the results need to be union'd.

                                  I'm sure a lot of smart people have tackled this at the DoorDashes and Ubers of the world and maybe there's some optimal way of handling it. I would love to hear about that.

                                  • scottlamb 1 year ago
                                    Great points that show regional databases are not obviously simpler than one global database, especially from the application developer's perspective.

                                    > How do you deal with entities who are near the boundaries and whose current operational data therefore spans >1 shards? (Imagine somebody at near the geographic intersection of like, five shards looking for pizza in a 10 miles radius or w/e)

                                    Hitting 5 shards might not be that bad. I think you could divide the world into sufficiently large hexagonal tiles; you'd hit at most three shards then. Maybe each fixed-size tile is a logically separate database. Some would be much hotter than others, so you don't want to, like, back each one with a fixed-size traditional DBMS or something; that'd be pretty wasteful.
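
                                    A crude sketch of that tiling (planar axial hex math over raw lat/lng degrees, purely illustrative; a real system would use something like Uber's H3):

                                        import math

                                        def hex_tile(lat: float, lng: float, size: float):
                                            # Map a point to axial (q, r) coordinates of a
                                            # pointy-top hex grid. Treating degrees as planar
                                            # is wrong near the poles but fine for a sketch.
                                            q = (math.sqrt(3) / 3 * lng - lat / 3) / size
                                            r = (2 * lat / 3) / size
                                            # Cube rounding snaps fractional coordinates
                                            # to the containing hexagon.
                                            x, z = q, r
                                            y = -x - z
                                            rx, ry, rz = round(x), round(y), round(z)
                                            dx, dy, dz = abs(rx - x), abs(ry - y), abs(rz - z)
                                            if dx > dy and dx > dz:
                                                rx = -ry - rz
                                            elif dy > dz:
                                                ry = -rx - rz
                                            else:
                                                rz = -rx - ry
                                            return int(rx), int(rz)

                                        # Each tile is a logically separate database; a radius
                                        # query fans out to the tiles the circle touches, at
                                        # most 3 when tiles are large relative to the radius.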

                                    > Also the majority of entities they're tracking (users, drivers) do not have fixed locations.

                                    Yeah, you at least want a global namespace for users with consistent writes. The same email address belonging to different people in different regions is unacceptable. In theory the global data here could just be a rarely-updated breadcrumb pointing to which database holds the "real" data for that user. [1] So you can make the global database be small in terms of data size, rarely-written, mostly read in eventually consistent fashion, and not needed for geographically targeted queries. That could be worthwhile. YMMV.

                                    [1] iirc Spanner and Megastore have a feature called (something like) "global homing" that is somewhat similar to this. For a given entity, the bulk of the data is stored in some subset of the database's replicas, and bread crumbs indicate which. If you get a stale bread crumb, you follow the trail, so looking up bread crumbs with eventually consistent reads is fine. [edit to add a bit more context:] One major use case for this is Gmail. It has tons of regions in total, but replicating each user's data to more than 2 full replicas + 1 witness would be absurdly wasteful.

                                    [edit:] looks like CockroachDB has the concept of a per-row preferred region, which might also be vaguely similar. <https://www.cockroachlabs.com/docs/v23.1/table-localities#re...> I haven't used CockroachDB and only skimmed this doc section.
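
                                    A minimal sketch of that breadcrumb pattern (hypothetical handles with read()/read_stale() methods standing in for real clients):

                                        def find_user(global_db, regional_dbs, email):
                                            # The global table is tiny and rarely written:
                                            # just email -> home region. A stale read is
                                            # fine; a wrong crumb only costs an extra hop.
                                            region = global_db.read_stale(
                                                "SELECT region FROM homes WHERE email = $1", email)
                                            while True:
                                                user = regional_dbs[region].read(
                                                    "SELECT * FROM users WHERE email = $1", email)
                                                if user is not None:
                                                    return user
                                                # Crumb may be stale (the user moved):
                                                # re-read it consistently, follow the trail.
                                                fresh = global_db.read(
                                                    "SELECT region FROM homes WHERE email = $1", email)
                                                if fresh == region:
                                                    return None  # no such user anywhere
                                                region = fresh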

                                    • jeffbee 1 year ago
                                      I think the global homing feature you are describing belongs to applications built atop Megastore (and Spanner). That is, Megastore itself is agnostic to it, but the application knows how to resolve the homing.
                                      • JohnBooty 1 year ago
                                        Thanks for a thoughtful and informative reply, I learned some things.
                                      • jfim 1 year ago
                                        > I've never done geographic sharding but it seems kind of hard. How do you pick shard boundaries? How do you deal with entities who are near the boundaries and whose current operational data therefore spans >1 shards? (Imagine somebody at near the geographic intersection of like, five shards looking for pizza in a 10 miles radius or w/e)

                                        You could do it by market (eg. SFBA, Los Angeles, San Diego) or by state.

                                        • jordanthoms 1 year ago
                                          They would have to have many shards per city to keep up with the level of write traffic though. And what happens when a user from SFBA goes down to LA?
                                        • FinnKuhn 1 year ago
                                          You could probably just do it by continent, as no one from SA would order anything from Europe, but at that point each of those databases is probably big enough that you'd need a similar solution to just having a single one
                                          • JohnBooty 1 year ago

                                                You could probably just do it by continent as no one from SA would order anything from Europe
                                            
                                            Imagine a luxury version of DoorDash that does work this way. As I awake in my luxury palace in Sao Paulo, I realize that I would like some fresh grapes from the champagne region of France. With a few taps on my Luxury Door Dash app, a plane is on its way with my grapes.
                                        • mbyio 1 year ago
                                          Cockroach automates the sharding of data by region and provides tools that let you use and manage it more like a traditional database. If they didn't use Cockroach, they would have to write/set up tools and adapters to do all that anyway. It would probably be more familiar to developers conceptually if they used traditional sharding, but why build and maintain all that when you can just use off-the-shelf software?
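
                                          For a sense of what "automates" means here: the multi-region setup CockroachDB exposes is a few DDL statements rather than a routing layer (a sketch; database, table, and region names hypothetical):

                                              import psycopg2

                                              conn = psycopg2.connect(
                                                  "postgresql://root@localhost:26257/app")
                                              conn.autocommit = True
                                              with conn.cursor() as cur:
                                                  # Declare the regions once...
                                                  cur.execute(
                                                      'ALTER DATABASE app SET PRIMARY REGION "us-east1"')
                                                  cur.execute(
                                                      'ALTER DATABASE app ADD REGION "us-west1"')
                                                  # ...then rows of a REGIONAL BY ROW table are
                                                  # pinned to and served from their own region;
                                                  # placement and rebalancing are the DB's job.
                                                  cur.execute(
                                                      "ALTER TABLE orders SET LOCALITY REGIONAL BY ROW")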
                                        • mike_d 1 year ago
                                          Because CockroachDB is a vendor that abstracts away all the thinking parts of running a database cluster. They do regional sharding, clustering, consistency, etc. for you.

                                          They could have just as easily dropped in Oracle. You pay for expensive DB up front, and can hire cheaper junior DBAs and developers going forward.

                                          • sciurus 1 year ago
                                            The article says they have 300+ clusters, not one monster one.
                                            • esafak 1 year ago
                                              Manual sharding is a crutch and a pain. Just use a distributed database, and let the database company worry about it.
                                              • killingtime74 1 year ago
                                                Resume driven development
                                                • ravenstine 1 year ago
                                                  But it's at scale!!!!!
                                                • sean0- 1 year ago
                                                  Ha! Amazing (didn’t know this was being written or put up).

                                                  This is a summary of a recent conference talk:

                                                  https://youtu.be/jCjrfpF64Kc?si=Gf-gp_ixX2V6Qz8V

                                                  This was my team. We did and lived this. AMA.

                                                  • sverhagen 1 year ago
                                                    Well, it looks like the sibling threads are very interested to know where the need comes from to even _have_ 1.2 million queries per second. How does that break down? How much is that just core functionality versus analytics and tracking?
                                                    • sean0- 1 year ago
                                                      That's core functionality, not analytics. Nearly all of that is browsing, ordering, location updates, etc. The dismissive comments are amusing and show a lack of understanding of how the business works and, consequently, the technology required to power the end-to-end flow for users.
                                                      • sverhagen 1 year ago
                                                        Of course we don't know your business; we do back-of-envelope math based on our own experiences and of course a lot of assumptions about yours. Those numbers are just... impressive... or unbelievable... depending on where you're coming from.
                                                  • joshstrange 1 year ago
                                                    Do you know what DoorDash doesn’t manage? A staging/test environment. All testing for API integrations is done in prod on a shared account. The docs and the API endpoints themselves leave a lot to be desired as well.
                                                    • jvans 1 year ago
                                                      I've been advocating for this approach for a long time. At a certain size it is so brutally difficult to maintain an environment that mirrors production that the effort isn't worth it. With enough tooling in place you can mitigate the risk to customers significantly.
                                                      • stingraycharles 1 year ago
                                                        So I assume they use feature flags instead, with staggered rollouts of new features? That's a common alternative to heavy up-front testing.
                                                      • snihalani 1 year ago
                                                        interesting. curious if anyone has benchmarked it relative to other dbs. like: https://benchmark.clickhouse.com/
                                                        • karmakaze 1 year ago
                                                          CrDB is not about many, many low-latency queries, like say MySQL. It's designed more for getting your workload down to as few, larger queries as possible, since every one incurs quorum latencies. You don't want to prototype something in Rails and hope there are no hidden lazy queries happening along the way.

                                                          It wouldn't be a good idea to take a large working PostgreSQL app and try to switch over to using CrDB. You'd spend all your time (unwittingly rewriting the entire app) speeding up and grouping a few queries at a time.

                                                          • jordanthoms 1 year ago
                                                            We took a large working PostgreSQL app and switched it over to CRDB, and that doesn't match my experience. Our existing schemas and query patterns moved over nicely - latency for small indexed reads and writes did increase from ~1ms to ~3ms, but the max throughput is now effectively unlimited, since we can add capacity by adding new nodes into the cluster and letting CRDB automatically rebalance the workload. There was an increase in cost as it needs more cores, disk, etc. compared to a single-primary PostgreSQL, but that makes sense when you consider that every bit of data is getting stored on 5 different nodes and there are overheads to maintain the consistency.

                                                            For the highest throughput endpoints we did make some changes to be more optimal on CRDB so we could run a smaller cluster, but it didn't require anything close to a rewrite.

                                                            • karmakaze 1 year ago
                                                              That's a good story and datapoint to hear. Meanwhile upgrading one version from MySQL 5.7 to 8.0 is taking a while and will be much appreciated where I am. I don't expect any problems but wouldn't be surprised to be surprised.
                                                            • namibj 1 year ago
                                                              You can have small queries, they just have to be sent before you block on the results from the first of each group.
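
                                                              A sketch of that with asyncpg over CockroachDB's PG wire protocol (tables hypothetical); the point is to overlap the quorum round trips instead of paying them serially:

                                                                  import asyncio
                                                                  import asyncpg

                                                                  async def order_page(pool, uid, oid):
                                                                      # Fire independent lookups together;
                                                                      # each pays quorum latency once, so
                                                                      # overlapping them hides most of it.
                                                                      return await asyncio.gather(
                                                                          pool.fetchrow(
                                                                              "SELECT * FROM users"
                                                                              " WHERE id = $1", uid),
                                                                          pool.fetchrow(
                                                                              "SELECT * FROM orders"
                                                                              " WHERE id = $1", oid),
                                                                          pool.fetchrow(
                                                                              "SELECT * FROM couriers"
                                                                              " WHERE order_id = $1", oid),
                                                                      )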
                                                            • jordanthoms 1 year ago
                                                              Clickhouse is a totally different use case - Cockroach is OLTP, Clickhouse is OLAP. We use both Cockroach and Clickhouse at scale and they are both great but not competing products - Cockroach is great for the types of reads and writes you do when serving user requests, processing transactions etc, but isn't optimal for analytics queries where you are going to do things like read and aggregate data on a 50TB table. Clickhouse eats those kinds of aggregate queries for breakfast, and is fast for some types of small read queries too, but it's not built to handle random writes or frequently updating rows of data.
                                                            • cebert 1 year ago
                                                              This reads like a long form advertisement.
                                                              • al_borland 1 year ago
                                                                Case studies hosted on a company's own website generally are. It's kind of an "it worked for them, so it will work for you" thing.
                                                                • candiddevmike 1 year ago
                                                                  "Art of the possible" (YMMV)