Why CockroachDB doesn't use EvalPlanQual

135 points by michae2 1 year ago | 58 comments
  • michae2 1 year ago
    Author here. We've spent the past year adding read committed isolation to CockroachDB.

    There were many interesting design decisions, such as:

    - whether to use multiple snapshots or a single snapshot per statement

    - how to handle read uncertainty intervals

    - how to incorporate SELECT FOR UPDATE locking into Raft

    - how to handle SELECT FOR UPDATE subqueries

    - how to prevent lost update anomalies between two UPDATEs

    Some of the gory details are in the public RFC: https://github.com/cockroachdb/cockroach/blob/master/docs/RF...

    This blog post just discusses the last point, but please AMA.

    • sbstp 1 year ago
      Why is Cockroach adding READ COMMITTED? Is using a lower isolation level better for performance, or does it just reduce the number of serialization errors and retries that need to be done?
      • kogir 1 year ago
        READ COMMITTED is great for applications that need a coherent snapshot of the database but not necessarily the very latest data, which in my experience is actually most apps.

        It allows readers to see valid data (relationships are correct), while not blocking writers. It can be the difference between constant deadlocks and super-high throughput without lock contention.

        • michae2 1 year ago
          The main motivation is to reduce serialization errors for applications that can handle the weaker isolation level. Especially for applications that were previously running fine under RC on another database.
        • nextaccountic 1 year ago
          Can this improvement also be implemented in PostgreSQL, removing the need for EvalPlanQual?

          Maybe as an option with a new syntax (UPDATE .. WHERE .. RETRYING) or something like this.

        • foota 1 year ago
          It's often noted that NewSQL databases offer better scalability than traditional SQL databases, but I think it's sometimes underappreciated that (some, not all) of them are also much simpler in terms of their consistency models.

          I'd speculate this is because Postgres and friends try to eke out every bit of single-node performance (which helps with single-row throughput and overall throughput, where they're obviously much better than NewSQL), but the scalability of NewSQL databases might allow them to prefer easy consistency over single-node performance.

          Possibly this is also just the passage of time benefiting newer systems.

          • wbl 1 year ago
            Read committed is explicitly asking for hard mode. If you want a simple life, stick with Serializable as always. It took years before people found anomalies in Repeatable Read in Postgres. This stuff is hard even for world-class researchers.
            • paulryanrogers 1 year ago
              How does one get serializable in a multi-writer system without a lot more locking and having to retry at app layer?
              • Diggsey 1 year ago
                You always need app-level retries for SERIALIZABLE isolation level. You don't need any explicit locking - the database should handle that for you (and in the case of PostgreSQL, locking is not the only tool it uses for avoiding serialization anomalies).

                The strategy I use is to keep transactions as small as possible, and have retry functionality built into the transaction abstraction, so the business logic doesn't really need to worry about it. I also explicitly use read-only transactions where possible.
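
                Roughly, the unit that gets retried looks like this (a sketch with a made-up `account` table; the loop that re-issues it on SQLSTATE 40001 lives in the application's transaction wrapper, not in SQL):

                  BEGIN ISOLATION LEVEL SERIALIZABLE;
                  -- keep the body small: read what you need, write, get out
                  UPDATE account SET balance = balance - 10 WHERE id = 1;
                  UPDATE account SET balance = balance + 10 WHERE id = 2;
                  COMMIT;
                  -- if any statement (including COMMIT) fails with SQLSTATE 40001,
                  -- ROLLBACK and re-run the whole block from BEGIN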

            • mjb 1 year ago
              Even more generally, distributed systems can find simpler solutions to things like "raise the throughput ceiling", and "handle disk failure", and "handle power failure" than single-box systems. This is for the simple reason that they have more options: beyond the constraints of a box, resource allocation is more flexible, failures less correlated, etc. That allows modern distributed databases to simply avoid some of the super hard problems that prior databases had to solve. Efficiency is still important, but the thing to optimize is mean system efficiency, not the peak performance of a handful of super hot boxes.

              There's also the fact that decades of DB research have brought techniques and approaches that beat old ones, and retrofitting existing systems with them can be hard (e.g. see the efforts to remove some of the sharp edges of PG's MVCC behavior and how hard they've turned out to be).

            • CGamesPlay 1 year ago
              How does the CockroachDB approach not deadlock? Surely retrying could encounter a situation where two competing UPDATEs will lock rows in different orders, and no amount of retrying will unlock the required rows, right?
              • dilyevsky 1 year ago
                I wondered this too and found this in the docs[0]:

                  Transactions at all isolation levels are subject to lock contention, where a transaction attempts to lock a row that is already locked by a write or
                  locking read. In such cases, the later transaction is blocked until the earlier transaction commits or rolls back, thus releasing its lock on the row.
                  Lock contention that produces a deadlock between two transactions will result in a transaction abort and a 40001 error
                  (ABORT_REASON_ABORTED_RECORD_FOUND or ABORT_REASON_PUSHER_ABORTED) returned to the client.
                
                So looks like you still get a good old 40001 error just like with SERIALIZABLE isolation.

                [0] - https://www.cockroachlabs.com/docs/stable/read-committed#rea...

                • michae2 1 year ago
                  It's a good question. For simple UPDATEs, CockroachDB always executes in a deterministic, serial order and so it's likely the rows will be locked in the same order by any competing updates. (This can be confirmed by looking at the query plans.) Complex UPDATEs using joins and subqueries will need explicit ORDER BY to always lock in the same order.

                  If an UPDATE has to retry halfway through, locks are held across the retry to help the system make progress. But as you point out, this could cause lock acquisition to happen in an unexpected order if new rows qualify during a retry. So far we haven't run into this, but we might need to provide an option for an UPDATE to drop locks on retry if deadlock turns into a bigger problem than livelock. It depends on the workload.
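
                  One way to pin the lock order for a complex update is an explicit locking read with ORDER BY in front of it; a rough sketch using the player table from the post:

                    BEGIN;
                    -- take the row locks in a deterministic (id) order up front, so two
                    -- competing statements can't grab the same locks in opposite orders
                    SELECT id FROM player WHERE team = 'Gophers' ORDER BY id FOR UPDATE;
                    UPDATE player SET level = 'AA' WHERE team = 'Gophers';
                    COMMIT;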

                  • temac 1 year ago
                    > So far we haven't run into this, but we might need to provide an option for an UPDATE to drop locks on retry if deadlock turns into a bigger problem than livelock.

                    I'm not sure what you mean by that: the design can deadlock but you just have not seen it happening yet?

                    Edit: oh I see in a comment below that deadlocks are detected and abort the transaction.

                • erhaetherth 1 year ago
                  I'm having trouble with the example given.

                  If `UPDATE player SET level = 'AA' WHERE team = 'Gophers';` is executed before the player swap, then why should "Stonebreaker" be upgraded to "AA"? I'd be pretty mad at my database if I sent those 2 queries in sequence and my DB decided to re-order them.

                  The sleep actually really complicates things here. I understand some queries run slower than others and the sleep is a useful tool to artificially slow things down, but now I don't know if I should interpret that as one command or two. If `WITH sleep AS (SELECT pg_sleep(5)) UPDATE player SET level = 'AA' FROM sleep WHERE team = 'Gophers';` is atomic then I'd expect it to put a lock on the 3 Gophers (which doesn't include Stonebreaker), wait the 5 seconds and then complete the update. The player swap would be blocked for those 5 seconds because it touches a row that's being updated.

                  • michae2 1 year ago
                    The timing of this example is tricky because the two update statements execute concurrently (which is only possible under read committed isolation; under serializable isolation it's much more like what you're describing).

                    Here's a full timeline in PG (U1 for first update, U2 for second update):

                    0. U1 begins executing, establishes read snapshot, starts pg_sleep(5).

                    1. U2 runs to completion.

                    2. U1 wakes up after 5 sec, scans `player` using snapshot from step 0.

                    3. U1 filters `team = Gophers`, gets 4, 5, 6.

                    4. U1 locks 4, 5, 6.

                    5. U1 performs EvalPlanQual: re-scans the latest versions of those locked rows, which sees U2's write to 4 but not to 3.

                    6. U1 performs EvalPlanQual: re-filters those locked rows using the latest versions, gets 5, 6.

                    7. U1 writes new versions.

                    CRDB is easier to reason about: after U1 wakes up from the sleep, it sees that it conflicts with U2 and simply retries the entire statement.
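
                    If you want to reproduce the interleaving yourself, it's roughly these two statements in two psql sessions (the swap is paraphrased here; see the post for the exact SQL and schema):

                      -- session 1 (U1): establishes its snapshot, then sleeps before scanning
                      WITH sleep AS (SELECT pg_sleep(5))
                      UPDATE player SET level = 'AA' FROM sleep WHERE team = 'Gophers';

                      -- session 2 (U2), run while U1 is sleeping: swap the two players'
                      -- teams and levels (paraphrased form of the swap from the post)
                      UPDATE player AS p
                      SET team = o.team, level = o.level
                      FROM player AS o
                      WHERE p.name IN ('Stonebreaker', 'Lamport')
                        AND o.name IN ('Stonebreaker', 'Lamport')
                        AND o.name <> p.name;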

                    • masklinn 1 year ago
                      > If `UPDATE player SET level = 'AA' WHERE team = 'Gophers';` is executed before the player swap, then why should "Stonebreaker" be upgraded to "AA"? I'd be pretty mad at my database if I sent those 2 queries in sequence and my DB decided to re-order them.

                      It’s easy to miss but in the swap query the levels also get swapped. Because — and it’s harder to miss but easy to skip over — given what constraint 2 says the level is actually a team level, not a player level.

                      So in a seqcst view, either the team’s players get upgraded to AA then the players’ levels get swapped during the exchange, or the players get exchanged then the team’s players get upgraded.

                      In both sequences you end up with Stonebreaker as an AA gopher and Lamport as an A dolphin.

                      • npstr 1 year ago
                        > given what constraint 2 says the level is actually a team level, not a player level.

                        Then it's bad unnormalized data design that is the problem here. If that is a team level, it should be in the team table, not the player table.

                        • masklinn 1 year ago
                          > Then it's bad unnormalized data design that is the problem here.

                          That the table is not normalised makes the example somewhat confusing but it does not actually affect the issue being demonstrated. And denormalisation is a fact of data modelling.

                          • michae2 1 year ago
                            > If that is a team level, it should be in the team table, not the player table.

                            It's all contrived, of course, but the reason I would consider skill level to be a player attribute rather than a team attribute is that there could be free agents with a skill level but no team:

                            INSERT INTO player VALUES (10, 'Pavlo', 'AAA', NULL);

                            Then with enough free agents, you could imagine building a new team out of free agents that are all at the same skill level:

                            UPDATE player SET team = 'Otters' WHERE level = 'AAA' AND team IS NULL ORDER BY id LIMIT 3;

                      • ngalstyan4 1 year ago
                        For similar isolation level anomalies in real world applications check out this SIGMOD '17 paper:

                        ACIDRain: Concurrency-Related Attacks on Database-Backed Web Applications: http://www.bailis.org/papers/acidrain-sigmod2017.pdf

                        • ramchip 1 year ago
                          Great article. The `pg_sleep` is a nice trick to test for concurrency issues in postgres.
                          • jb1991 1 year ago
                            I sometimes wonder if the database would be more widely used if they had picked a different name.
                            • big_whack 1 year ago
                              My guess is orgs that have a real need for a distributed SQL database (which few orgs actually have) will make their choices based on requirements analysis rather than naming.

                              So probably if you would be turned off enough by the name not to use the software, you don't actually need a distributed SQL database and are not the target customer.

                              • bradleyjg 1 year ago
                                The flip side of that is that businesses are more than okay with selling their products to organizations for which it’s overkill.

                                Gyms know that people that sign up on Jan 2 don’t need annual (or any) membership. They take their money anyway. The more honorable ones allow cancellation or something but I don’t think anyone expects them to actively turn these folks away.

                                • michae2 1 year ago
                                  To put it somewhat irreverently: running large production databases is not for the squeamish.
                                • brianmcc 1 year ago
                                  I think a lot of people miss the "Cockroaches survive everything" implication.

                                  It's meant to imply indestructability, not creepy yucky beastliness.

                                  Obviously that goes over the heads of many people initially but at least it can then be explained!

                                  • swsieber 1 year ago
                                    I don't wonder. It would be more widely considered, at the very least, judging by interactions I've had with a few people.
                                    • dgfitz 1 year ago
                                      I couldn’t imagine suggesting cockroachDB in a planning meeting for a new product. Especially when over half the room isn’t engineering. Just won’t happen.
                                      • duped 1 year ago
                                        Why are you wasting the time of half the room talking about technical decisions they wouldn't understand?
                                    • echelon 1 year ago
                                      You know how "master", "slave", "blacklist", etc. cause bad feelings in a lot of people? So do terms like "roach", "cock", "cockroach", "maggot", etc.

                                      They could have named this waterbear or adamantium or something.

                                      • Exuma 1 year ago
                                        the worst name in existence
                                          • brianmcc 1 year ago
                                            I think Mongo is way worse tbh
                                            • Exuma 1 year ago
                                              lol like blazing saddles
                                        • CurtHagenlocher 1 year ago
                                          UPDATE player set name = 'Stonebraker' where name = 'Stonebreaker'
                                        • qazxcvbnm 1 year ago
                                          Which of these strategies does Postgres use under Repeatable Read?
                                          • michae2 1 year ago
                                            PG only uses EvalPlanQual under read committed isolation. Under repeatable read the first update fails with a "could not serialize" error, just as it does under serializable.
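
                                              Concretely, with the example from the post (a sketch; the swap commits in another session while this one sleeps):

                                                BEGIN ISOLATION LEVEL REPEATABLE READ;
                                                WITH sleep AS (SELECT pg_sleep(5))
                                                UPDATE player SET level = 'AA' FROM sleep WHERE team = 'Gophers';
                                                -- ERROR: could not serialize access due to concurrent update
                                                ROLLBACK;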
                                          • slavabobik 1 year ago
                                              How do you incorporate SELECT FOR UPDATE locking into Raft? Also maybe slightly off topic, but how does CockroachDB deal with read-only queries and the raft log?
                                            • arulajmani 1 year ago
                                              Prior to building read committed, CRDB only supported intents as a form of replicated locks. Intents are provisional values written by active transactions which are resolved once the writing transaction commits or aborts.

                                              As part of building read committed, we introduced other forms of replicated locks. In particular, replicated shared and exclusive locks (SELECT FOR UPDATE statements acquire exclusive locks). Locks acquired by read committed transactions are always replicated.

                                              The unit of replication in CockroachDB is called a range, and every range is its own raft group. Every range has a reserved portion of its keyspace where locks can be stored. This is known as the replicated lock table. Incorporating exclusive/shared locks in the replicated lock table involved adding a notion of lock strength to all locks that were stored (as previously they were all assumed to be intents).
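
                                                At the SQL level, the locking reads look like this (a sketch; per the above, locks taken by a read committed transaction are replicated):

                                                  BEGIN;
                                                  SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
                                                  SELECT * FROM player WHERE id = 4 FOR SHARE;   -- shared lock
                                                  SELECT * FROM player WHERE id = 4 FOR UPDATE;  -- exclusive lock
                                                  COMMIT;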

                                            • DanielBen6 1 year ago
                                              [flagged]
                                              • pelagicAustral 1 year ago
                                                [flagged]
                                                • orthecreedence 1 year ago
                                                  Look, every single time this database is mentioned here, this exact comment comes up. Every. Time. For years, and years, and years. And yet, true to its name, the database and the team building it are still around.

                                                  At some point we've got to realize the CRDB team is comfortable with the name, it's not going to change, and all comments lambasting it are just farts in the wind.

                                                  • javawizard 1 year ago
                                                    I'm relatively ambivalent on the name issue specifically, but I do think this has been hashed over ad nauseam to the point of making for boring reading. This line from the HN guidelines:

                                                    > Please don't pick the most provocative thing in an article or post to complain about in the thread. Find something interesting to respond to instead.

                                                    stands out as particularly relevant.

                                                    • jrockway 1 year ago
                                                      I think the idea is that after a nuclear war, the only two things left are your data and some cockroaches.
                                                      • unethical_ban 1 year ago
                                                        I scoffed at first, but I guess cockroach sounds to some what "maggot" sounds like to me.

                                                        Like the other person said, it's related to the survivability of cockroaches. One can hope for a management tier competent enough to evaluate an internal database on the merits.

                                                        • jchw 1 year ago
                                                          Sometimes, I like to open The GIMP so I can take the logo for the Coq proof assistant, export it and commit it into the HEAD of my Git repository.

                                                          I think the weird names are fine, it's the management that needs fixing.

                                                          • nine_k 1 year ago
                                                            If not for git, you might have to commit to Subversion! And then back up the whole thing with Duplicity.

                                                            I'm not a fan of such sense of humor, and prefer neutral monikers like rsync, or Xfce, or Linux, but the ship has sailed.

                                                            It just gets normalized and the edginess erased with time and daily use. Imagine that somebody would suggest that you have to wash your hair with some "sham poo". They would be laughed out of the room, had not this word existed for centuries.

                                                          • anonzzzies 1 year ago
                                                            You name tools to your management? They won’t know what anything is anyway? The word database (some sort of Excel!) is usually an issue already. Why would you go into that much detail; just don’t?
                                                            • schmookeeg 1 year ago
                                                              Did Mongo get slagged in their early days because of the unflattering Blazing Saddles movie reference?

                                                              Call it whatever you want. Engineers should be able to select their tools on merit, not marketing.

                                                              • re-thc 1 year ago
                                                                Do you use Git? Does management know what it is?
                                                                • verticalscaler 1 year ago
                                                                  Or imagine a name like that for a band or car for the masses? Gawfah and hurmph indeed.
                                                                  • anamax 1 year ago
                                                                    Have you tried?
                                                                    • snapcaster 1 year ago
                                                                      Really? is your management stupid?
                                                                      • vcryan 1 year ago
                                                                        Ha! Do you think this is uncommon? The path to becoming a manager in many organizations is biased towards those who are good at politics, loyal, a friend, and charismatic—all sorts of things that don't ensure effective approaches to technology selection.
                                                                    • npstr 1 year ago
                                                                      Idk, to me this looks like a data modeling issue. There should be a team table that contains team-specific data such as the skill level; then these two queries wouldn't run into any problems.
                                                                      • nickpeterson 1 year ago
                                                                        Generally the best way to surmount a hard problem is to solve a different one.