Race Conditions Can Be Useful for Parallelism

85 points by g0xA52A2A 2 years ago | 68 comments
  • bheadmaster 2 years ago
    Slight nitpick - the definition of "race condition" on Wikipedia [0] is:

        [...] the condition of an electronics, software, or other system where the system's substantive behavior is dependent on the sequence or timing of other uncontrollable events
    
    If we take the first example - Parallel BFS - the correctness of the output could be considered "system's substantive behavior". Properly implemented atomic checks (as demonstrated) would still guarantee to lead to correct output in all possible combinations of events. Therefore, the system's "substantive behavior" is not dependent on the sequence or timing of other uncontrollable events. Therefore, there is no "race condition" involved.

    Of course, the term "race condition" here is taken colloquially, for the sake of familiarity to the reader - the article has correctly recognized that the appropriate term for this kind of behavior is "non-determinism".

    [0] https://en.wikipedia.org/wiki/Race_condition

    • chrisseaton 2 years ago
      I don't know where Wikipedia got this 'substantive behaviour' requirement from, but for example it explicitly isn't part of the definition in the industry's definitive reference for parallelism - Padua.

      > A race condition occurs in a parallel program execution when two or more threads access a common resource, e.g., a variable in shared memory, and the order of the accesses depends on the timing, i.e., the progress of individual threads.

      > The disposition for a race condition is in the parallel program. In different executions of the same program on the same input access events that constitute a race can occur in different order, which may but does not generally result in different program behaviors (non-determinacy).

      Sometimes you deliberately program a full-on data race (which isn't a bug by definition, as the article says) for performance reasons.

      > Data races that are not manifestations of program bugs are called benign data races.

      • ajross 2 years ago
        That definition seems to conflate "determinism" (something largely impossible to achieve in asynchronously parallel systems) with "correctness" (an abstracted property of a system that doesn't have anything to do with determinism per se). It just doesn't seem useful.

        No, in overwhelmingly common usage, programmers use the term "race condition" as a category of software bug. We mean it in the correctness sense, not the one used in the linked article nor your reference. You'd be met with some very weird stares if you tried to explain how arbitrary SMP ordering of log entries or whatever was a "race condition".

        • bheadmaster 2 years ago
          > No, in overwhelmingly common usage, programmers use the term "race condition" as a category of software bug

          This was exactly my point.

              race condition = incorrect behavior
              non-determinism = correct behavior
          
          Language is tricky sometimes.
          • chrisseaton 2 years ago
            > That definition seems to conflate "determinism" (something largely impossible to achieve in asynchronously parallel systems)

            There are plenty of examples of entirely deterministic parallel models. Fork-join for one.

          • User23 2 years ago
            It’s worth noting that algorithms can be designed correct under nondeterministic execution. For example, quicksort is correct with a randomly selected pivot. And for associative and commutative functions it doesn’t matter what order they’re executed in the final result is always the same.

            Dijkstra’s guarded commands don’t specify an order for the conditional. The semantics is that the process is free to execute any one of the cases that has a true guard. Nevertheless he found them useful for developing and describing many algorithms.

            • tonyarkles 2 years ago
              > And for associative and commutative functions it doesn’t matter what order they’re executed in the final result is always the same.

              Not directed at you specifically, but just a reminder for anyone who hasn’t been burned by it yet: floating point addition is not associative. (a+b)+c != a+(b+c). It’s close, but if you’re not careful you can get bad results where the accumulated small errors turn into wrong answers.

              • ajross 2 years ago
                > It’s worth noting that algorithms can be designed correct under nondeterministic execution.

                In fact the whole concept of "symmetric multiprocessing" demands it.

              • remram 2 years ago
                By this definition, every access to a shared resource is a race condition, e.g. even when properly acquiring a lock. It is common knowledge that you introduce locks to remove race conditions so I would say something is definitely missing from the definition.
                • asveikau 2 years ago
                  I think "remove" the race condition is the wrong word. It should be "resolve" the race conditions.

                  The race exists because there are multiple accesses. This is resolved when there is a protocol for deciding who proceeds and who waits for the other.

                  • chrisseaton 2 years ago
                    Usually you’re adding locks not to remove the race condition but to make them not a bug.
                  • bheadmaster 2 years ago
                    Hey, I've just downloaded PADUA (http://dx.doi.org/10.1145/2633685), and skimming through it, I can't find a single mention of the phrase "race condition".

                    Is this the paper you're referring to? If not, could you please provide a reference to which PADUA you're referring to? I'd really like to read more on the subject, especially if the source is, as you claim, an industry reference.

                • shwestrick 2 years ago
                  Author here. I think it would be very strange to say that this code does not have a race condition. The whole point of the term is to identify circumstances where non-deterministic timing of events influences how you reason about correctness, which is exactly what we're doing here.
                  • bheadmaster 2 years ago
                    I've personally only heard the term "race condition" used to refer to bugs that have their source in non-deterministic execution of programs. In most cases, they refer to a specific sequence of events that the programmer did not foresaw, which lead to incorrect computation.

                    Using the term "race condition" in context of correct programs would make it cover exactly the same universe of programs as the term "non-determinism". I that think the distinction, however trivial (race condition = incorrect behavior, non-determinism = correct behavior), is still useful.

                    Great article, by the way. I did not mean to criticize it in any way. My "slight nitpick" about meaning of the words is really just that - a nitpick :)

                    • shwestrick 2 years ago
                      Ah I see. That's a fair point!

                      When talking about this kind of stuff to people who are unfamiliar with, say, lock-freedom, I've found that "non-determinism" is too vague --- people start thinking about things like randomness, or user interaction, etc. In contrast, the term "race condition" seems to hit the nail on the head.

                      But certainly, "race condition" also carries with it a bit of baggage ;)

                      • throwawaymaths 2 years ago
                        I work in the Erlang VM, and we always say stuff like "whether your output to IO hits first ot the logger line outputs to IO first is a race condition". Neither is substantively wrong, and you could definitely say "it's a race!", grab some popcorn, and watch the electrons do their thing.
                      • kazinator 2 years ago
                        If we put on a cynical hat, your page reads like "I've never heard of lock-free algorithms based on atomic operations. I therefore must have just invented it and I get to name it: how about beneficial use of race conditions?"
                        • shwestrick 2 years ago
                          Whoa, uhh, I mean, that's an extremely unfair and inaccurate characterization.
                      • worewood 2 years ago
                        Why not use the right term, then? "Coloquial" shouldn't be a thing with technical terms.

                        Race conditions are hard enough to explain to people and misusing the term just makes it more difficult.

                        • mizzao 2 years ago
                          Interesting, it's almost like one could call this a "randomized algorithm" (which we know can be faster than deterministic algorithms) but the non-determinism comes from the input data and code execution rather than a RNG.
                          • 2 years ago
                          • squeaky-clean 2 years ago
                            > This article needs additional citations for verification. (July 2010).

                            The definition you quote has no linked citation on Wikipedia. Usually a good sign that you should not treat those statements as definitive. A good Wikipedia article should not state any "facts" without a direct means of verification. Otherwise it's considered "original research" and against the wiki policy for a high quality article.

                            https://en.m.wikipedia.org/wiki/Wikipedia:No_original_resear...

                            • bheadmaster 2 years ago
                              I used Wikipedia in order to have some kind of reference, but I was fairly sure of the meaning beforehand.

                              Searching the internet for "race condition definition" and taking the top few results brings several definitions that all agree in spirit with the Wikipedia one (see below).

                              If you know of any more reliable source that doesn't agree with Wikipedia on the definition of "race condition", please post it here. This is a honest request - I am always grateful to those who correct my mistakes (in good faith).

                                  wordnik [0]:  A flaw in a system or process whereby the output or result is unexpectedly and critically dependent on the sequence or timing of other events.
                              
                                  techtarget [1]: A race condition is an undesirable situation that occurs when a device or system attempts to perform two or more operations at the same time, but because of the nature of the device or system, the operations must be done in the proper sequence to be done correctly.
                              
                                  techterms [2]: A race condition occurs when a software program depends on the timing of one or more processes to function correctly.
                              
                                  javatpoint [3]: When the output of the system or program depends on the sequence or timing of other uncontrolled events, this condition is called Race Condition.
                              
                                  technopedia [4]: A race condition is a behavior which occurs in software applications or electronic systems, such as logic systems, where the output is dependent on the timing or sequence of other uncontrollable events.
                              
                              
                              [0] https://www.wordnik.com/words/race%20condition

                              [1] https://www.techtarget.com/searchstorage/definition/race-con...

                              [2] https://techterms.com/definition/race_condition

                              [3] https://www.javatpoint.com/what-is-race-condition

                              [4] https://www.techopedia.com/definition/10313/race-condition

                              • squeaky-clean 2 years ago
                                > If you know of any more reliable source that doesn't agree with Wikipedia on the definition of "race condition", please post it here.

                                Other commenters have already covered that. The links you shared aren't good primary sources, and probably took their definition from Wikipedia itself, creating a circular reference issue.

                            • 2 years ago
                            • compressedgas 2 years ago
                              This reminds me of Kuper and Newton's LVars paper that introduces lattice variables in Haskell: https://users.soe.ucsc.edu/~lkuper/papers/lvars-fhpc13.pdfhttp://dx.doi.org/10.1145/2502323.2502326
                              • shwestrick 2 years ago
                                Yes! It's a very similar idea. If I remember correctly, LVars are restricted enough to enforce determinism statically, which is quite nice.
                              • layer8 2 years ago
                                > I’m not talking about data races. Data races are typically bugs, by definition.

                                One notable exception is the Racy Single-Check Idiom: http://javaagile.blogspot.com/2013/05/the-racy-single-check-...

                                It is particularly suitable for lazy initialization in code that is typically (but not necessarily) executed single-threaded, and is famously used in Java’s String.hashCode() implementation.

                                • shwestrick 2 years ago
                                  That's a nice example. It seems that data races in Java don't "catch fire"; is that correct? The catch-fire problem is pretty bad for languages like C/C++, which have undefined behavior for data races, and in this sense data races are "bugs by definition" in those languages.
                                  • kaba0 2 years ago
                                    Java’s primitives and references are guaranteed to be “tear-free”, which guarantees no “out-of-thin-air” values. So a field set to 1 and being written by several threads to 2 and 3 can only ever be observed as 1,2 or 3, no other value. Is that what you mean under not catching fire?
                                    • shwestrick 2 years ago
                                      Perfect. Yes, that's exactly right -- if the language semantics is able to guarantee a set of possible values for a data-racy read, then it doesn't catch fire.

                                      The catch-fire terminology comes from the analogy that, as soon as a data race occurs, the semantics of the program completely explodes, and all guarantees are lost---the program is then allowed to do literally anything. This is sometimes known as "DRF-SC or catch fire": either the program is data-race-free (and therefore its executions are sequentially consistent), or the program has undefined behavior.

                                      Infamously, the C memory model has the catch-fire problem. And therefore, any language which relies on the C memory model can catch-fire. As of today, I believe this includes C/C++, Swift, Rust, and probably a few others.

                                      • moonchild 2 years ago
                                        > Java’s primitives and references are guaranteed to be “tear-free”, which guarantees no “out-of-thin-air” values

                                        Tear-free does not imply no out-of-thin-air. But, afaik, the java memory model protects from both tearing and oota.

                                        • oconnor663 2 years ago
                                          I think catching fire here refers to the general "this is UB and all bets are off" situation. For example, if you're data racing against an int on x86-64 Linux, you might reason that the `mov` instruction on that platform can't possibly product torn reads. And you'd be right as far as that goes. But the C/C++/Rust compiler still considers those data races UB and may still do horrible things like just deleting big chunks of your code if it manages to prove that that's what you've done.
                                    • heydenberk 2 years ago
                                      I appreciate a provocative title, but I think the practical lesson in almost all cases is the inverse: fixing race conditions can introduce performance bottlenecks.
                                      • shwestrick 2 years ago
                                        Author here. It all depends on what the goal is. If performance is the goal, then perhaps race conditions can be considered acceptable, if the gains are significant enough.

                                        I would hope that the primary takeaway from this post is that race conditions are not necessarily bugs. Race conditions are not necessarily something that need to be "fixed".

                                      • touisteur 2 years ago
                                        Tell me about non guaranteed order of operations in GPU reductions and floating point results changing slightly between two runs. Yes it's useful and you get the goddamn FP32 TFLOPS, but damn it makes testing, validating, qualifying systems harder. And yes, I know one shouldn't rely and test on equality, but not knowing the actual order of FP operations makes numerical analysis of the actual error harder (just take the worst case of every reduction, ugh).

                                        EDIT: and don't get me started on tensor cores and clever tricks to have them do 'fp32-alike' accuracy. Yes, wonderful magic but how do you reason about these new objects without a whole new slew of tools.

                                        • mgaunard 2 years ago
                                          There is nothing wrong with testing for equality so long as your computation is exact or correctly-rounded.
                                          • nextaccountic 2 years ago
                                            There is if the algorithm contains race conditions that cause non-deterministic output. The submitted article goes above and beyond to guarantee that the code always output the same answer even though it has race conditions. But that's sometimes not possible or, if it's possible, it's too much of a hassle so it's rarely done.

                                            For example, this project https://github.com/EmbarkStudios/texture-synthesis generates textures and if you run the same code with the same input various times, the results will be slightly different. Here https://github.com/EmbarkStudios/texture-synthesis#notes it says: "When using multiple threads for generation, the output image is not guaranteed to be deterministic with the same inputs. To have 100% determinism, you must use a thread count of one"

                                            • mgaunard 2 years ago
                                              I gave conditions upon which testing for equality is correct.

                                              Of course if the result is non-deterministic it doesn't satisfy those conditions.

                                            • touisteur 2 years ago
                                              Yes, I know. For purely sequential code that's the actual use (though sometimes, golden tests are generated through matlab or python, in double precision and then every divergence becomes a game of whack a mole. And don't start me on x87-80 bits extended precision suddenly compiled to SSE, so actual ieee754... We have integrated some of the FP static and dynamic analysis tools in our CI/CD pipeline for new code but ugh...

                                              Anyway, as time passes by I veer off equality and think about the actual necessary accuracy and wish there was a way to set it as a spec for proof (SPARK/Ada or a higher level DSL that can be lowered to proper accuracy analysis tools...

                                              I wish I could also specify 'no NaNs please' as a postcondition. Need to check in with the SPARK team and get an introduction article going...

                                              • mgaunard 2 years ago
                                                There are simple tools that tell you how many of your floating-point digits are just propagated rounding errors.
                                              • chrisseaton 2 years ago
                                                I think some people have gotten the mistaken idea that floating point arithmetic is inherently somehow non-deterministic. It is of course entirely deterministic, and if you do the same FP operations in the same order you will get the same result.
                                                • touisteur 2 years ago
                                                  I was thinking and talking specifically of GPUs and reductions (aka multiple core interacting) during which you don't always know the exact order of operations.

                                                  And also, tell that to the people that went from a compiler using x87 instructions to one using SSE instructions and between two binaries from the same code get different results. Yes, the exact same suite of FP instructions should always give the same results. And that's also supposing you're not loading some library that sets ugly fast-maths flags (see the recent yak-shaving session by @moyix).

                                                  • nextaccountic 2 years ago
                                                    You get the same result if you run on the same architecture and with the same instructions (or if you run on machines that implement IEEE 754-2008 [*], which was the first standard that guaranteed cross-platform determinism for a subset of floating point operations, which means, no SIMD!! =/), and you don't have non-determinism introduced by thread interleaving and race conditions (unless you very carefully account for that, like the article submitted in this thread)

                                                    I wish we had a language that guaranteed that the results of a computation were deterministic, all the while it properly enabled the use of all available hardware resources (so: using SIMD, all CPU cores, and also offloading some code to the a GPU if available), even if it had some overhead. Doing this manually is ridiculously difficult if you want to write high performance software, specially if you use GPUs.

                                                    [*] See https://stackoverflow.com/questions/42181795/is-ieee-754-200... - the amazing Rapier physics engine https://rapier.rs/ leverages IEEE 754-2008 to have a cross-platform deterministic mode that will run physics exactly the same way in every supported platform https://rapier.rs/docs/user_guides/rust/determinism/ - but this means taking a huge performance hit: you can't use SIMD and you must run the physics on a single thread.

                                              • hegelstoleit 2 years ago
                                                If you're just doing BFS why do you care who the parent is? Why not just choose the parent to be the predecessor? I.e if you visit 4 from 1, then 1 is the parent. Why do you need to check a list of potential parents?
                                                • shwestrick 2 years ago
                                                  When visiting vertices in parallel, there might be multiple potential parents that all attempt to visit the same vertex simultaneously. So, we need a way of picking which parent "wins".