How a little bit of TCP knowledge is essential

299 points by dar8919 9 years ago | 41 comments
  • Animats 9 years ago
    That still irks me. The real problem is not tinygram prevention. It's ACK delays, and that stupid fixed timer. They both went into TCP around the same time, but independently. I did tinygram prevention (the Nagle algorithm) and Berkeley did delayed ACKs, both in the early 1980s. The combination of the two is awful. Unfortunately, by the time I found out about delayed ACKs, I had changed jobs, was out of networking, and doing a product for Autodesk on non-networked PCs.

    Delayed ACKs are a win only in certain circumstances - mostly character echo for Telnet. (When Berkeley installed delayed ACKs, they were doing a lot of Telnet from terminal concentrators in student terminal rooms to host VAX machines doing the work. For that particular situation, it made sense.) The delayed ACK timer is scaled to expected human response time. A delayed ACK is a bet that the other end will reply to what you just sent almost immediately. Except for some RPC protocols, this is unlikely. So the ACK delay mechanism loses the bet, over and over, delaying the ACK, waiting for a packet on which the ACK can be piggybacked, not getting it, and then sending the ACK, delayed. There's nothing in TCP to automatically turn this off. However, Linux (and I think Windows) now have a TCP_QUICKACK socket option. Turn that on unless you have a very unusual application.
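
    (For reference, a minimal Linux-only sketch of turning that option on; the helper name is mine, and note that per tcp(7) the flag is not permanent, so re-arm it after reads if you need it sustained:)

        #include <netinet/in.h>   /* IPPROTO_TCP */
        #include <netinet/tcp.h>  /* TCP_QUICKACK (Linux-specific) */
        #include <sys/socket.h>   /* setsockopt() */

        /* Ask the kernel to send ACKs immediately instead of delaying them. */
        static int enable_quickack(int fd) {
            int on = 1;
            return setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &on, sizeof(on));
        }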

    Turning on TCP_NODELAY has similar effects, but can make throughput worse for small writes. If you write a loop which sends just a few bytes (worst case, one byte) to a socket with "write()", and the Nagle algorithm is disabled with TCP_NODELAY, each write becomes one IP packet. This increases traffic by a factor of 40, with IP and TCP headers for each payload. Tinygram prevention won't let you send a second packet if you have one in flight, unless you have enough data to fill the maximum sized packet. It accumulates bytes for one round trip time, then sends everything in the queue. That's almost always what you want. If you have TCP_NODELAY set, you need to be much more aware of buffering and flushing issues.
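
    (To illustrate: with TCP_NODELAY set, coalescing small pieces into a single write() yourself avoids the tinygram blow-up; a rough sketch, names mine:)

        #include <string.h>   /* memcpy */
        #include <unistd.h>   /* write */

        /* Send two small pieces as one write() so they likely share one
           segment, instead of two write() calls producing two packets
           when TCP_NODELAY is set. */
        static ssize_t send_coalesced(int fd, const char *a, size_t alen,
                                      const char *b, size_t blen) {
            char buf[1024];
            if (alen + blen > sizeof(buf))
                return -1;                  /* keep the sketch simple */
            memcpy(buf, a, alen);
            memcpy(buf + alen, b, blen);
            return write(fd, buf, alen + blen);
        }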

    None of this matters for bulk one-way transfers, which is most HTTP today. (I've never looked at the impact of this on the SSL handshake, where it might matter.)

    Short version: set TCP_QUICKACK. If you find a case where that makes things worse, let me know.

    John Nagle

    • tptacek 9 years ago
      How did you end up switching from networking to animating falling bodies? What are you doing these days?

      I wish you hadn't signed your comment, so we could have had the "I am John Nagle" moment when someone inevitably tried to pedantically correct you. :)

      • Animats 9 years ago
        "How did you end up switching from networking to animating falling bodies? What are you doing these days?"

        Ford Aerospace got out of networking, then computer science, then closed the Palo Alto facility. I was out long before then.

        What am I doing now? Robotics, again. Most recent GitHub commit: [1]

        [1] https://github.com/John-Nagle/uarm_util

      • masklinn 9 years ago
        > The combination of the two is awful.

        Apparently Greg Minshall proposed tinygram prevention alternations 15 years ago to fix the problematic interaction: https://tools.ietf.org/html/draft-minshall-nagle-01

        OSX seems to have implemented this in 2007 and to be less sensitive (or not sensitive at all) to the issue. (Minshall's variant only delays a small segment while an earlier small segment is still unacknowledged.) E.g. http://neophob.com/2013/09/rpc-calls-and-mysterious-40ms-del... notes that there was no delay on OSX:

        > it took around 40ms until my application gets the data. I tested the application on a regular Linux system (Ubuntu) with the same result, so it's not an RPi limitation. On my OSX MacBook Air, however, the RPC call needed only 3ms!

        • masklinn 9 years ago
          alterations not alternations, of course
        • jvns 9 years ago
          One thing that confuses me is -- are ACK delays part of the default TCP implementation on Linux? I originally assumed this was some kind of edge case / unusual behavior.
        • Dylan16807 9 years ago
          It still seems wasteful to wait an entire round trip before sending, rather than 1/4 round trip or so.
        • barrkel 9 years ago
            This is a general problem of leaky abstractions. If you're a top-down thinker, you're going to have a bad time some day, and a hard time figuring out why.

          OTOH bottom up thinkers take much longer to become productive in an environment with novel abstractions.

          Swings and roundabouts. Top down is probably better in a startup context - it's more conducive to broad and shallow generalists. Bottom up is great when you have a breakdown of abstraction through the stack, or when you need a new solution that's never been done quite the same way before.

          • dunkelheit 9 years ago
              Careless piling of layers atop layers is the main reason why everything is slow even though computers are crazy fast. Every moderately complex piece of software is so inefficient that it is better not to think about it, or else you become paralyzed in horror ;)

            Usually something is done to mitigate these inefficiencies only when they become egregious. And that is when even basic knowledge of the inner workings of underlying layers really pays off (see also: mechanical sympathy).

            • kuschku 9 years ago
                I am currently writing a client for a synchronized application system, and you only really notice how it's layers upon layers when you write custom functions to serialize/deserialize primitive data types to a raw socket, then at the next layer up can abstract over that, writing objects into a HashMap first and using the HashMap serializer to send the actual object. And then you go yet another layer higher and use reflection to automatically sync method calls.

              It’s really crazy to think about it.

            • the8472 9 years ago
              Probably good to have both around. One can meet in the middle.
            • jfb 9 years ago
              I really enjoy reading Julia's blog. Not only does she have a real, infectious enthusiasm for learning; not only is the blog well written; but I also often learn a lot. Kudos.
              • bufordsharkley 9 years ago
                  Yeah, I was going to post something to the same effect. Her posts (and videos) really drive home not just the how, but the WHY people should care, and her writing is lively and clear. I really love her ambition to learn all these things, share them with the world, and make it inclusive for folks of all experience levels.
              • p00b 9 years ago
                  John Rauser of Pinterest recently gave a wonderful talk about TCP and the lower bound of Internet latency that has a lot in common with what's discussed in the article here. Worth a watch, I think, if you enjoyed the blog post.

                https://www.youtube.com/watch?v=C8orjQLacTo

                • PeterWhittaker 9 years ago
                    Summary: If you learn a little, you realize that each packet might be separately acknowledged before the next one is sent. In particular, note this quote: "Net::HTTP doesn't set TCP_NODELAY on the TCP socket it opens, so it waits for acknowledgement of the first packet before sending the second."

                  By setting TCP_NODELAY, they removed a series of 40ms delays, vastly improving performance of their web app.
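
                    (For the curious, this is roughly what "setting TCP_NODELAY" amounts to at the C socket level; a minimal sketch, helper name mine:)

                        #include <netinet/in.h>   /* IPPROTO_TCP */
                        #include <netinet/tcp.h>  /* TCP_NODELAY */
                        #include <sys/socket.h>   /* setsockopt() */

                        /* Disable Nagle's algorithm so each write() can go out
                           without waiting for the previous packet's ACK. */
                        static int disable_nagle(int fd) {
                            int one = 1;
                            return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
                        }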

                  • colanderman 9 years ago
                    You don't need to entirely disable Nagle; just flash TCP_NODELAY on then off immediately after sending a packet for which you will block for a reply. This way you still get the benefit Nagle brings of coalescing small writes, without the downside.

                    (Alternatively, turn Nagle off entirely and buffer writes manually or using MSG_MORE or TCP_CORK.)
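
                      (A rough sketch of that on/off flash, assuming Linux, where setting TCP_NODELAY also pushes any queued data; helper name mine:)

                          #include <netinet/in.h>
                          #include <netinet/tcp.h>
                          #include <sys/socket.h>

                          /* Call right after the last write() of a request you will
                             block on: flipping TCP_NODELAY on flushes pending data,
                             flipping it back off restores Nagle so later small
                             writes still coalesce. */
                          static void flush_then_renagle(int fd) {
                              int on = 1, off = 0;
                              setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
                              setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &off, sizeof(off));
                          }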

                    • dantiberian 9 years ago
                      I came across this just this week, working on the RethinkDB driver for Clojure (https://github.com/apa512/clj-rethinkdb/pull/114). As soon as I saw "40ms" in this story I thought "Nagle's algorithm".

                      One thing I haven't fully understood is that this only seems to be a problem on Linux; Mac OS X didn't exhibit this behaviour.

                    • bboreham 9 years ago
                      Why wouldn't an http client library turn off Nagle's algorithm by default?
                      • neduma 9 years ago
                        Can profiling with Wireshark/Riverbed (application perf tools) help solve these kinds of problems?
                        • spydum 9 years ago
                          Wireshark can show you the delay, but it won't tell you why it's there. You might assume it's some quirk of your application... Most people don't consider the kernel/network libraries and drivers... Those are all black magic.
                          • jvns 9 years ago
                            wireshark would totally help!
                          • rjurney 9 years ago
                            In high school I carried TCP/IP Illustrated around with me like a bible. I cherished that book. Knowledge of networks would eventually prove incredibly useful throughout my career.
                            • mwfj 9 years ago
                              This can be generalised. It is also one of my favorite ways of doing developer interviews. Do they have a working/in-depth knowledge of what keeps the interwebs running? So many people have never ventured out of their main competence bubble, and that bubble can be quite small (but focused, I suppose).

                              For all I know, they believe everything is kept together with the help of magic. I guess I don't trust people who don't have a natural urge to understand at least the most basic things of our foundations.

                              • hueving 9 years ago
                                I used to think this way before I realized it's just an arbitrary hoop you make people jump through. To you, understanding TCP might be a basic foundation. However, it's just about as arbitrary as asking people to explain 802.11 RTS/CTS or Clos switch fabrics, which are both equally important to delivering day-to-day network traffic. Additionally, they both can come up as things you need to understand when trying to optimize jitter/latency in sensitive local network traffic applications.

                                Don't judge people based on which components of networks they happened to take an interest in and dive into.

                                • xxpor 9 years ago
                                   Part of the "problem" is that networks (even networks that span the globe) are too reliable these days. If a developer is developing network services, all they see is an input stream and an output stream. And that's all that matters 99% of the time. But add 0.1% loss, and all hell breaks loose, because people don't understand the implications and rules governing the underlying protocols.
                                  • staticmalloc 9 years ago
                                    I learned a lot about TCP (including delayed acks), the basics of 802.11, and the basics of switching fabrics in my undergrad networking class, so I wouldn't say it's totally unrealistic for someone to talk a little about those topics, depending on the role.
                                    • sokoloff 9 years ago
                                      On the other extreme, I've seen people ask why is that IP address divided by 16? (10.0.0.0/16)
                                        • jacquesm 9 years ago
                                          And then you have to explain to them that it's divided by 2^16...
                                    • Ono-Sendai 9 years ago
                                      This is my proposed solution to this kind of problem: Sockets should have a flushHint() API call: http://www.forwardscattering.org/post/3
                                      • Animats 9 years ago
                                        Look up the history of the PUSH bit in TCP.
                                        • Ono-Sendai 9 years ago
                                          OK, and is there a portable way to set this PUSH bit in the sockets API? The semantics seem a little different, too, since the PUSH bit seems to do something on the receiver side as well.
                                        • daurnimator 9 years ago
                                          Sounds exactly like what TCP_NODELAY does on Linux.

                                          Note the final sentence from tcp(7):

                                              TCP_NODELAY
                                                        If  set,  disable the Nagle algorithm.  This means that segments are always sent as soon as possible, even if there is
                                                        only a small amount of data.  When not set, data is buffered until there is a sufficient amount to send  out,  thereby
                                                        avoiding  the  frequent  sending  of  small packets, which results in poor utilization of the network.  This option is
                                                        overridden by TCP_CORK; however, setting this option forces an explicit flush of pending output, even if  TCP_CORK  is
                                                        currently set.
                                          • Ono-Sendai 9 years ago
                                            There is some overlap, yes. TCP_CORK is a mode, however. It's silly to introduce the complexity of extra state when a single method call (flushHint()) would suffice.

                                            My proposed flushHint() is also quite different to TCP_NODELAY. Let's say you do 100 writes of 1 byte to a socket. If TCP_NODELAY is set, 100 packets would be sent. However if you do 100 writes to the socket, then one flushHint() call, only one packet would be sent.

                                            • daurnimator 9 years ago
                                              > There is some overlap, yes. TCP_CORK is a mode however. It's silly to introduce the complexity of extra state when a single method call (flushHint()) would suffice.

                                              It is a single call. Note that last sentence from the man page entry: "setting this option forces an explicit flush of pending output, even if TCP_CORK is currently set."

                                              When TCP_CORK is on (turn it on once at socket creation time), the following code is the implementation of your flushHint function:

                                                    /* With TCP_CORK held on, setting TCP_NODELAY forces a flush of
                                                       pending output (see tcp(7)); needs <sys/socket.h> and <netinet/tcp.h>. */
                                                    int flushHint(int fd) {
                                                        return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &(int){ 1 }, sizeof(int));
                                                    }
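
                                                (Usage would look roughly like this; assumes <unistd.h> for write(), the buffer/length names are hypothetical, and error handling is omitted:)

                                                    int on = 1;
                                                    setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on)); /* once, at creation */
                                                    write(fd, header, header_len);  /* small write: stays queued */
                                                    write(fd, body, body_len);      /* still queued */
                                                    flushHint(fd);                  /* everything goes out together */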