We improved the performance of a userspace TCP stack in Go
226 points by infomaniac 1 year ago | 129 comments
- dpeckett 1 year agoReally cool to see others hacking on netstack, bit of a shame it's tied up in the gVisor monorepo (and all the Bazel idiosyncrasies) but it's a very neat piece of kit.
I've actually been hacking on a similar FOSS project lately, with a focus on building what I'm calling a layer 3 service mesh for the edge. More or less came out of my learned hatred for managing mTLS at scale and my dislike for shoving everything through a L7 proxy (insane protocol complexity, weird bugs, and you still have the issue of authenticating you are actually talking to the proxy you expect).
Last week I got the first release of the userspace router shipped, worth taking a look if you want to play around with a completely userspace and unprivileged WireGuard compatible VPN server.
https://github.com/noisysockets/nsh/blob/main/docs/router.md
- iangudger 1 year agoIf you want to use netstack without Bazel, just use the go branch:
https://github.com/google/gvisor/tree/go
go get gvisor.dev/gvisor/pkg/tcpip@go
The go branch is auto generated with all of the generated code checked in.
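For anyone who hasn't used it, here is a rough sketch of what that looks like in practice. It is untested, and the tcpip identifiers change between releases, so treat the exact names as approximate: build a stack with a loopback link and serve TCP on it entirely in userspace via the gonet adapter, with no TUN device and no root required.

```go
// Minimal, untested sketch against a recent gvisor "go" branch; the tcpip
// API is volatile, so some names may have drifted.
package main

import (
	"fmt"
	"log"

	"gvisor.dev/gvisor/pkg/tcpip"
	"gvisor.dev/gvisor/pkg/tcpip/adapters/gonet"
	"gvisor.dev/gvisor/pkg/tcpip/header"
	"gvisor.dev/gvisor/pkg/tcpip/link/loopback"
	"gvisor.dev/gvisor/pkg/tcpip/network/ipv4"
	"gvisor.dev/gvisor/pkg/tcpip/stack"
	"gvisor.dev/gvisor/pkg/tcpip/transport/tcp"
)

func main() {
	s := stack.New(stack.Options{
		NetworkProtocols:   []stack.NetworkProtocolFactory{ipv4.NewProtocol},
		TransportProtocols: []stack.TransportProtocolFactory{tcp.NewProtocol},
	})

	// Loopback link so the example is self-contained; a real deployment
	// would plug in a WireGuard or other packet-level endpoint here.
	const nicID = 1
	if err := s.CreateNIC(nicID, loopback.New()); err != nil {
		log.Fatalf("CreateNIC: %v", err)
	}

	// Address helpers have changed names across gvisor versions.
	addr := tcpip.AddrFrom4([4]byte{10, 0, 0, 1})
	protoAddr := tcpip.ProtocolAddress{
		Protocol:          ipv4.ProtocolNumber,
		AddressWithPrefix: addr.WithPrefix(),
	}
	if err := s.AddProtocolAddress(nicID, protoAddr, stack.AddressProperties{}); err != nil {
		log.Fatalf("AddProtocolAddress: %v", err)
	}
	s.SetRouteTable([]tcpip.Route{{Destination: header.IPv4EmptySubnet, NIC: nicID}})

	// gonet adapts netstack endpoints to the standard net.Listener/net.Conn.
	ln, err := gonet.ListenTCP(s, tcpip.FullAddress{NIC: nicID, Addr: addr, Port: 8080}, ipv4.ProtocolNumber)
	if err != nil {
		log.Fatalf("ListenTCP: %v", err)
	}
	fmt.Println("listening on userspace 10.0.0.1:8080")
	for {
		c, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go func() {
			c.Write([]byte("hello from netstack\n"))
			c.Close()
		}()
	}
}
```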
- dave78 1 year agoI did this once for an experimental project and found it really difficult to keep the version of gVisor I was using up to date, since it seems like the API is extremely volatile. Anyone else had this experience? If so, is there some way around it that I don't know? Or did I just try it at a bad point in the development timeline?
- mort96 1 year agoThat's just how Google operates in my experience... Avoid Google libraries unless absolutely necessary, and if you do adopt Google libraries, be prepared to either be forever multiple years out of date or spend significant resources on keeping it up to date.
- ignoramous 1 year agoThe API is indeed prone to change without notice, but it isn't anything terribly unmanageable.
> really difficult to keep the version of gVisor I was using up to date
For our project, we update gvisor whenever Tailscale does.
- iangudger 1 year agoIt could be that you happened to find a period of rapid change, but it is also possible that you ran into the issue that raggi mentioned in the sibling comment.
- raggi 1 year agohey Ian, long time. Is there any chance y'all could swap out main so that main contains the generated code version?
I don't know the status of those export tools these days as I left the company years ago, but maybe they could be set up to sync the generated code from a different branch.
This would help various folks quite a bit, as for example tsnet users often fall into the trap of trying to do `go get -u`, which then pulls a non-functional gvisor version.
- iangudger 1 year agoI don't work on gVisor anymore. That said, I think it would be a tough sell. It would be a pretty big breaking change. Also, there is already a problem with people trying to send patches against the go branch and making it the default would make that much worse.
I think the solution is an automatically exported repository at a different path. Kind of (or maybe exactly) like what Tailscale/bradfitz used to maintain.
- zxt_tzx 1 year agoI met one of the founders of Coder.com, he's a really cool dude. It's a pity that it is a product aimed more at enterprises than individual developers, else it would have far more developer mindshare.
Unlike, say, GitHub Codespaces, running something like this on your own infra means your incentives and Coder.com's are aligned, i.e. both of you want to reduce your cloud costs (as opposed to, say, GitHub running on Azure gives them an opportunity and incentive to mark up on Azure cloud costs).
- santiagobasulto 1 year agoIt seems like a great product. I'm wondering why they don't offer more "startup-oriented" plans. It's like either Self Hosted or "Talk to sales". Is it maybe to not compete against Github codespaces?
- kylecarbs 1 year agoFounder of Coder here. Many small (or teams at big) companies use Coder for free with <=150 devs just using our open-source.
We’ve tried to align our pricing with the value of the product. In small teams the productivity gains seem to be much lower, so we target Enterprise!
- withinboredom 1 year agoSpeaking of ... https://coder.com/docs -- the "What's next" section is empty.
- wmf 1 year ago"Asking for elevated permissions inside secure clusters at regulated financial enterprises or top secret government networks is at best a big delay and at worst a nonstarter."
But exfiltrating data with a userspace VPN is totally fine?
I'm also wondering why not use TLS.
- tptacek 1 year agoEvery connection you make to a remote service "exfiltrates data". Modern TLS is just as opaque to middleboxes as WireGuard is, unless you add security telemetry directly to endpoints --- and then you don't care about the network anyways, so just monitor the endpoint.
The reason you'd use WireGuard rather than TLS is that it allows you to talk directly to multiple services, using multiple protocols (most notably, things like Postgres and Redis) without having to build custom serverside "gateways" for each of those protocols.
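A small sketch of what that buys you in practice (illustrative only; the `userspaceDialer`/`tnet` names below are stand-ins for whatever userspace stack object you have, e.g. wireguard-go's netstack Net or gvisor's gonet adapter, not a real package): anything that accepts a DialContext hook, like http.Transport or most database and Redis clients, can be pointed straight at the tunnel with no per-protocol gateway.

```go
// Illustrative sketch: route an ordinary HTTP client through a userspace
// TCP/IP stack by swapping in its dialer.
package main

import (
	"context"
	"net"
	"net/http"
)

// userspaceDialer is whatever your userspace stack exposes; wireguard-go's
// netstack.Net and gvisor's gonet adapter both fit this shape.
type userspaceDialer interface {
	DialContext(ctx context.Context, network, addr string) (net.Conn, error)
}

// httpClientOver returns a client whose every request rides the tunnel
// instead of the host network stack.
func httpClientOver(tnet userspaceDialer) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			DialContext: tnet.DialContext,
		},
	}
}

func main() {}
```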
- eqvinox 1 year agoAdding your own network stack to bypass limitations like this works exactly until the point where someone notices that your userspace stack needs to fulfill the same requirements that the host stack does.
And then you're suddenly in a whole world of pain because all of this is driven by a stack of byzantine certifications (half of which, as usual, are bogus, but that doesn't help you), and your network stack has none of them.
(Written from first-hand experience.)
- Honeymari333 1 year agoAny tips? I'm a beginner.
- taeric 1 year agoI think the point was more that doing this as a way to avoid the red tape of getting permission to open a new connection is odd?
- tptacek 1 year agoI understand the impulse, but I think it misconstrues the "red tape" this method avoids. It's sidestepping a quirky OS limitation, which dates back to an era of "privileged ports" and multi-user machines. It's not really sidestepping any sort of modern policy boundary. For instance: you could do the exact same thing with WebSockets (and people do).
- anyfoo 1 year agoYou can't control what information flows through an outbound connection, not even in trivial cases. Even if you straight go ahead and say "I allow you to make this connection, but I'm not even allowing you to send any data", you have timing sidechannels to deal with. In any more reasonable case, an almost infinite number of things can be used to exfiltrate any data you want, even if you think you have not only full application-level inspection, but even application-level rewrite.
Pretty much the only thing you can do is somewhat filter out known-bad, not directly motivated outbound traffic, such as malware payloads with very clear signatures. This only works if it's "not directly motivated", because as soon as there's a person who wants to do it, they can skirt around it again.
- raggi 1 year agofwiw, you technically don't need a privileged container to use tun, you just need suitable permissions on the kernel tun interfaces.
- tazjin 1 year agoYeah, the optimisations are cool of course, but (maybe due to being unfamiliar with the tool?!) I didn't understand why they can't just `listen(2)`.
- vlovich123 1 year agoIt’s answered in the opening paragraph although I’ll admit I’m still unclear.
> We are committed to keeping your data safe through end-to-end encryption and to making Coder easy to run across a wide variety of systems from client laptops and desktops to VMs, containers, and bare metal. If we used the TCP implementation in the OS, we’d need a way for the TCP packets to get from the operating system back into Coder for encryption. This is called a TUN device in unix-style operating systems and creating one requires elevated permissions, limiting who can run Coder and where. Asking for elevated permissions inside secure clusters at regulated financial enterprises or top secret government networks is at best a big delay and at worst a nonstarter.
The specific part that’s unclear is why encryption needs to be applied at the TCP layer and at that point if they need it at the transport layer why they’re not using something like QUIC which has a much more mature user-space implementation.
- dpeckett 1 year agoI think the key insight behind this approach (and I'm biased here having written something similar) is that the difference between QUIC and (wireguard + network stack) is A LOT less than you might think.
- Xelynega 1 year agoI'm confused on why they would need a TUN device for a client or server application, so why they would need this solution in the first place(even with their explanation).
As I understand the only reason you'd use a TUN interface is if you want to send/receive raw IP packets. Their marketing doesn't make it very clear what their product does, but I can't see a reason it would need to send/receive raw IP packets rather than TCP/UDP packets over a specific port...
- cricketlover 1 year agoAgree. Very unclear why they won't simply use a secure socket or why a user space tunnel will be needed.
I surmise that the reason might be that a user space tunnel might be faster (like maybe they can do UDP over TCP or something to gain speed improvements).
Good post nevertheless.
- immibis 1 year agoOr TLS. It seems to be a remote cloud desktop type of product, so why not use TLS like every other one?
- neonsunset 1 year agoThe quote - is this yet another issue caused by abysmal FFI overhead in Go?
- parhamn 1 year agoI don't know anything about Coder, but Gvisor proliferation is annoying. It's a boon for cloud providers, helping them find another way to get a large multiple performance decrease per dollar spent in exchange for questionable security benefits. And I'm seeing it everywhere now.
- weitendorf 1 year agoI don't understand - what do you suggest as an alternative to Gvisor?
> large multiple performance decrease per dollar spent
Gvisor helps you offer multi-tenant products which can be actually much cheaper to operate and offer to customers, especially when their usage is lower than a single VM would require. Also, a lot of applications won't see big performance hits from running under Gvisor depending on their resource requirements and perf bottlenecks.
- parhamn 1 year ago> I don't understand - what do you suggest as an alternative to Gvisor?
The performance docs you linked claim, vs runc: 20-40x syscall overhead, half of Redis' QPS, and a 20% increase in runtime in a sample TensorFlow script. Also google "Cloud Run slow" and "Digital Ocean Apps slow", both are gVisor.
Literally anything else.
- amscanne 1 year agoI was the original author of that performance guide, a decent while ago now. I tried to lay out the set of performance trade-offs in an objective and realistic way. It is shocking to me that you're spending so much time commenting on a few figures from there, ostensibly without reading it.
System call overhead does matter, but it’s not the ultimate measure of anything. If it were, gVisor with the KVM platform would be faster than native containers (looking at the runsc-kvm data point which you’ve ignored for an unknown reason). But it is obviously more complex than that alone. For example, let’s click down and ask — how is it even possible to be faster? The default docker seccomp profile itself installs an eBPF filter that slows system calls by 20x! (And this path does not apply within the guest context.) On that basis, should you start shouting that everyone should stop using Docker because of the system call overhead? I would hope not, because looking at any one figure in isolation is dumb — consider the overall application and architecture. Containers themselves have a cost (higher context switch time due to cgroup accounting, costs to devirtualize namespaces in many system calls, etc.) but it’s obviously worth it in most cases.
The redis case is called out as a worst case — the application itself does very little beyond dispatching I/O, so almost everything manifests as overhead. But if you're doing something that has 20% overhead, that needs hard security boundaries, and where fine-grained multi-tenancy can lower costs by 80%, it might make perfect sense. If something doesn't work for you because your trade-offs are different, just don't use it!
- tptacek 1 year agoAre you referring to gVisor the container runtime, or gVisor/netstack, the TCP/IP stack? I see more uptick in netstack. I don't see proliferation of gVisor itself. "Security" is much more salient to gVisor than it is to netstack.
- parhamn 1 year agoOn the issue of abysmal performance on cloud compute/PaaS, I'm talking about the container runtime (most PaaS is gVisor or Firecracker, no?): Cloud Run, DO, Modal, etc.
But given this article is about improving gvisors userland tcp performance significantly, it seems like the netstack stuff causes major performance losses too.
I saw a github link in another top article today https://github.com/misprit7/computerraria where the Readme's Pitch section feels very relevant to gvisor.
- tptacek 1 year agoI don’t believe many PaaS run gVisor; a surprising number just run multitenant Docker.
The netstack stuff here has nothing to do with the rest of gVisor.
- weitendorf 1 year agoIn the context of coder, the userspace TCP overhead should be negligible. Based on https://gvisor.dev/docs/architecture_guide/performance/ and assuming runc is mostly just using the regular kernel networking stack (I think it does, since it mostly just does syscall filtering?) it should be at most a 30% direct TCP performance hit. But in a real application you typically only spend a negligible amount of total time in the TCP stack - the client code, total e2e latency, and server code corresponding to a particular packet will take much more time.
You'll note their node/ruby benchmarks showed a substantially bigger performance hit. That's because the other gvisor sandboxing functionality (general syscall + file I/O) has more of an impact on performance, but also because these are network-processing bound applications (rare) that were still reaching high QPS in absolute terms for their respective runtimes (do you know many real-world node apps doing 350qps-800qps per instance?).
Because coder is not likely to be bottlenecked by CPU availability for networking, the resource overhead should be inconsequential, and what's really important is the impact on user latency. But that's something likely on the order of 1ms for a roundtrip that is already spending probably 30-50ms at best in transit between client and server (given that coder's server would be running in a datacenter with clients at home or the office), plus the actual application logic overhead which is at best 10ms. And that's very similar to a lot of gvisor netstack use cases which is why it's not as big of a deal as you think it is.
TLDR: For the stuff you'd actually care about (roundtrip latency) in the coder usecase the perf hit of using gvisor netstack should be like 2% at most, and most likely much less. Either way it's small enough to be imperceptible to the actual human using the client.
- shanemhansen 1 year agoGoogle is my former employer and this statement isn't referring to stuff I heard while employed there.
But after I left, I heard that a lot of the poor performance of Cloud Run is just plain old oversubscribed shared-core e2 stuff.
- kccqzy 1 year agoThere are still products from cloud providers that don't use gvisor. Basics like EC2 or GCE. Sounds like you chose the wrong cloud product.
- loosescrews 1 year agoCan you elaborate on your concern? Is the issue that you don't trust gVisor to keep the cloud provider secure?
- parhamn 1 year agoProviders managed secure shared environments for decades before ultra inefficient wrappers and runtimes like gVisor existed.
- tptacek 1 year agoNo. The providers that did so soundly used virtualization to accomplish this, and a big part of the appeal of K8s is having a much more lightweight unit of scheduling than full virtualization. gVisor is a middle ground between full virtualization and shared-kernel multitenancy (which has an abysmal security track record).
- lima 1 year agoOpenVZ, Virtuozzo and friends definitely weren't secure the way gVisor or Firecracker are. You can still do that and some providers do, doesn't make it a good idea.
- raggi 1 year agoIt's great to see this, I know the team went on a long journey through this and the blog makes it almost look shorter and simpler than it was. I'm hoping one day we can all integrate the support for GSO that's been landing in gvisor too, but so far we've (tailscale) not had a chance to look deeply into that yet. It was really effective for our tun and UDP interfaces though.
- kylecarbs 1 year agoAt Coder we’re fans and users of Tailscale, so very happy to have these changes be consumed upstream as well!
- ignoramous 1 year ago> one day we can all integrate the support for GSO that's been landing in gvisor
Google engineers recently rewrote the GSO bit, but unlike Tailscale's, it is only for TCP.
Besides, gvisor has had "software" & "hardware" GSO support for as long as I can remember.
- pantalaimon 1 year agoThe obvious question is: How does it compare to the in-Kernel TCP stack?
- raggi 1 year agoIt's less mature, which shows up in lots of places, such as sometimes having less than ideal defaults (as in buffer sizes shown here), and bugs if you start using more fancy features (which improve over time of course).
This is approximately the case for any alternative IP stack you might pick though, a mature IP stack is a huge undertaking with all the many flavors of enhancements to IP and particularly TCP over the years, the high variance in platform behaviors and configurations and so on.
In general you should only take on a dependency of a lesser-used IP stack if you're willing to retain or train IP experts in house over the long haul, because as is demonstrated here, taking on such a dependency means eventually you'll find a business need for that expertise. If that's way outside of your budget or wheelhouse, it might be worth skipping.
- syzcowboy99 1 year agogVisor's netstack is still much slower than the kernel's (and likely always will be). The goal of this userspace netstack is not to compete with the kernel on performance, but to offer an alternative that is more portable and secure.
- Xelynega 1 year agoHow is it more portable or secure than an API that's been stable for decades, and getting constant security fixes?
I see an explanation in their blog about avoiding TUN devices since they require elevated permissions, but why would you need a TUN device to send data to/from an application? I can't understand what their product does from the marketing material but it doesn't look like it would require constructing raw IP packets instead of TCP/UDP packets and letting the OS wrap them in the other layers.
- raggi 1 year agoYou can have multiple layers of security boundary on most of the customer-exposed surface area, and avoid more risky surface areas in the kernel.
Portable is a bit of a weird word here because for many of us with gray beards the word means architectures, kernels and systems, but I think in this context it tends to mean more "can run just as easily on my macbook as in a cloud container". In practice the software isn't that portable, as Go isn't that portable - at least not compared to a niche C "portable network stack" that would build roughly anywhere there's a working C toolchain, which is almost everywhere.
Constant security fixes for the kernel are a real pain in deployments unless you follow upstream kernels closely. If your business is in shipping Linux runtimes with a high packing density, you really need to find ways to minimize the exposed Linux surface area, or organize to be able to ship kernel upstream updates at an extremely high frequency (relative to normal infrastructure upgrade rates for kernels / mandatory reboots) (and I would not consider kexec safe in this kind of context, at all).
An alternative approach might be firecracker / microvms and so on, but those have their own tradeoffs too. The core point is that you want more than one layer between the host machines and the user code that wants to interact with Linux features.
- raggi 1 year agofor some definition of portable which is deeply tied to the go runtime
- jiveturkey 1 year agohelp me understand something.
> we’d need a way for the TCP packets to get from the operating system back into Coder for encryption.
yes, this is commonly done via OpenSSL for example.
> This is called a TUN device in unix-style operating systems and creating one requires elevated permissions
waitasec, wut? sure you could use a TUN device I guess, but assuming some kind of multi-tenant separation is an underlying assumption they didn't mention in their intro, couldn't you also use cgroup'd containers? sorry if I'm not fluent in the terminology.
i'm struggling to understand the constraints that push them towards gVisor. simply needing to do encryption doesn't seem like justification. i'm sure they have very good reasons, but needing to satisfy a financial regulator seems orthogonal at best. i would just like to understand those reasons.
- nynx 1 year agoDoesn’t creating a raw socket need elevated permissions?
- tptacek 1 year agoThey're not creating raw sockets†. The neat thing about WireGuard is that it runs over vanilla UDP, and presents to the "client" a full TCP/IP interface. We normally plug that interface directly into the kernel, but you don't have to; you can just write a userspace program that speaks WireGuard directly, and through it give a TCP/IP stack interface directly to your program.
† I don't think? I didn't see them say that, and we do the same thing and we don't create raw sockets.
- vlovich123 1 year agoSo it tunnels TCP/IP over Wireguard UDP?
- tptacek 1 year agoCorrect (I mean, that's fundamentally what WireGuard is: a UDP TCP/IP tunnel, with strong modern encryption).
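A rough sketch of that shape, using wireguard-go's bundled netstack helper (golang.zx2c4.com/wireguard/tun/netstack). This is untested, the identifiers may have drifted, and the keys and addresses are placeholders: the process speaks WireGuard over an ordinary UDP socket and gets back a dialer it can use like the standard library's, with no TUN device and no root.

```go
// Untested sketch: an in-process WireGuard peer with a userspace TCP/IP
// stack, based on wireguard-go's tun/netstack helper.
package main

import (
	"context"
	"log"
	"net/netip"

	"golang.zx2c4.com/wireguard/conn"
	"golang.zx2c4.com/wireguard/device"
	"golang.zx2c4.com/wireguard/tun/netstack"
)

func main() {
	// Userspace "interface": our tunnel-side address plus a DNS server.
	tun, tnet, err := netstack.CreateNetTUN(
		[]netip.Addr{netip.MustParseAddr("10.0.0.2")},
		[]netip.Addr{netip.MustParseAddr("10.0.0.1")},
		1420)
	if err != nil {
		log.Fatal(err)
	}

	// The WireGuard device runs entirely in-process over a UDP socket.
	dev := device.NewDevice(tun, conn.NewDefaultBind(), device.NewLogger(device.LogLevelError, ""))
	// UAPI config; keys are hex-encoded, placeholders here.
	if err := dev.IpcSet(`private_key=<hex-encoded client key>
public_key=<hex-encoded peer key>
endpoint=198.51.100.10:51820
allowed_ip=10.0.0.0/24
`); err != nil {
		log.Fatal(err)
	}
	if err := dev.Up(); err != nil {
		log.Fatal(err)
	}

	// Plain TCP to a service on the far side of the tunnel, e.g. Postgres.
	c, err := tnet.DialContext(context.Background(), "tcp", "10.0.0.1:5432")
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()
}
```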
- convolvatron 1 year agois this part of the open source releases? I looked at the coder.com github, but couldn't find it. I haven't written a compatible TCP, but I have written a different reliable transport in Go userspace. fairness aside, i wonder why we don't see this more often. would love to take a look
- tazjin 1 year agoThey upstreamed their gVisor changes: https://github.com/google/gvisor/pull/10287
- andrewstuart 1 year agoIf you’re tunneling a better connection configuration, isn’t the tunnel what defines the latency?
- andrewstuart 1 year agoI have a problem right now which is that it’s slow to copy large files from one side of the earth to the other. Is this the basis of a solution to that maybe?
- 392 1 year agoNo. Profile first. Make sure you've tried tweaking params like batch sizes.
- dpe82 1 year agoWhat do you think are the current problems contributing to your slow transfers?
- andrewstuart 1 year agoWindow and buffer sizes are a problem on high-latency links.
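For anyone following along, the back-of-the-envelope version of that problem (my own example numbers, not from the article): the receive window has to cover the bandwidth-delay product, or throughput is capped at window/RTT no matter how fast the link is.

```go
// Illustrative BDP arithmetic for a long-haul transfer.
package main

import "fmt"

func main() {
	const (
		bandwidthBps = 1_000_000_000 // 1 Gbit/s link
		rttSeconds   = 0.250         // ~250 ms intercontinental RTT
	)
	// Bytes that must be in flight to keep the pipe full.
	bdpBytes := bandwidthBps / 8 * rttSeconds
	fmt.Printf("BDP: %.1f MB\n", bdpBytes/1e6) // ≈ 31 MB

	// With, say, a 4 MB receive window the ceiling is window/RTT:
	const windowBytes = 4 << 20
	fmt.Printf("max throughput: %.0f Mbit/s\n", windowBytes*8/rttSeconds/1e6) // ≈ 134 Mbit/s
}
```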
- dpe82 1 year agoWhy do you suspect a user space implementation of TCP would improve those issues beyond existing kernel implementations?
- raggi 1 year agonot enough detail here to provide a good answer, but I can tell you explicitly that if you're using SMB you're likely not going to get good performance here even if your network stack has tons of room to overcome BDP and congestion challenges.
- jijji 1 year agoit's a solution looking for a problem
- hpeter 1 year agoIt's an engineering challenge and they do solve a problem, it's just not your problem :) It's a nice read anyways.
- lxgr 1 year agogVisor definitely solves a problem for me: https://news.ycombinator.com/item?id=39900329
- yencabulator 1 year agotl;dr Increased TCP receive buffer size, implemented HyStart instead of traditional TCP slow start in gVisor's netstack, changed an in-process packet queue from drop-when-full to block-when-full.
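For readers who want the flavor of that last change, here is the difference between the two queueing behaviors expressed as Go channel sends (illustrative only, not Coder's or gVisor's actual code):

```go
// Illustrative contrast between drop-when-full and block-when-full
// in-process packet queues.
package main

type packet []byte

var queue = make(chan packet, 1024)

// dropWhenFull: a non-blocking send; under load packets are silently lost
// and TCP has to notice and retransmit, which hurts throughput.
func dropWhenFull(p packet) bool {
	select {
	case queue <- p:
		return true
	default:
		return false // queue full, packet dropped
	}
}

// blockWhenFull: the send waits for space, applying backpressure to the
// producer instead of losing the packet.
func blockWhenFull(p packet) {
	queue <- p
}

func main() {}
```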