That's funny ... the "big guys" are some of the biggest contributors to the Linux network stack, almost as if they were actually using it and cared about how well it works.
History has shown that tons of Linux networking scalability and performance contributions have been rejected by the gatekeepers/maintainers. The upstream kernel remains unsuitable for datacenter use, and all the major operators bypass or patch it.
All the major operators sometimes bypass or patch it for some use cases. For others they use it as is. For others still, they laugh at you for taking the kind of drugs that makes one think any CPU is sufficient to handle networking in software.
Networking isn't a one-size-fits-all thing - different networks have different needs, and different systems in any network will have different needs.
Userland networking is great until you start needing to deal with weird flows or unexpected traffic. Then you either need something a bit more robust, and your performance drops because you added a bunch of branches to your code, or you switch over to a kernel implementation that handles those cases. I've seen a few cases of userland networking being slower than just using the kernel - and being kept anyway, because sometimes what you care about is control over the packet lifecycle more than raw throughput.
Kernels prioritize robust network stacks that can handle a lot of cases well enough. Different implementations handle different scenarios better - there's plenty of very high performance networking done with vanilla Linux and vanilla FreeBSD.
Over the course of several years, the architecture underpinning Snap has been used in production for multiple networking applications, including network virtualization for cloud VMs [19], packet-processing for Internet peering [62], scalable load balancing [22], and Pony Express, a reliable transport and communications stack that is our focus for the remainder of this paper.
This paper suggests, as I would have expected, that Google uses userland networking in strategic spots where low-level network development is important (SDNs and routing), and not for normal applications.
"and Pony Express" is the operative phrase. As the paper states on page 1, "Snap is deployed to over half of our fleet of machines and supports the needs of numerous teams." According to the paper it is not niche.
Makes sense; they're probably using QUIC in lots of products, and the kernel can't accelerate that anyway - it would only pass opaque UDP packets to and from the application.
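To make that concrete, here's a minimal sketch (Python; the payload bytes and destination address are made up for illustration) of everything the kernel sees from a QUIC sender: a UDP datagram. The handshake, encryption, loss recovery, and congestion control all happened in userspace before the sendto().

    import socket

    # What the kernel sees of a QUIC connection: an opaque UDP datagram.
    # A real stack would have built and encrypted a QUIC Initial packet
    # entirely in userspace; this payload is just a stand-in.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    quic_packet = b"\xc3" + b"\x00" * 1199  # hypothetical 1200-byte "Initial"
    sock.sendto(quic_packet, ("198.51.100.1", 443))  # example address (TEST-NET-2)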
Last I remember, as of at least 7 years ago, Google et al. were using custom NIC firmware to avoid having the kernel get involved at all (I think they managed to do a lot of Maglev directly on the NICs), because latency is so dang important at high networking speeds that letting anything context switch and wait on the kernel is a big performance hit. Not a lot of room for latency when you're working at 100 Gbps.
Correct. That is my point. The sockets interface, and design choices within the Linux kernel, make ordinary TCP sockets too difficult to exploit in a datacenter environment. The general trend is away from TCP sockets. QUIC (HTTP/3) is a less extreme retreat from TCP, moving all the flow control, congestion, and retry logic out of the kernel and into the application.
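For illustration, here's a toy sketch (Python; the function name, 5ms timeout, and stop-and-wait structure are mine, not from any real stack) of what "retry logic in the application" means: over UDP, the application owns the retransmission timer instead of the kernel's TCP state machine.

    import socket

    # Toy illustration of retransmission logic living in the application:
    # over UDP the app, not the kernel, decides when to retransmit. Real
    # QUIC stacks use RTT estimators and congestion control; this only
    # shows where the decision moves. The 5ms timeout is an example value.
    def send_with_retry(sock, data, addr, rto=0.005, max_tries=3):
        sock.settimeout(rto)  # app-chosen timeout, not a kernel-wide floor
        for _ in range(max_tries):
            sock.sendto(data, addr)
            try:
                reply, _ = sock.recvfrom(2048)
                return reply
            except socket.timeout:
                continue  # retransmit on our own schedule
        raise TimeoutError("no reply after %d tries" % max_tries)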
An example of how Linux TCP is unsuitable for datacenters is that the minimum RTO is hard-coded to 200ms, which is essentially forever; at intra-datacenter RTTs measured in tens of microseconds, that floor is on the order of 10,000 round trips. People have been trying to land better or at least more configurable parameters upstream for decades. I am hardly the first person to point out the deficiencies. Google presented on tuning Linux for datacenter applications at LPC 2022, and their deck has barely changed in 15 years.
At the point where we're talking about applications that don't even use standard protocols, we've stopped supplying data points about whether FreeBSD's stack is faster than Linux's, which is the point of the thread.
Later
Also, the idea that QUIC is a concession made to intractable Linux stack problems (the subtext I got from that comment) seems pretty off, since the problems QUIC addresses (HOLB, &c) are old, well known, and were the subject of previous attempts at new transports (SCTP, notably).
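(For anyone unfamiliar with the acronym: a toy model of head-of-line blocking, the problem QUIC's independent streams address. With one strictly ordered byte stream, a single lost segment stalls every multiplexed stream behind it, even when their data has already arrived. The data below is invented for the example.)

    # Toy model of head-of-line blocking (HOLB). Two logical streams are
    # multiplexed over one connection; segment 1 of stream "A" was lost.
    arrivals = [("A", 2, b"a2"), ("B", 1, b"b1")]  # (stream, seq, data)

    # TCP-like: one global order for the whole connection, so nothing is
    # deliverable until A's missing segment 1 is retransmitted.
    delivered_tcp = []

    # QUIC-like: ordering is per-stream, so B's data is deliverable now.
    delivered_quic = [d for s, seq, d in arrivals if s == "B" and seq == 1]
    print(delivered_tcp, delivered_quic)  # -> [] [b'b1']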
https://blog.acolyer.org/2019/11/11/snap-networking/