为什么能有上百万个 Goroutines,却只能有上千个 Java 线程?
Nov 14, 2018 15:00 · 4133 words · 9 minute read
译文
很多有经验的工程师在使用基于 JVM 的语言时,都会看到这样的错误:
[error] (run-main-0) java.lang.OutOfMemoryError: unable to create native thread:
[error] java.lang.OutOfMemoryError: unable to create native thread:
[error] at java.base/java.lang.Thread.start0(Native Method)
[error] at java.base/java.lang.Thread.start(Thread.java:813)
...
[error] at java.base/java.lang.Thread.run(Thread.java:844)
呃,OutOfMemory……其实是线程不够用了。在我运行 Linux 的笔记本电脑上,仅仅创建了 11500 个线程之后,就会出现这个错误。
如果你用 Go 语言做相同的事情,启动无限期休眠的 Goroutine,那么你会看到非常不同的结果。在我的笔记本电脑上,在我觉得实在乏味无聊之前,我已经创建了七千万个 Goroutine。那么,为什么 Goroutine 的数量能够远远超过线程呢?答案需要我们一路深入到操作系统底层,再返回上层,走一个来回。这不仅仅是一个学术问题,它对你如何设计软件有着现实的影响。在生产环境中,我曾经多次碰到 JVM 的线程上限,有些是因为糟糕的代码泄漏线程,有些则是因为工程师根本没有意识到 JVM 的线程限制。
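下面是一段最小的示意代码,用来大致重现上面提到的 Goroutine 实验:启动一大批永远阻塞(相当于无限期休眠)的 Goroutine,再看看进程的内存占用。其中 1_000_000 这个数量只是演示用的假设值,可以按自己机器的内存调整;这并不是原文作者使用的代码,只是按同样思路写的一个草图。

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// 演示用的假设数量,可按机器内存调大或调小。
	const n = 1_000_000

	for i := 0; i < n; i++ {
		go func() {
			// 永远阻塞:空 select 不消耗 CPU,只占着这个 Goroutine 的一小块栈。
			select {}
		}()
	}

	// 稍等片刻,打印 Goroutine 数量以及进程向操作系统申请的内存总量。
	time.Sleep(time.Second)
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("goroutines: %d, memory obtained from OS: %d MB\n",
		runtime.NumGoroutine(), m.Sys/1024/1024)
}

在 JVM 上按同样的思路循环地 new Thread(...).start() 并让线程休眠,很快就会碰到文章开头的那个 OutOfMemoryError。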
那到底什么是线程?
术语“线程”可以用来指代很多不同的东西。在本文中,我用它来指一个逻辑线程,也就是:按线性顺序执行的一系列操作;一条执行的逻辑路径。每个 CPU 核心在同一时刻只能真正执行一个逻辑线程。这带来一个固有的问题:如果线程的数量多于核心的数量,那么一部分线程必须暂停,让其他线程运行,等再次轮到自己时再恢复执行。为了支持暂停和恢复,线程至少需要如下两样东西:
1. 某种类型的指令指针。也就是,当我暂停的时候,我正在执行哪行代码?
2. 一个栈。也就是,我当前的状态是什么?栈中包含了局部变量,以及指向堆上所分配变量的指针。同一进程中的所有线程共享同一个堆。
有了这两样东西,系统在把线程调度到 CPU 上时就有了足够的信息,可以暂停某个线程、让其他线程运行,随后再从中断的地方恢复原来的线程。这种操作通常对线程来说是完全透明的:从线程自己的角度看,它是在连续运行的。线程能够察觉自己被换下过的唯一办法,就是测量前后两次操作之间的时间间隔。
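为了说明“线程只能通过测量时间间隔来察觉自己被换下过”这一点,下面给出一个小小的 Go 示意程序(其中 1 毫秒的阈值是随意假设的):循环里相邻两次取时间之间如果出现明显的停顿,多半说明这段时间里自己没有在 CPU 上运行,当然也可能是 GC 之类的其他原因。

package main

import (
	"fmt"
	"time"
)

func main() {
	// 假设的阈值:相邻两次循环之间超过 1ms,就认为“我刚才可能被暂停过”。
	const threshold = time.Millisecond

	prev := time.Now()
	for i := 0; i < 50_000_000; i++ {
		now := time.Now()
		if gap := now.Sub(prev); gap > threshold {
			// 两次循环之间出现了明显的空洞:很可能是调度器把我们换下去运行别的线程了。
			fmt.Printf("iteration %d: gap of %v\n", i, gap)
		}
		prev = now
	}
}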
回到我们最原始的问题:我们为什么能有这么多的 Goroutines 呢?
JVM 使用操作系统线程
尽管这并非规范所要求,但据我所知,所有现代的通用 JVM 在所有可行的平台上都把线程委托给了操作系统线程来处理。在接下来的内容中,我会使用“用户空间线程(user space thread)”来指由语言而不是由内核/OS 调度的线程。操作系统实现的线程有两个属性,极大地限制了它们能够存在的数量;任何把语言线程和操作系统线程按 1:1 映射的方案,都无法支持大规模并发。
在 JVM 中:固定大小的栈
使用操作系统线程将会导致每个线程都有固定的、较大的内存成本
采用操作系统线程的第一个主要问题是,每个 OS 线程都有大小固定的栈。尽管这个大小可以配置,但在 64 位环境中,JVM 默认为每个线程分配 1MB 的栈。你可以把默认的栈空间设置得更小一些,但这是在用更高的栈溢出风险换取更低的内存占用:代码中的递归越多,就越有可能出现栈溢出。如果保持默认值,那么 1000 个线程就会用掉接近 1GB 的 RAM。虽然现在 RAM 便宜了很多,但几乎没有人会为了运行上百万个线程而准备 TB 级别的 RAM。
Go 的行为有何不同:动态大小的栈
Golang 用了一个聪明的技巧,防止巨大却大部分闲置的栈把系统内存耗尽:Go 的栈是动态分配大小的,会随着所存数据量的多少而增长和收缩。这并不是一件简单的事情,其设计经历了多轮迭代。我不打算在这里讲内部细节(这方面已经有很多博客文章和其他材料做了详细阐述),但结论就是:每个新建的 Goroutine 的栈只有大约 4KB。按每个栈 4KB 计算,1GB 的 RAM 就能放下 250 万个 Goroutine,相对于 Java 中每个线程 1MB 的栈,这是巨大的提升。
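下面这个小例子可以粗略感受一下栈的自动增长(初始栈的确切大小和增长策略随 Go 版本而变,这里不做精确断言):新建的 Goroutine 起初只有几 KB 的栈,但在其中做上万层、每层带 1KB 局部数组的递归也不会栈溢出,因为运行时会按需扩容。

package main

import "fmt"

// 每层递归在栈上放一个 1KB 的局部数组,迫使运行时不断为这个 Goroutine 扩栈。
func grow(depth int) int {
	var buf [1024]byte
	buf[0] = byte(depth)
	if depth == 0 {
		return int(buf[0])
	}
	return grow(depth-1) + int(buf[0])
}

func main() {
	done := make(chan int)
	go func() {
		// 大约 10000 层 × 1KB ≈ 10MB 的栈需求,
		// 远超新 Goroutine 初始的那几 KB,但 Go 会自动把栈扩大到够用为止。
		done <- grow(10_000)
	}()
	fmt.Println("deep recursion finished, result =", <-done)
}

反过来,当这些栈帧弹出之后,运行时还可以在垃圾回收时把栈收缩回去,不会一直占着这块内存。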
在 JVM 中:上下文切换的延迟
从上下文切换的角度来说,使用操作系统线程只能有数万个线程
因为 JVM 使用操作系统线程,所以它依赖操作系统内核来调度这些线程。操作系统维护着一份所有正在运行的进程和线程的列表,并试图给它们分配“公平”的 CPU 运行时间。当内核从一个线程切换到另一个线程时,有很多工作要做:它必须为新换上来的线程或进程营造出一个视图,把“还有其他线程在同一个 CPU 上运行”这一事实抽象掉。我不会在这里讨论细节,但如果你感兴趣的话,可以阅读更多的材料。这里的关键在于,一次上下文切换要消耗大约 1 到 100 微秒。这看起来并不多,但按比较现实的每次切换 10 微秒来算,如果你想让每个线程每秒至少被调度一次,那么每个核心只能运行大约 10 万个线程,而且这还没有给线程留出做有用工作的时间。
Go 的行为有何不同:在一个操作系统线程上运行多个 Goroutines
Golang 实现了自己的调度器,允许众多 Goroutine 运行在同一个 OS 线程上。就算 Go 运行的是和内核完全相同的上下文切换代码,它也能省下大量时间,因为不需要切换到 ring-0 进入内核再切换回来。不过,这只是入场的基本条件。为了真正支持上百万个 Goroutine,Go 还需要做更复杂的事情。
即便 JVM 把线程放到用户空间,它也无法支持上百万个线程。假设在这个新设计的系统中,线程之间的切换只需要 100 纳秒,那么即便你做的只有上下文切换这一件事,想要每秒钟把每个线程调度十次,也只能运行大约 100 万个线程;更重要的是,这样做已经把 CPU 完全占满了。要支持真正的大规模并发,还需要另一项优化:只有当你知道线程能够做有用的工作时,才去调度它!如果你运行那么多线程,其实任何时刻都只有一小部分线程在做有用的工作。Go 通过把通道(channel)和调度器(scheduler)集成在一起来实现这一点:如果某个 Goroutine 正在一个空的通道上等待,调度器能看到这一点,就不会运行这个 Goroutine。Go 更进一步,把那些大部分时间处于空闲状态的 Goroutine 放到它们自己的操作系统线程上。通过这种方式,数量(但愿)少得多的活跃 Goroutine 可以由同一个线程来调度,而数以百万计、大多在休眠的 Goroutine 则被单独照看。这有助于降低延迟。
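下面是一个示意性的小程序(其中 100_000 这个数量纯属演示假设):大量 Goroutine 阻塞在一个空通道上,调度器知道它们无事可做,根本不会去轮转它们;真正占用 CPU 的只有少数活跃的 Goroutine。

package main

import (
	"fmt"
	"runtime"
)

func main() {
	const idle = 100_000 // 演示用的假设数量

	work := make(chan int) // 永远不会有人往这个通道里发数据

	// 这些 Goroutine 全部阻塞在对空通道的接收操作上:
	// 它们被调度器挂起,不占用任何 OS 线程的运行时间。
	for i := 0; i < idle; i++ {
		go func() {
			for range work {
			}
		}()
	}

	// 只有这一个活跃的 Goroutine 在真正做事。
	done := make(chan int)
	go func() {
		sum := 0
		for i := 0; i < 1_000_000; i++ {
			sum += i
		}
		done <- sum
	}()

	fmt.Println("active work result:", <-done)
	// 此时 NumGoroutine 大约等于 idle 加上 main 等少数几个。
	fmt.Println("total goroutines:", runtime.NumGoroutine())
}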
除非 Java 增加可供调度器观察的语言特性,否则就不可能支持这种智能调度。不过,你可以在“用户空间”中构建运行时调度器,让它感知线程什么时候有工作可做。这正是 Akka 这类框架的基础,它们能够支持上百万个 Actor。
结论
从使用操作系统线程的模型转向轻量级的用户空间线程模型,这样的事情已经一再发生,未来大概还会继续。对于需要高度并发的使用场景来说,这是唯一的选择。然而,它也带来了相当大的复杂性。如果 Go 选择使用 OS 线程,而不是自己的调度器和可增长的栈方案,他们可以从运行时里砍掉数千行代码。但对很多使用场景而言,这确实是更好的模型:复杂性可以由语言和库的作者抽象掉,这样软件工程师就能编写大规模并发的程序了。
原文
Many seasoned engineers working in JVM based languages have seen errors like this:
[error] (run-main-0) java.lang.OutOfMemoryError: unable to create native thread:
[error] java.lang.OutOfMemoryError: unable to create native thread:
[error] at java.base/java.lang.Thread.start0(Native Method)
[error] at java.base/java.lang.Thread.start(Thread.java:813)
...
[error] at java.base/java.lang.Thread.run(Thread.java:844)
OutOfMemory…err…out of threads. On my laptop running Linux, this happens after a paltry 11500 threads.
If you try the same thing in Go by starting Goroutines that sleep indefinitely, you get a very different result. On my laptop, I got up to 70 million goroutines before I got bored. So why can you have so many more Goroutines than threads? The answer is a fun journey down the operating system and back up again. And this isn’t just an academic issue – it has real world implications for how you design software. I’ve run into JVM thread limits in production literally dozens of times, either because some bad code was leaking threads, or because an engineer simply wasn’t aware of the JVM’s thread limitations.
What’s a thread anyway?
The term “thread” can mean a lot of different things. In this post, I’m going to use it to refer to a logical thread. That is: a series of operations that are run in a linear order; a logical path of execution. CPUs can only execute about one logical thread per core truly concurrently. An inherent side effect: If you have more threads than cores, threads must be paused to allow other threads to do work, then later resumed when it’s their turn again. To support being paused and resumed, a thread minimally needs two things:
1. An instruction pointer of some kind. AKA: What line of code was I executing when I was paused?
2. A stack. AKA: What is my current state? The stack contains local variables as well as pointers to heap allocated variables. All threads within the same process share the same heap.
Given these two things, the system scheduling threads onto the CPU has enough information to pause a thread, allow other threads to run, then later resume the original thread where it left off. This operation is usually completely transparent to the threads. From the perspective of a thread, it is running continuously. The only way a thread could observe being descheduled is by measuring the time between subsequent operations.
Getting back to our original question: why can you have so many more Goroutines?
The JVM uses operating system threads
Although it’s not required by the spec, all modern, general purpose JVMs that I’m aware of delegate threading to operating system threads on all platforms where this is possible. Going forward, I’ll use the phrase “user space threads” to refer to threads that are scheduled by the language instead of by the kernel/OS. Threads implemented by the operating system have two properties that drastically limit how many of them can exist; no solution that maps language threads 1:1 with operating system threads can support massive concurrency.
In the JVM: Fixed Stack Size
Using operating system threads incurs a constant, large, memory cost per thread.
The first major problem with operating system threads is that each OS thread has its own fixed-size stack. Though the size is configurable, in a 64-bit environment, the JVM defaults to a 1MB stack per thread. You can make the default stack size smaller, but you trade off memory usage against an increased risk of stack overflow. The more recursion in your code, the more likely you are to hit stack overflow. If you keep the default value, 1k threads will use almost 1GB of RAM! RAM is cheap these days, but almost no one has the terabyte of RAM you’d need to run a million threads with this machinery.
How Go does it differently: Dynamically Sized Stacks
Golang prevents large (mostly unused) stacks running the system out of memory with a clever trick: Go’s stacks are dynamically sized, growing and shrinking with the amount of data stored. This isn’t a trivial thing to do, and the design has gone through a couple of iterations. While I’m not going to get into the internal details here (they’re more than enough for their own posts and others have written about it at length), the upshot is that a new goroutine will have a stack of only about 4KB. With 4KB per stack, you can put 2.5 million goroutines in a gigabyte of RAM – a huge improvement over Java’s 1MB per thread.
In the JVM: Context Switching Delay
Using operating system threads caps you in the double digit thousands, simply from context switching delay.
Because the JVM uses operating system threads, it relies on the operating system kernel to schedule them. The operating system has a list of all the running processes and threads, and attempts to give them each a “fair” share of time running on the CPU. When the kernel switches from one thread to another, it has a significant amount of work to do. The new thread or process running must be started with a view of the world that abstracts away the fact that other threads are running on the same CPU. I won’t get into the nitty gritty here, but you can read more if you’re curious. The critical takeaway is that switching contexts will take on the order of 1-100 microseconds. This may not seem like much, but at a fairly realistic 10 microseconds per switch, if you want to schedule each thread at least once per second, you’ll only be able to run about 100k threads on 1 core. And this doesn’t actually give the threads time to do any useful work.
How Go does it differently: Run multiple Goroutines on a single OS thread
Golang implements its own scheduler that allows many Goroutines to run on the same OS thread. Even if Go ran the same context switching code as the kernel, it would save a significant amount of time by avoiding the need to switch into ring-0 to run the kernel and back again. But that’s just table stakes. To actually support 1 million goroutines, Go needs to do something much more sophisticated.
Even if the JVM brought threads into user space, it still wouldn’t be able to support millions of threads. Suppose for a minute that in your new system, switching between threads takes only 100 nanoseconds. Even if all you did was context switch, you could only run about a million threads if you wanted to schedule each thread ten times per second. More importantly, you’d be maxing out your CPU to do so. Supporting truly massive concurrency requires another optimization: Only schedule a thread when you know it can do useful work! If you’re running that many threads, only a handful can be doing useful work anyway. Go facilitates this by integrating channels and the scheduler. If a goroutine is waiting on an empty channel, the scheduler can see that and it won’t run the Goroutine. Go goes one step further and actually sticks the mostly-idle goroutines on their own operating system thread. This way the (hopefully much smaller) number of active goroutines can be scheduled by one thread while the millions of mostly-sleeping goroutines can be tended to separately. This helps keep latency down.
Unless Java added language features that the scheduler could observe, supporting intelligent scheduling would be impossible. However, you can build runtime schedulers in “user space” that are aware of when a thread can do work. This forms the basis for frameworks like Akka that can support millions of actors.
Closing Thoughts
Transitioning from a model using operating system threads to a model using lightweight, user space threads has happened over and over again and will probably continue to happen. For use cases where a high degree of concurrency is required, it’s simply the only option. However, it doesn’t come without considerable complexity. If Go opted for OS threads instead of their own scheduler and growable-stack scheme, they would shave thousands of lines off the runtime. For many use cases, it’s simply a better model. The complexity can be abstracted away by language and library writers, and software engineers can write massively concurrent programs.