Linux I/O 多路复用(select & poll & epoll)

Apr 5, 2021 12:15 · 4462 words · 9 minute read Linux OS

译文

假如你用 node.js 在 Linux 上写一个 web 服务,它实际上在底层使用了 epoll 这个 Linux 系统调用。我们来谈谈为什么,以及 epoll 与 select、poll 有什么区别,还有它们的工作原理。

服务器需要监听很多文件描述符(file descriptors)

一个 web 服务器,每当通过 accept 系统调用建立一条连接,都会得到一个代表那条连接的新文件描述符。作为一个 web 服务,同时可能会有成千上万的连接打开着。你需要知道什么时候有人在这些连接上发来了新数据,这样才能处理请求并应答。

你可能会写这么个循环:

for x in open_connections:
    if has_new_input(x):
        process_input(x)

如此处理的问题是会浪费掉很多 CPU 时间。与其消耗 CPU 时间去轮询“有更新吗?现在呢?现在呢?现在呢?”,不如我们直接告诉 Linux 内核:“嘿,这里有一百个文件描述符,只要其中某个更新了就告诉我。”

这三个系统调用可以让你告诉 Linux 去监控众多文件描述符,它们就是 select、poll 和 epoll。我们先从 select 与 poll 开始。

select & poll

所有 UNIX 系统都有这两个系统调用,而 epoll 是 Linux 特有的。它们俩的工作原理是:

  1. 传给它们一堆文件描述符
  2. 它们告诉你哪个文件描述符有数据可以读写了

一个令人惊讶的事实是,select 和 poll 的代码基本相同!

我看了 select 和 poll 在 Linux 内核中的源码,确认了这一点。

它们都调用了很多相同的函数。书中特别提到,poll 为每个文件描述符返回的结果集合更丰富,例如 POLLRDNORM | POLLRDBAND | POLLIN | POLLHUP | POLLERR,而 select 只告诉你“有输入 / 有输出 / 有错误”。

select 会把 poll 更具体的结果(例如 POLLWRBAND)归并成粗粒度的“你可以写了”。你可以在 Linux 4.10 的源码里看到做这个转换的相关代码。

我学到的另一件事是:如果你监控的文件描述符集合比较稀疏,poll 的性能会比 select 好。

要明白这一点,看看 poll 和 select 的函数签名就够了:

int ppoll(struct pollfd *fds, nfds_t nfds,
          const struct timespec *tmo_p, const sigset_t *sigmask);
int pselect(int nfds, fd_set *readfds, fd_set *writefds,
            fd_set *exceptfds, const struct timespec *timeout,
            const sigset_t *sigmask);

对于 poll,你告诉它“这些是我想要监控的文件描述符:1、3、8、19 等等”(这就是 pollfd 参数)。而对于 select,你告诉它“我想要监控 19 个文件描述符,这是三个位图(bitset),分别标出哪些要监控读 / 写 / 异常”。当它运行时,会把从 0 到 19 的文件描述符都检查一遍,即使你真正关心的只有其中 4 个。

这两个是最主要的差别。
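
下面是一个最小化的草图(假设我们只关心 4 号和 9 号这两个文件描述符,编号纯属举例),对比同样的“等到有数据可读”在 poll 和 select 里分别怎么写:

#include <poll.h>
#include <sys/select.h>
#include <stddef.h>

/* poll 的写法:只把真正关心的 fd 放进 pollfd 数组 */
void wait_with_poll(void) {
    struct pollfd fds[2] = {
        { .fd = 4, .events = POLLIN },
        { .fd = 9, .events = POLLIN },
    };
    poll(fds, 2, -1);                 /* timeout 为 -1,阻塞到有事件为止 */
    /* 返回后检查 fds[i].revents 里是否带 POLLIN */
}

/* select 的写法:nfds 要填“最大 fd + 1”,内核会从 0 一路检查到 9 */
void wait_with_select(void) {
    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(4, &readfds);
    FD_SET(9, &readfds);
    select(9 + 1, &readfds, NULL, NULL, NULL);   /* timeout 为 NULL,一直阻塞 */
    /* 返回后用 FD_ISSET(4, &readfds) 等宏检查哪个 fd 可读 */
}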

为什么不使用 select 和 poll

我说过 node.js web 服务既不使用 select 也不用 poll,而是用了 epoll。为啥?

每次调用 select() 或 poll(),内核都必须把传入的所有文件描述符检查一遍,确认它们是否就绪。当监控大量文件描述符时,这项检查所花的时间会远远超过内核需要做的其余工作。

一句话说:你每次调用 select 或 poll,内核都要从头开始检查你的文件描述符是否可供写入。内核不会记住它应该监控的文件描述符列表。

信号驱动 I/O

有两种让内核记住应当监听的文件描述符列表的方式:信号驱动 I/O 和 epoll。信号驱动 I/O 是通过调用 fcntl 让内核在有文件描述符更新数据时发送一个信号给你。我从没听过有人在用这个,epoll 就是更好的。所以我们直接忽略好了,下面聊聊 epoll。
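
给个直观印象,下面是信号驱动 I/O 大致的用法草图(假设 fd 是一个已经打开的 socket,省略了错误处理,仅作示意):

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

static void on_sigio(int sig) {
    /* 收到 SIGIO:说明某个开启了 O_ASYNC 的 fd 有动静了
       (具体是哪个 fd 得自己记录,或者用 F_SETSIG 配合 SA_SIGINFO 从 siginfo 里取) */
    (void)sig;
}

void enable_signal_driven_io(int fd) {
    signal(SIGIO, on_sigio);              /* 注册 SIGIO 的处理函数 */
    fcntl(fd, F_SETOWN, getpid());        /* 让内核把信号发给当前进程 */
    int flags = fcntl(fd, F_GETFL);
    fcntl(fd, F_SETFL, flags | O_ASYNC);  /* 打开 O_ASYNC,开启信号驱动 I/O */
}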

水平触发(level-triggered) vs 边沿触发(edge-triggered)

在我们讨论 epoll 前,要先讲下 level-triggered 和 edge-triggered 两种文件描述符通知:

  • level-triggered:拿到每个都可读的且是你感兴趣的文件描述符列表
  • edge-triggered:每当有文件描述符可读时就收到一个通知

这两个概念来自电路,triggered 代表电路激活,也就是有事件通知给程序,level-triggered 表示只要有 IO 操作可以进行(比如某个文件描述符有数据可读),每次调用 epoll_wait 都会返回以通知程序可以进行 IO 操作;edge-triggered 表示只有在文件描述符状态发生变化时,调用 epoll_wait 才会返回,如果第一次没有全部读完该文件描述符的数据而且没有新数据写入,再次调用 epoll_wait 不会通知程序,因为文件描述符的状态没有变化。
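
提前用 epoll 的接口做个示意(epoll 本身下一节才正式介绍):边沿触发对应注册时加上 EPOLLET 标志。下面这个片段是个草图,假设 epfd 是现成的 epoll 实例、fd 是已设为非阻塞的 socket。边沿触发模式下,每次收到通知都要把数据读到返回 EAGAIN 为止,否则剩下的数据不会再触发新的通知。

#include <sys/epoll.h>
#include <errno.h>
#include <unistd.h>

/* 以边沿触发(EPOLLET)方式把 fd 注册进 epoll 实例 */
void register_edge_triggered(int epfd, int fd) {
    struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data = { .fd = fd } };
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* 收到一次通知后,把这次“边沿”带来的数据全部读完 */
void drain(int fd) {
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            /* 在这里处理读到的 n 字节数据 */
        } else if (n == -1 && errno == EAGAIN) {
            break;    /* 暂时没有更多数据了,等下一次通知 */
        } else {
            break;    /* n == 0 表示对端关闭,n == -1 则是其他错误 */
        }
    }
}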

什么是 epoll

现在终于可以说 epoll 了!这让我很兴奋,因为我 strace 程序时经常见到 epoll_wait,却常常搞不清它到底在干什么。

epoll 这一组系统调用(epoll_create、epoll_ctl、epoll_wait)会交给 Linux 内核一份文件描述符清单,让内核替你追踪它们,并在它们有更新时告诉你。

下面是使用 epoll 的步骤(三步之后附了一段把它们串起来的示意代码):

  1. 调用 epoll_create 来告诉内核你要 epolling 了!它会返回给你一个 ID。
  2. 调用 epoll_ctl 来告诉内核你关心哪些文件描述符的更新。有趣的是,你可以给它很多种类的文件描述符(pipe、FIFO、socket、POSIX 消息队列、inotify 实例、设备等等),但不能是常规文件。我觉得这是合理的:pipe 和 socket 的 API 相当简单(一个进程写 pipe,另一个进程读),所以可以说“这个 pipe 有新数据可读了”。但文件就比较奇怪了,你可以往文件中间写数据,所以没法简单地说“该文件有新数据可以读取”。
  3. 调用 epoll_wait 来等待你关心的那份文件描述符清单上的更新。
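
把这三步串起来,大致就是下面这样(一个最小化的草图:假设 listen_fd 是一个已经处于 listen 状态的 socket,为了简洁省略了错误处理):

#include <sys/epoll.h>
#include <sys/socket.h>
#include <stddef.h>

#define MAX_EVENTS 64

void epoll_loop(int listen_fd) {
    /* 第 1 步:创建 epoll 实例,返回的其实是一个代表它的文件描述符 */
    int epfd = epoll_create1(0);

    /* 第 2 步:告诉内核我们关心 listen_fd 上的可读事件(也就是有新连接) */
    struct epoll_event ev = { .events = EPOLLIN, .data = { .fd = listen_fd } };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        /* 第 3 步:阻塞等待,内核只把真正“有动静”的文件描述符返回给我们 */
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == listen_fd) {
                /* 监听 socket 可读:accept 出新连接,并注册进同一个 epoll 实例 */
                int conn_fd = accept(listen_fd, NULL, NULL);
                struct epoll_event cev = { .events = EPOLLIN, .data = { .fd = conn_fd } };
                epoll_ctl(epfd, EPOLL_CTL_ADD, conn_fd, &cev);
            } else {
                /* 某条已有连接可读:在这里读取并处理请求 */
            }
        }
    }
}

注意这里注册的是默认的水平触发模式;如果换成上一节提到的 EPOLLET 边沿触发,就要配合“读到 EAGAIN 为止”的处理。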

性能表现:select vs poll vs epoll

书中有一张表格,对比了执行十万次监控操作的耗时:

被监控的文件描述符数量   poll(秒)   select(秒)   epoll(秒)
10                       0.61        0.73         0.41
100                      2.9         3.0          0.42
1000                     35          35           0.53
10000                    990         930          0.66

当你要监控 10 个以上文件描述符,用 epoll 要快得多。

谁在用 epoll?

我 strace 程序时有时会看到 epoll_wait。为什么?“它在监控某些文件描述符”这个回答显而易见却没什么信息量,我们可以答得更好!

首先,如果你在使用绿色线程或者事件循环,那么你很可能就是在用 epoll 来完成所有的网络和 pipe I/O!

举个例子,下面是一个在 Linux 上使用 epoll 的 golang 程序:

package main

import "net/http"
import "io/ioutil"

func main() {
    resp, err := http.Get("http://example.com/")
    if err != nil {
        // handle error
    }
    defer resp.Body.Close()
    _, err = ioutil.ReadAll(resp.Body)
}

能看到 go 运行时利用 epoll 来做 DNS 查询:

16016 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.1.1")}, 16 <unfinished ...>
16020 socket(PF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP
16016 epoll_create1(EPOLL_CLOEXEC <unfinished ...>
16016 epoll_ctl(5, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=334042824, u64=139818699396808}}
16020 connect(4, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.1.1")}, 16 <unfinished ...>
16020 epoll_ctl(5, EPOLL_CTL_ADD, 4, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=334042632, u64=139818699396616}}

基本上它所做的就是连接两个 socket(文件描述符 3 和 4)来发 DNS 查询(发往 127.0.1.1:53),然后用 epoll_ctl 让 epoll 在它们有更新时通知我们。

接着它发出了两条关于 example.com 的 DNS 查询(为啥是两条?nelhage 猜测其中一条查的是 A 记录,另一条查的是 AAAA 记录),并使用 epoll_wait 等待应答:

# these are DNS queries for example.com!
16016 write(3, "\3048\1\0\0\1\0\0\0\0\0\0\7example\3com\0\0\34\0\1", 29
16020 write(4, ";\251\1\0\0\1\0\0\0\0\0\0\7example\3com\0\0\1\0\1", 29
# here it tries to read a response but I guess there's no response
# available yet
16016 read(3,  <unfinished ...>
16020 read(4,  <unfinished ...>
16016 <... read resumed> 0xc8200f4000, 512) = -1 EAGAIN (Resource temporarily unavailable)
16020 <... read resumed> 0xc8200f6000, 512) = -1 EAGAIN (Resource temporarily unavailable)
# then it uses epoll to wait for responses
16016 epoll_wait(5,  <unfinished ...>
16020 epoll_wait(5,  <unfinished ...>

那么 go/node.js/Python 用什么库来做 epoll?以 Go 为例,它的运行时直接在 netpoll 里调用 epoll,相关代码见 https://github.com/golang/go/blob/91c9b0d568e41449f26858d88eb2fd085eaf306d/src/runtime/netpoll_epoll.go;node.js 则是通过 libuv,在 Linux 上同样用的是 epoll。

像 nginx 这样的 web 服务器也自己实现了对 epoll 的调用。

更多 select 和 epoll 相关阅读

Marek 的这几篇博客特别讲到,epoll 对多线程程序的支持向来都不是很好,不过 Linux 4.5 中有所改进。

还有这篇:


原文

For example if you’re writing a web server in node.js on Linux, it’s actually using the epoll Linux system call under the hood. Let’s talk about why, how epoll is different from poll and select, and about how it works!

Servers need to watch a lot of file descriptors

Suppose you’re a webserver. Every time you accept a connection with the accept system call (here’s the man page), you get a new file descriptor representing that connection.

If you’re a web server, you might have thousands of connections open at the same time. You need to know when people send you new data on those connections, so you can process and respond to them.

You could have a loop that basically does:

for x in open_connections:
    if has_new_input(x):
        process_input(x)

The problem with this is that it can waste a lot of CPU time. Instead of spending all CPU time to ask “are there updates now? how about now? how about now? how about now?“, instead we’d rather just ask the Linux kernel “hey, here are 100 file descriptors. Tell me when one of them is updated!“.

The 3 system calls that let you ask Linux to monitor lots of file descriptors are poll, epoll and select. Let’s start with poll and select because that’s where the chapter started.

First way: select & poll

These 2 system calls are available on any Unix system, while epoll is Linux-specific. Here’s basically how they work:

  1. Give them a list of file descriptors to get information about
  2. They tell you which ones have data available to read/write to

The first surprising thing I learned from this chapter is that poll and select fundamentally use the same code.

I went to look at the definition of poll and select in the Linux kernel source to confirm this and it’s true!

They both call a lot of the same functions. One thing that the book mentioned in particular is that poll returns a larger set of possible results for file descriptors like POLLRDNORM | POLLRDBAND | POLLIN | POLLHUP | POLLERR while select just tells you “there’s input / there’s output / there’s an error”.

select translates from poll’s more detailed results (like POLLWRBAND) into a general “you can write”. You can see the code where it does this in Linux 4.10 here.

The next thing I learned is that poll can perform better than select if you have a sparse set of file descriptors.

To see this, you can actually just look at the signatures for poll and select!

int ppoll(struct pollfd *fds, nfds_t nfds,
          const struct timespec *tmo_p, const sigset_t *sigmask);
int pselect(int nfds, fd_set *readfds, fd_set *writefds,
            fd_set *exceptfds, const struct timespec *timeout,
            const sigset_t *sigmask);

With poll, you tell it “here are the file descriptors I want to monitor: 1, 3, 8, 19, etc” (that’s the pollfd argument). With select, you tell it “I want to monitor 19 file descriptors. Here are 3 bitsets with which ones to monitor for reads / writes / exceptions.” So when it runs, it loops from 0 to 19 file descriptors, even if you were actually only interested in 4 of them.

There are a lot more specific details about how poll and select are different in the chapter but those were the 2 main things I learned!

why don’t we use poll and select?

Okay, but on Linux we said that your node.js server won’t use either poll or select, it’s going to use epoll. Why?

From the book:

On each call to select() or poll(), the kernel must check all of the specified file descriptors to see if they are ready. When monitoring a large number of file descriptors that are in a densely packed range, the time required for this operation greatly outweighs [the rest of the stuff they have to do]

Basically: every time you call select or poll, the kernel needs to check from scratch whether your file descriptors are available for writing. The kernel doesn’t remember the list of file descriptors it’s supposed to be monitoring!

Signal-driven I/O (is this a thing people use?)

The book actually describes 2 ways to ask the kernel to remember the list of file descriptors it’s supposed to be monitoring: signal-driven I/O and epoll. Signal-driven I/O is a way to get the kernel to send you a signal when a file descriptor is updated by calling fcntl. I’ve never heard of anyone using this and the book makes it sound like epoll is just better so we’re going to ignore it for now and talk about epoll.

level-triggered vs edge-triggered

Before we talk about epoll, we need to talk about “level-triggered” vs “edge-triggered” notifications about file descriptors. I’d never heard this terminology before (I think it comes from electrical engineering maybe?). Basically there are 2 ways to get notifications:

  • get a list of every file descriptor you’re interested in that is readable (“level-triggered”)
  • get notifications every time a file descriptor becomes readable (“edge-triggered”)

what’s epoll?

Okay, we’re ready to talk about epoll!! This is very exciting to me because I’ve seen epoll_wait a lot when stracing programs and I often feel kind of fuzzy about what it means exactly.

The epoll group of system calls (epoll_create, epoll_ctl, epoll_wait) give the Linux kernel a list of file descriptors to track, and ask it for updates about whether there’s any new activity on those file descriptors.

Here are the steps to using epoll:

  1. Call epoll_create to tell the kernel you’re going to be epolling! It gives you an id back.
  2. Call epoll_ctl to tell the kernel file descriptors you’re interested in updates about. Interestingly, you can give it lots of different kinds of file descriptors (pipes, FIFOs, sockets, POSIX message queues, inotify instances, devices, & more), but not regular files. I think this makes sense – pipes & sockets have a pretty simple API (one process writes to the pipe, and another process reads!), so it makes sense to say “this pipe has new data for reading”. But files are weird! You can write to the middle of a file! So it doesn’t really make sense to say “there’s new data available for reading in this file”.
  3. Call epoll_wait to wait for updates about the list of files you’re interested in.

performance: select & poll vs epoll

In the book there’s a table comparing the performance for 100,000 monitoring operations:

# of file descriptors monitored   poll (s)   select (s)   epoll (s)
10                                0.61       0.73         0.41
100                               2.9        3.0          0.42
1000                              35         35           0.53
10000                             990        930          0.66

So using epoll really is a lot faster once you have more than 10 or so file descriptors to monitor.

who uses epoll?

I sometimes see epoll_wait when I strace a program. Why? There is the kind of obvious but unhelpful answer “it’s monitoring some file descriptors”, but we can do better!

First – if you’re using green threads or an event loop, you’re likely using epoll to do all your networking & pipe I/O!

For example, here’s a golang program that uses epoll on Linux!

package main

import "net/http"
import "io/ioutil"

func main() {
    resp, err := http.Get("http://example.com/")
    if err != nil {
        // handle error
    }
    defer resp.Body.Close()
    _, err = ioutil.ReadAll(resp.Body)
}

Here you can see the golang run time using epoll to do a DNS lookup:

16016 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.1.1")}, 16 <unfinished ...>
16020 socket(PF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP
16016 epoll_create1(EPOLL_CLOEXEC <unfinished ...>
16016 epoll_ctl(5, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=334042824, u64=139818699396808}}
16020 connect(4, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.1.1")}, 16 <unfinished ...>
16020 epoll_ctl(5, EPOLL_CTL_ADD, 4, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=334042632, u64=139818699396616}}

Basically what this is doing is connecting 2 sockets (on file descriptors 3 and 4) to make DNS queries (to 127.0.1.1:53), and then using epoll_ctl to ask epoll to give us updates about them.

Then it makes 2 DNS queries for example.com (why 2? nelhage suggests one of them is querying for the A record, and one for the AAAA record!), and uses epoll_wait to wait for replies.

# these are DNS queries for example.com!
16016 write(3, "\3048\1\0\0\1\0\0\0\0\0\0\7example\3com\0\0\34\0\1", 29
16020 write(4, ";\251\1\0\0\1\0\0\0\0\0\0\7example\3com\0\0\1\0\1", 29
# here it tries to read a response but I guess there's no response
# available yet
16016 read(3,  <unfinished ...>
16020 read(4,  <unfinished ...>
16016 <... read resumed> 0xc8200f4000, 512) = -1 EAGAIN (Resource temporarily unavailable)
16020 <... read resumed> 0xc8200f6000, 512) = -1 EAGAIN (Resource temporarily unavailable)
# then it uses epoll to wait for responses
16016 epoll_wait(5,  <unfinished ...>
16020 epoll_wait(5,  <unfinished ...>

So one reason your program might be using epoll is: “it’s in Go / node.js / Python with gevent and it’s doing networking”.

What libraries do go/node.js/Python use to use epoll?

Webservers also implement epoll – for example here’s the epoll code in nginx.

more select & epoll reading

I liked these 3 posts by Marek:

In particular these talk about how epoll’s support for multithreaded programs has not historically been good, though there were some improvements in Linux 4.5.

and this: