Do modern operating systems really need swap?

Apr 3, 2021 22:00 · 2672 words · 6 minute read Linux OS

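Translation

Introduction

Virtual memory management (VMM) is code in the kernel that gives each process its own virtual address space.
Overcommitment means that processes can request more memory from the kernel than is available as physical RAM plus swap. Note that this only allows more memory to be requested; writing to more memory than is available is not possible.
A page is a piece of memory.
Paging is the operation of copying a page, for example to a swap device.
Swapping is the use of swap devices, that is, copying pages between physical memory and a swap device.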

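When is swap used?

Swap gives processes room even when the system's physical memory is already used up. In a normal configuration, swap is used when the system comes under memory pressure, and once the pressure is gone and the system returns to normal operation, swap is no longer used. In this typical scenario, swap helps the system through the memory shortage at the cost of reduced performance while swapping.

Pages that have been swapped out do not return to RAM unless they are accessed. This is why counters sometimes still show swapped pages even though the system is no longer actively swapping.

Let's look at the speed of RAM, SSDs, and rotating hard disks. A RAM access takes around 100 nanoseconds, accessing data on an SSD takes about 150 microseconds (1,500 times as long as RAM), and accessing data on a rotating disk takes about 10 milliseconds (100,000 times as long as RAM).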

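Now consider a system with 2GB of RAM, a 2GB swap device, and a rotating hard disk. After boot, system processes use part of the RAM. The illustration was created with Performance Co-Pilot, which is part of RHEL.

On a freshly booted system, most memory is free.
Now we start some processes that request and write to only a small part of RAM and read data from disk. The kernel uses the otherwise idle memory to cache the data being read, so if the data is requested again, it can be served from the cache.
Now the processes start writing to more RAM. The kernel first gives up the RAM used for disk caching and hands it to the processes. Once the processes request more memory than is physically available, the kernel starts to use swap. This is transparent to the processes, except that swap is orders of magnitude slower than physical RAM.
If processes try to write to more memory than is available including swap, the Out of Memory (OOM) handler is triggered and picks a process to kill.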

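How is swap used?

In the past, some application vendors recommended a swap size equal to the RAM, or even twice the RAM. The system above has 2GB of RAM and 2GB of swap. A database on it was mistakenly configured to use 5GB of RAM. Once physical memory is used up, swap is used. Because the swap disk is much slower than RAM, performance drops and thrashing occurs. At this point, even logging in to the system may become impossible. As more and more memory is written to, physical memory and swap are eventually both exhausted and the OOM killer kicks in. In our example, quite a lot of swap is available, so the period of poor performance lasts a long time.

By contrast, imagine the same scenario with no swap configured. When the system runs out of RAM, there is no swap to fall back on. There is almost no period of reduced performance; the OOM killer kicks in immediately. In this case:

The admins have no time to react or take countermeasures that might resolve the issue, so the application is all but certain to lose data.
The operations team is notified only after the incident and can do no more than analyze it in hindsight.

Our recommendation for most modern systems is to size swap as a fraction of physical RAM, for example 20%. That way, the painfully slow phase in our example does not last as long, and the OOM killer kicks in earlier.

Of course, some scenarios call for different behaviour. If you understand the consequences, such swap configurations are fine, as is running the system without any swap.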

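How much swap is recommended nowadays?

It depends on the behaviour you want from the system, but configuring swap at 20% of RAM is usually a good choice.

Can I run without swap? Is further tuning possible?

Running a system without swap can make sense, as long as you are sure that its behaviour under memory pressure is what you want. In most environments, a bit of swap is still worthwhile.

The Committed_AS field in /proc/meminfo shows how much memory processes have requested.
With sysctl, overcommit can be enabled or disabled, and the amount of overcommit allowed can be configured. The defaults need changing only in rare cases, and only after thorough testing.
A separate document describes swap tuning in detail, for example changing vm.swappiness; this too requires good testing with your applications.
Without swap, the system invokes the OOM killer when memory is exhausted. You can set oom_score_adj to adjust which processes are killed first.
If you are developing an application and want to lock pages in RAM so that they are not swapped out, you can use mlock().
If your application uses swap regularly, put swap on a faster device such as an SSD.
The Storage Administration Guide also has a section on swap configuration.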

Original

Introduction: Important terms

Virtual Memory Management (VMM) is code in the kernel which, among other things, helps us to present each process with its own virtual address space.
Overcommitment means that processes can request more memory from the kernel than we have available physically and as swap. Note that this only allows processes to request more; writing to more memory than is available is not possible.
A page is a piece of memory.
Paging is the operation of making a copy of a page, for example to a swap device.
Swapping is the usage of swap devices, so making copies of pages between physical memory and a swap device.
Thrashing is when swapping uses more resources than the other processes on the system.

Overview: When is swap used?

Swap is used to give processes room, even when the physical RAM of the system is already used up. In a normal system configuration, when a system faces memory pressure, swap is used, and later when the memory pressure disappears and the system returns to normal operation, swap is no longer used. In this typical situation, swap helped through the time of memory shortage, at the cost of reduced performance while swapping.

Pages that were stored in swap are not moved back into RAM unless they are accessed/requested. This is the reason that many counters show swapped pages, although the system might already be in a state where no active swapping is happening.

Let’s look at typical speed scales of RAM, SSDs and rotating disks. A typical RAM access is in the area of 100ns, accessing data on an SSD takes about 150μs (1,500 times as long as RAM), and accessing data on a rotating disk takes about 10ms (100,000 times as long as RAM).
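The ratios quoted above follow directly from the access times. A minimal sketch of the arithmetic, assuming the round figures from the text rather than measurements (the constant and function names are illustrative):

```python
# Access times from the text, expressed in nanoseconds (round figures, not measurements).
RAM_NS = 100              # about 100 ns
SSD_NS = 150 * 1_000      # 150 microseconds
HDD_NS = 10 * 1_000_000   # 10 milliseconds


def slowdown_vs_ram(device_ns: int) -> int:
    """Return how many times longer a device access takes than a RAM access."""
    return device_ns // RAM_NS


print("SSD vs RAM:", slowdown_vs_ram(SSD_NS))
print("HDD vs RAM:", slowdown_vs_ram(HDD_NS))
```

This gap is why a system that has to serve memory accesses from a swap device feels so much slower than one working purely from RAM.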

Let’s look at the behaviour of a system with 2GB of RAM, a 2GB swap device, and a hard disk. Just after boot, system processes are using only a few dozen MB of RAM. This illustration was created with Performance Co-Pilot, which is part of Red Hat Enterprise Linux.

On a freshly booted system, most memory is free.
Now we start processes, requesting and writing to only a small part of RAM, and reading data from disk. As our RAM would otherwise be unused, the kernel is using it to cache the data we are reading. If data is requested a second time, it is available from the cache.
Now we start processes, writing to more RAM. The kernel will now sacrifice the RAM used for disk caching and hand it out to the processes. Once the processes are requesting more memory than we have physically available, the kernel starts to utilize the swap. The process itself sees no difference - just that the swap is orders of magnitude slower than physical RAM.
If processes attempt to write to more memory than is available including swap, the Out of Memory (OOM) handler has to pick a process and kill it.

The details: How is swap used?

In the past, some application vendors recommended swap of a size equal to the RAM, or even twice the RAM. Now let us imagine the above-mentioned system with 2GB of RAM and 2GB of swap. A database on the system was mistakenly configured for a system with 5GB of RAM. Once the physical memory is used up, swap gets used. As the swap disk is much slower than RAM, the performance goes down, and thrashing occurs. At this point, even logins into the system might become impossible. As more and more memory gets written to, eventually both physical memory and swap are completely exhausted and the OOM killer kicks in, killing one or more processes. In our case, quite a lot of swap is available, so the time of poor performance is long.

Now, let us imagine the above situation with no swap configured. As the system runs out of RAM, it has no swap to hand out. There is almost no time frame of reduced performance - the OOM killer kicks in immediately. So in this case:

The admins have no time to react and take countermeasures that might solve the issue without the application losing data. They might decide to reset the system themselves.
The admin teams get a notification after the incident and can then only analyze the issue.

Our size recommendation for most modern systems is ‘a part of the physical RAM’, for example, 20%. With this, the painfully slow phase of operation in our example will not last as long, and the OOM kicks in earlier.
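To make the sizing trade-off concrete, here is a rough sketch using the example numbers. It assumes a deliberately simplified model in which the slow phase lasts for as long as swap is being filled and the OOM killer fires once RAM plus swap are exhausted. The function names are hypothetical, and a real kernel behaves less tidily.

```python
def recommended_swap_mib(ram_mib: int, percent: int = 20) -> int:
    """Swap size as a percentage of RAM (20% is the example figure from the text)."""
    return ram_mib * percent // 100


def oom_threshold_mib(ram_mib: int, swap_mib: int) -> int:
    """Simplified model: memory that can be written before RAM and swap run out."""
    return ram_mib + swap_mib


RAM_MIB = 2048        # the 2GB system from the example
DB_WANTS_MIB = 5120   # database misconfigured for 5GB

configs = (
    ("swap = RAM", RAM_MIB),
    ("swap = 20% of RAM", recommended_swap_mib(RAM_MIB)),
)
for label, swap_mib in configs:
    limit = oom_threshold_mib(RAM_MIB, swap_mib)
    print(f"{label}: {swap_mib} MiB of swap to fill during the slow phase, "
          f"OOM after {limit} MiB written (demand: {DB_WANTS_MIB} MiB)")
```

In both configurations the misconfigured database eventually exceeds the limit; the smaller swap simply leaves far less slow storage to fill before the OOM killer steps in.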

Of course, there are scenarios when different behaviour is desired. When aware of the behaviour, such swap configurations are ok, as well as running the system without any swap. Such a system is supported by us as well - but the customer should know the behaviour in the above situations.

How much swap is recommended nowadays?

This depends on the desired behaviour of the system, but configuring an amount of 20% of the RAM as swap is usually a good idea.

Can I run without swap? Is further tuning possible?

Systems without swap can make sense and are supported by Red Hat - just be sure the behaviour of such a system under memory pressure is what you want. In most environments, a bit of swap makes sense.

/proc/meminfo Committed_AS field shows how much memory processes have requested.
Using sysctl, we can enable/disable overcommit, and configure how much overcommit should be allowed. The defaults need to be changed only in rare cases, and after properly testing the new settings.
A solution document gives details on tuning how likely the system is to swap, for example by changing vm.swappiness. This also requires good testing with your applications.
Without swap, the system will call the OOM killer when the memory is exhausted. You can prioritize which processes get killed first by configuring oom_score_adj.
If you write an application and want to lock pages into RAM to prevent them from being swapped, mlock() can be used.
If you design your applications to regularly use swap, make sure to use faster devices, like SSD. Starting with Red Hat Enterprise Linux 7.1, ‘swapon --discard’ can be used to send TRIM to SSD devices, to discard the device contents on swapon.
The Storage Administration Guide also has a section on swap configuration.
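As a small illustration of the first item in the list above, the following sketch pulls Committed_AS and the swap counters out of /proc/meminfo-style text. The helper name `parse_meminfo` and the sample values are made up for this example; the field names are the ones the kernel reports.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: int}; most values are in kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only well-formed "Name:   <number> [unit]" lines.
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


# Made-up sample in the /proc/meminfo format, so the sketch runs on any platform.
SAMPLE = """\
MemTotal:        2048000 kB
SwapTotal:        409600 kB
SwapFree:         409600 kB
CommitLimit:     1433600 kB
Committed_AS:    1200000 kB
"""

info = parse_meminfo(SAMPLE)
swap_used_kb = info["SwapTotal"] - info["SwapFree"]
print("Committed_AS (kB):", info["Committed_AS"])
print("Swap in use (kB):", swap_used_kb)
```

On a Linux host, the same function could be fed the contents of /proc/meminfo instead of the sample text. The related sysctl settings, such as vm.swappiness, are exposed as files under /proc/sys/vm/.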
