Linux CFS 调度器

Apr 7, 2024 20:30 · 3886 words · 8 minute read Linux

原文：https://github.com/torvalds/linux/blob/v5.4/Documentation/scheduler/sched-design-CFS.rst

1. 概述

CFS 代表“完全公平调度器”（Completely Fair Scheduler），在 Linux 2.6.23 中合并，代替之前调度器的 SCHED_OTHER 代码。

CFS 绝大部分设计都可以概括为一句话：将物理硬件建模为一个“理想化的、精确的多任务 CPU”。

“理想化的多任务 CPU”（并不存在）具有百分百的物理能力，并能以相等的速度精确地并行运行每个任务。假设有两个任务在跑，它将以 50% 的物理能力运行每个任务——也就是说，实际上是并行的。

在真实硬件上，我们一次只能运行一个任务，所以要引入“虚拟运行时间”（virtual runtime）的概念。任务的虚拟运行时间指定了其下一个时间片在上述的理想化多任务 CPU 上何时开始执行。实际上任务的虚拟运行时间是其实际运行时间根据正在运行的任务总数标准化得来的。

2. 一点实现细节

在 CFS 中虚拟运行时间通过每个任务的 p->se.vruntime（纳秒单位）值表示和追踪。通过这种方式可以准确地标记和测量任务应该获得的“预期 CPU 时间”。

在理想化的硬件上，任意时间所有任务都有着相同的 p->se.vruntime 值，即任务会同时执行，且没有任务会从“理想的” CPU 时间份额中失衡。

CFS 的任务选择逻辑就基于 p->se.vruntime，因此非常简单：它总会试图运行有着最小 p->se.vruntime 值的任务（即到目前为止执行次数最少的任务）。CFS 始终尝试将 CPU 时间尽可能以接近“理想化的多任务硬件”的方式分配给可运行的任务。

大部分设计都遵循这个非常简单的概念，再辅以一些附加的修饰，就像优先级、多进程和各种算法变体来识别休眠者。

Linux 中每个任务（无论是进程还是线程）都用 task_struct 结构表示，上文中的 p 对象就是一个 task_struct 实例。这是一个相当长的结构体，包含各种类型的数据（任务状态、运行统计信息、进程亲缘关系等等），调度实体也在其中：

struct task_struct {
    // a lot of code here

    int    on_rq; // 是否运行在队列上
    // 优先级
    int    prio;
    int    static_prio;
    int    normal_prio;
    unsigned int   rt_priority;
    // 调度器类
    const struct sched_class *sched_class;
    // 调度实体
    struct sched_entity  se;
    struct sched_rt_entity  rt;

    struct sched_dl_entity  dl;

    unsigned int   policy; // 调度策略
    int    nr_cpus_allowed;
};

p->se 对象是 sched_entity 结构体的实例：

struct sched_entity {
    /* For load-balancing: */
    struct load_weight  load;
    unsigned long   runnable_weight;
    struct rb_node   run_node;
    struct list_head  group_node;
    unsigned int   on_rq;

    u64    exec_start;
    u64    sum_exec_runtime;
    u64    vruntime;
    u64    prev_sum_exec_runtime;

    u64    nr_migrations;

    struct sched_statistics  statistics;

#ifdef CONFIG_FAIR_GROUP_SCHED
    int    depth;
    struct sched_entity  *parent;
    /* rq on which this entity is (to be) queued: */
    struct cfs_rq   *cfs_rq;
    /* rq "owned" by this entity/group: */
    struct cfs_rq   *my_q;
#endif

    // a lot of code here
};

3. 红黑树

CFS 的设计相当激进：不使用旧的数据结构来处理运行队列，而是用按时间排序的红黑树来构建一个未来任务执行的时间线，因此没有“数组切换”这种影响。

CFS 也维护 rq->cfs.min_vruntime，这个单调递增的值用于追踪运行队列中所有任务的最小虚拟运行时间。系统完成的工作总量使用 min_vruntime 来追踪，该值被用来尽可能地将新激活的实体放在树左侧。

CFS 维护一棵按时间排序的红黑树，其中所有可运行的任务都按照 p->se.vruntime 键进行排序。CFS 从这棵树中选择“最左边的”任务，随着系统向前推进，越来越多执行过的任务被放入树的右侧——虽然慢但每个任务都一定可能成为“最左侧的任务”，因此在确定的时间内获得 CPU。

总的来说：CFS 每次运行一个任务一会，当任务调度时，该任务的 CPU 使用情况会被“记账”，它刚才使用物理 CPU 的时间会被加到 p->se.vruntime，当 p->se.vruntime 膨胀到一定程度，那么另一个任务就会被安排到红黑树中最左侧，那么新的最左侧任务会被选取，当前任务会被抢占。

Linux 为每个 CPU 维护一个 rq（runqueue）结构对象，表示在该 CPU 上运行的所有任务：

/*
 * This is the main, per-CPU runqueue data structure.
 *
 * Locking rule: those places that want to lock multiple runqueues
 * (such as the load balancing or the thread migration code), lock
 * acquire operations must be ordered by ascending &runqueue.
 */
struct rq {
    /* runqueue lock: */
    raw_spinlock_t  lock;

    /*
     * nr_running and cpu_load should be in the same cacheline because
     * remote CPUs use both these fields when doing load calculation.
     */
    unsigned int  nr_running;

    unsigned long  nr_load_updates;
    u64   nr_switches;

    struct cfs_rq  cfs; // CFS 运行队列
    struct rt_rq  rt; // 实时运行队列
    struct dl_rq  dl;

    // a lot of code here
};

在调度时调度器首先去实时队列查找是否有实时进程需要运行，如果没有的话再去 CFS 运行队列查找，即 rq.cfs 对象，在 Linux 中其结构体定义是 cfs_rq。

struct cfs_rq {
    struct load_weight load;
    unsigned long  runnable_weight;
    unsigned int  nr_running;
    unsigned int  h_nr_running;      /* SCHED_{NORMAL,BATCH,IDLE} */
    unsigned int  idle_h_nr_running; /* SCHED_IDLE */

    u64   exec_clock;
    u64   min_vruntime;

    struct rb_root_cached tasks_timeline;

    /*
     * 'curr' points to currently running entity on this cfs_rq.
     * It is set to NULL otherwise (i.e when none are currently running).
     */
    struct sched_entity *curr;
    struct sched_entity *next;
    struct sched_entity *last;
    struct sched_entity *skip;

    // a lot of code here
};

上文所说的时间线红黑树在 rb_root_cached 结构中：

struct rb_node {
    unsigned long  __rb_parent_color;
    struct rb_node *rb_right;
    struct rb_node *rb_left;
} __attribute__((aligned(sizeof(long))));
    /* The alignment might seem pointless, but allegedly CRIS needs it */

struct rb_root {
    struct rb_node *rb_node;
};

/*
 * Leftmost-cached rbtrees.
 *
 * We do not cache the rightmost node based on footprint
 * size vs number of potential users that could benefit
 * from O(1) rb_last(). Just not worth it, users that want
 * this feature can always implement the logic explicitly.
 * Furthermore, users that want to cache both pointers may
 * find it a bit asymmetric, but that's ok.
 */
struct rb_root_cached {
    struct rb_root rb_root;
    struct rb_node *rb_leftmost;
};

rb_root 指向红黑树根节点，rb_leftmost 指向最左侧的节点（任务）。

每个进程/线程都有 task_struct，task_struct 对象中都有“调度实体” sched_entity，而 sched_entity 对象中就有一个红黑树节点。这样 sched_entity 就和红黑树挂接起来了。

4.一些特性

CFS 使用纳秒级的粒度，不依赖于任何 jiffies 或其他 HZ 细节。因此 CFS 调度器不像以前的那些有“时间片”的概念，只有一个参数可调（得打开 CONFIG_SCHED_DEBUG 才行）：

/proc/sys/kernel/sched_min_granularity_ns

该内核参数可以用来调整调度器的负载类型，比如桌面型负载（低延迟）和服务器型负载（良好的批处理）。

正因为它的设计，CFS 调度器不易受到现有针对调度器算法的攻击。

CFS 调度器在处理优先级和 SCHED_BATCH 方面比之前的调度器要强得多：这两类工作负载被更加积极地隔离。

SMP 负载均衡器已经重新设计：现在运行队列的遍历假设已经从负载均衡代码中删除，并用了调度模块的迭代器。因此代码简单了许多。

5.调度策略

CFS 实现了三种调度策略：

SCHED_NORMAL：用于常规任务（普通进程）的调度策略。
SCHED_BATCH：（后台进程）不像常规任务那样经常抢占，从而允许任务运行更久并更好地使用缓存，但是会以交互性为代价。非常适合批处理作业。
SCHED_IDLE：比 nice 19 优先级还低，但不是真正空闲的调度器计时器，避免陷入优先级反转问题，这会导致机器死锁。
SCHED_FIFO：优先级一样的，先来先得；优先级更高的可以抢占低优先级。
SCHED_RR：优先级一样的，轮着来；任务用完时间片后被放到队尾；优先级更高的也可以抢占低优先级。
SCHED_DEADLINE：按任务的截止时间调度，选择距离当前时间点最近的任务。

SCHED_FIFO/SCHED_RR 都在 sched/rt.c 中实现，遵循 POSIX。

调度策略也在 task_struct 结构中，配合优先级一起：

struct task_struct {
    // a lot of code here

    int    prio;
    int    static_prio;
    int    normal_prio;

    unsigned int   policy; // 调度策略
};

还有调度策略的定义 https://github.com/torvalds/linux/blob/v5.4/tools/include/uapi/linux/sched.h#L76-L85：

/*
 * Scheduling policies
 */
#define SCHED_NORMAL  0
#define SCHED_FIFO  1
#define SCHED_RR  2
#define SCHED_BATCH  3
/* SCHED_ISO: reserved but not implemented yet */
#define SCHED_IDLE  5
#define SCHED_DEADLINE  6

6. 调度类

CFS 调度器被设计成可扩展的调度模块层次结构，来引入“调度类”（Scheduling Classes）。模块封装了调度策略的细节，并由核心调度器处理。

sched/fair.c 实现了上述的 CFS 调度器
sched/rt.c 实现了 SCHED_FIFO 和 SCHED_RR 语义，比之前的调度器简单。对于 RT 任务，它使用 100 优先级（0 ~ 99）而非之前的 140，无需过期数组。

调度类通过 sched_class 结构实现，包含了某些事件发生时要调用的一些钩子函数：

enqueue_task()

当任务进入可运行状态时调用。将“调度实体”放入红黑树并增加 nr_running 变量的值。
dequeue_task()

当任务不再可运行，调用该方法将相应的“调度实体”脱钩红黑树。会减少 nr_running 变量的值。
yield_task()

基本上只是出队后跟上一个入队操作，除非 compat_yield 内核参数打开；在打开的情况下，会将“调度实体”置于红黑树的最右侧。
check_preempt_curr()

当一个任务进度可运行状态，检查是否应该抢占当前正在运行的任务。
pick_next_task()

选择最适合接下来运行的任务。
set_curr_task()

当任务改变其调度类或任务组时调用。
task_tick()

主要是定时器调用；可能导致进程切换。驱动任务抢占。

每个任务都有指定的“调度类”，即 task_struct 结构体的 sched_class 字段：

struct task_struct {
    // a lot of code here

    const struct sched_class *sched_class;
};

Linux 中 sched_class 的定义：

struct sched_class {
    const struct sched_class *next;

#ifdef CONFIG_UCLAMP_TASK
    int uclamp_enabled;
#endif

    void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
    void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
    void (*yield_task)   (struct rq *rq);
    bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);

    void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);

    struct task_struct * (*pick_next_task)(struct rq *rq,
                           struct task_struct *prev,
                           struct rq_flags *rf);
    void (*put_prev_task)(struct rq *rq, struct task_struct *p);
    void (*set_next_task)(struct rq *rq, struct task_struct *p);

    void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
    void (*task_fork)(struct task_struct *p);
    void (*task_dead)(struct task_struct *p);

    void (*switched_from)(struct rq *this_rq, struct task_struct *task);
    void (*switched_to)  (struct rq *this_rq, struct task_struct *task);
    void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
                  int oldprio);

    unsigned int (*get_rr_interval)(struct rq *rq,
                    struct task_struct *task);

    void (*update_curr)(struct rq *rq);
};

几种调度类 https://github.com/torvalds/linux/blob/v5.4/kernel/sched/sched.h#L1798-L1802：

extern const struct sched_class stop_sched_class;
extern const struct sched_class dl_sched_class;
extern const struct sched_class rt_sched_class;
extern const struct sched_class [fair_sched_class];
extern const struct sched_class idle_sched_class;

stop_sched_class

优先级最高的任务所使用，会中断其他所有任务，且不会被其他任务打断
dl_sched_class（deadline）

对应 deadline 调度策略
rt_sched_class（realtime）

对应 SCHED_FIFO 和 SCHED_RR 调度策略
fair_sched_class

普通任务的调度策略（SCHED_NORMAL）
idle_sched_class

空闲任务的调度策略（SCHED_IDLE）

7. 调度器扩展

通常调度器操作单个任务，并努力为每个任务提供公平的 CPU 时间。有时候我们可能希望将任务分组并为组内的每个任务提供公平的 CPU 时间。比如为系统上的某个用户的每个任务提供公平的 CPU 时间。

CONFIG_CGROUP_SCHED 实现上面的需求
CONFIG_RT_GROUP_SCHED 允许对实时任务（SCHED_FIFO 和 SCHED_RR）分组
CONFIG_FAIR_GROUP_SCHED 允许对 CFS 任务分组（SCHED_NORMAL 和 SCHED_BATCH）分组

这些操作需要定义 CONFIG_CGROUPS，使用 cgroup 伪文件系统为任务分组。详见 https://github.com/torvalds/linux/blob/v5.4/Documentation/admin-guide/cgroup-v1/cgroups.rst。

定义了 CONFIG_FAIR_GROUP_SCHED 后，会为每个组创建一个 cpu.shares 文件：

mount -t tmpfs cgroup_root /sys/fs/cgroup
mkdir /sys/fs/cgroup/cpu
mount -t cgroup -ocpu none /sys/fs/cgroup/cpu
cd /sys/fs/cgroup/cpu

mkdir multimedia  # create "multimedia" group of tasks
mkdir browser     # create "browser" group of tasks

# #Configure the multimedia group to receive twice the CPU bandwidth
# #that of browser group

echo 2048 > multimedia/cpu.shares
echo 1024 > browser/cpu.shares

# firefox & # Launch firefox and move it to "browser" group
echo <firefox_pid> > browser/tasks

# #Launch gmplayer (or your favourite movie player)
echo <movie_player_pid> > multimedia/tasks