Operating Systems · 45 min

CPU Scheduling

Many runnable threads, few cores. FIFO, SJF, round-robin, MLFQ, and Linux CFS define the path from simple policies to fair-share kernels.

Why This Matters

A 64-core machine running 2,000 runnable threads has at least 1,936 threads waiting at any instant. The scheduler chooses which 64 run now, which cache lines stay hot, which request waits 2 ms rather than 200 ms, and which batch job gets half a socket instead of all of it.

Bad scheduling has visible signatures. FIFO can make one 1-second CPU-bound job delay fifty 10-ms interactive jobs. A time quantum of 100 microseconds can burn much of a core in context-switch overhead. A fair-share scheduler that ignores NUMA placement can move a thread from local DRAM latency near 90 ns to remote latency above 140 ns on common dual-socket servers.

Core Definitions

Definition

Turnaround time

For a job j, turnaround time is T_j = C_j − A_j, where A_j is arrival time and C_j is completion time. Batch systems often minimize average turnaround (1/n) · Σ_j T_j.

Definition

Response time

For an interactive job j, response time is R_j = S_j − A_j, where S_j is the first time the job runs. A shell command that arrives at time 0 and first runs at time 12 ms has response time 12 ms even if it completes at 90 ms.

Definition

Throughput

Throughput is completed jobs per unit time. If 240 requests finish in 2 seconds, throughput is 120 requests per second. High throughput and low tail latency often conflict because batching improves cache locality and amortizes overhead.

Definition

Fairness

A common fairness target is proportional share. If tasks A and B have weights 1024 and 512, then over a long interval A should receive about 1024/1536 = 2/3 of the CPU and B about 1/3.

Metrics With a Worked Schedule

Consider one CPU and three jobs arriving at time 0.

Job   CPU burst
A     24 ms
B     3 ms
C     3 ms

Under FIFO with order A, B, C, the execution line is:

time: 0                      24   27   30
      |---------- A ----------| B | C |

Completion times are CA=24C_A=24, CB=27C_B=27, CC=30C_C=30. Since all arrive at 0, turnaround times are 24, 27, and 30 ms, with average 27 ms. Response times are 0, 24, and 27 ms, with average 17 ms.

Under shortest-job-first, the line is:

time: 0   3   6                      30
      | B | C |---------- A ----------|

Turnaround times are 3, 6, and 30 ms, with average 13 ms. Response times are 0, 3, and 6 ms if the order is B, C, A. The same workload has the same CPU demand, 30 ms, but very different user-visible waiting.

Fairness is a separate metric. If A, B, and C are long-running services with equal weights, a scheduler that gives A the first 24 ms is not fair over short windows even though it has zero idle time.

FIFO, SJF, and the Convoy Effect

FIFO, also called FCFS, keeps a queue of runnable jobs. New arrivals append to the tail, and the CPU runs the head until it blocks or exits.

struct task {
    int pid;
    int remaining_ms;
    struct task *next;
};

void fcfs_tick(struct task **head) {
    struct task *t = *head;
    if (t == NULL)
        return;               // empty run queue: the CPU idles
    t->remaining_ms--;
    if (t->remaining_ms == 0) {
        *head = t->next;      // next job starts only after completion
    }
}

FIFO has minimal policy overhead. It also creates the convoy effect. Suppose one CPU-bound job A needs 100 ms and ten I/O-bound jobs each need 1 ms of CPU before issuing disk I/O. All arrive at time 0 behind A.

0                                                          100 101 ... 110
|---------------------------- A ----------------------------|B0|B1|...|B9|

Average response time for the ten I/O-bound jobs is (100 + 101 + ... + 109)/10 = 104.5 ms. If those short jobs ran first, their average response time would be (0 + 1 + ... + 9)/10 = 4.5 ms. FIFO wastes no CPU cycles, but it can waste human time and leave I/O devices idle while short I/O-bound work waits.

Shortest-job-first chooses the runnable job with the smallest next CPU burst. If all jobs arrive together and their lengths are known, it minimizes average turnaround time. The catch is operational, not mathematical: the operating system rarely knows the next CPU burst. It can estimate from history using exponential averaging, such as E_{k+1} = α·B_k + (1 − α)·E_k, where B_k is the measured last burst.

With α = 0.5, a task with estimated burst 8 ms and observed bursts 2 ms, 2 ms, and 10 ms gets successive estimates 5, 3.5, and 6.75 ms. This adapts, but a phase change can fool it.

Round-Robin, Priority, and Aging

Round-robin gives each runnable task a time quantum q, then moves it to the tail of the queue if it remains runnable. For three CPU-bound tasks and q = 4 ms:

0   4   8   12  16  20  24
| A | B | C | A | B | C | ...

The quantum trades response time against overhead. If a context switch costs 5 microseconds and the quantum is 100 microseconds, overhead is about 5/105 ≈ 4.76% of CPU time under continuous switching. If the quantum is 10 ms, overhead is about 5/10005 ≈ 0.05%, but an interactive task behind 20 runnable tasks can wait nearly 200 ms before its next slice.

Priority scheduling chooses the highest-priority runnable task. It maps cleanly to kernel needs: a disk interrupt handler thread can outrank a background checksum process. It also risks starvation. If priority 100 tasks keep arriving, a priority 10 task might never run.

Aging raises the effective priority of waiting tasks. One simple rule is:

// Larger value means higher priority.
int effective_priority(struct task *t, long now_ms) {
    long waited = now_ms - t->last_runnable_ms;
    long bonus = waited / 50;          // +1 priority per 50 ms waiting
    return t->base_priority + bonus;
}

A task with base priority 10 waiting 500 ms gets a bonus of 10 and competes as priority 20. Aging bounds starvation without discarding priority as a signal.

Multi-Level Feedback Queue

Multi-level feedback queue scheduling uses several round-robin queues. Higher queues have shorter quanta and higher priority. New tasks start high. A task that uses its whole quantum moves down. A task that blocks before its quantum is exhausted stays high or moves up. Classic BSD and UNIX schedulers used variants of this idea because it learns behavior online.

A toy three-level MLFQ:

Queue   Priority   Quantum
Q0      highest    4 ms
Q1      middle     8 ms
Q2      lowest     16 ms

Run this workload. Task I is interactive and alternates 1 ms CPU with I/O. Task C is CPU-bound and always runnable. Both arrive at time 0.

0 1  5 6      14 15               31
|I|C |I|  C   |I|       C          |
   Q0   Q0/Q1     Q0/Q1/Q2

C uses its full 4 ms at Q0 and drops to Q1. Then it uses full 8 ms at Q1 and drops to Q2. I keeps returning to Q0 because it blocks after 1 ms. The policy approximates SJF without knowing job lengths: tasks with short CPU bursts receive quick service, while CPU-bound tasks still make progress in lower queues.

MLFQ needs anti-gaming rules. If a task voluntarily yields at 3.9 ms of a 4 ms quantum forever, a naive MLFQ keeps it in Q0. Real systems account for cumulative CPU usage across yields, periodically boost old tasks, and decay usage over time.

Linux CFS and Virtual Runtime

Linux CFS does not assign fixed time quanta in the old round-robin sense. It tracks each schedulable entity's virtual runtime, vruntime, measured in nanoseconds scaled by weight. Lower vruntime means the task has received less weighted CPU service, so it is chosen next.

The core update is:

Δv = Δt · 1024 / w

Here w is the task weight. Nice 0 has weight 1024 in Linux's standard table. Nice 5 has weight 335. If both run for 10 ms of real CPU time, the nice 0 task gains 10 ms of vruntime, while the nice 5 task gains about 10 · 1024/335 ≈ 30.57 ms. The lower-priority task's virtual time advances faster, so it is picked less often.

A simplified CFS loop looks like this:

struct sched_entity {
    uint64_t vruntime_ns;
    int weight;              // nice value mapped to a weight
};

void account(struct sched_entity *se, uint64_t ran_ns) {
    se->vruntime_ns += ran_ns * 1024 / se->weight;
}

struct sched_entity *pick_next(struct rb_root *runqueue) {
    return rb_leftmost(runqueue);      // smallest vruntime
}

The runqueue is a red-black tree ordered by vruntime. Picking the leftmost node is fast, and insertion after sleep or wakeup preserves ordering. Conceptually, if three equal-weight tasks have virtual runtimes 50, 55, and 60 ms, CFS runs the 50 ms task until it catches up, subject to latency and granularity limits.

Weights implement proportional share. With weights 1024 and 335, the CPU shares are:

1024/(1024 + 335) ≈ 0.7535 for nice 0 and 335/(1024 + 335) ≈ 0.2465 for nice 5.

Across a long CPU-bound interval, nice 0 gets about 75.35 percent of one CPU and nice 5 gets about 24.65 percent, ignoring kernel accounting details and migration.

Real-Time Classes and Multiprocessors

Linux also exposes real-time scheduling classes. SCHED_FIFO runs the highest-priority runnable real-time task until it blocks, yields, or is preempted by a higher-priority real-time task. SCHED_RR adds a round-robin quantum among tasks at the same real-time priority. SCHED_DEADLINE uses deadline scheduling with parameters commonly described as runtime, deadline, and period.

For a periodic task with runtime 2 ms and period 10 ms, its reserved utilization is 2/10 = 0.2 of a CPU. On one CPU, a basic admission test for deadline workloads requires total reserved utilization not to exceed 1.0. For three tasks with utilizations 0.2, 0.3, and 0.4, the sum is 0.9, so the reservation fits under that simple test. If a fourth task with utilization 0.2 arrives, the sum is 1.1 and it must be rejected or changed.

Multiprocessor scheduling adds placement. Linux uses per-CPU runqueues to avoid one global lock on every scheduling decision. Periodic load balancing moves tasks when one CPU has too much runnable load. Wakeup placement tries to keep a waking task near the CPU where it last ran because its cache lines and TLB entries might still be useful.

NUMA makes placement more expensive to get wrong. A task that repeatedly touches 4 GB of memory allocated on socket 0 should not bounce to socket 1 every few milliseconds. Modern schedulers track load, affinity, and NUMA locality, but they still trade locality against fairness. Keeping a task local can leave another core idle for a short time; moving it can reduce queueing but increase memory latency.

Key Result

Theorem

Shortest-Job-First Minimizes Average Turnaround for Simultaneous Arrivals

Statement

For a set of jobs with processing times p_1, ..., p_n, any nonpreemptive schedule that orders jobs by nondecreasing processing time minimizes average turnaround time.

Intuition

A long job placed before a short job adds its full length to the short job's completion time. Swapping them makes the long job wait only for the short length instead. The total completion time drops whenever a longer job precedes a shorter job.

Proof Sketch

Consider adjacent jobs i then j in a schedule, with shared prefix time P and p_i > p_j. Their contribution to total completion time is (P + p_i) + (P + p_i + p_j) = 2P + 2p_i + p_j. If swapped, the contribution is (P + p_j) + (P + p_j + p_i) = 2P + 2p_j + p_i. The swapped order is smaller by p_i − p_j > 0. Repeatedly remove inversions until processing times are nondecreasing. Each swap reduces total completion time, so the sorted order is optimal.

Why It Matters

SJF is the target that MLFQ tries to approximate from observed behavior. Short CPU bursts get fast service, while long CPU-bound work is pushed back without requiring an oracle.

Failure Mode

The theorem fails when job lengths are unknown, arrivals differ, preemption has cost, or jobs block for I/O. In those settings, shortest remaining processing time, feedback queues, or fairness policies replace pure SJF.

Common Confusions

Watch Out

Fairness is not the same as low latency

Two equal-weight CPU-bound tasks can receive exactly half the CPU over one second while an interactive task still waits 50 ms after each keystroke. Fairness is a share property over an interval. Latency is about when service occurs.

Watch Out

A smaller quantum is not always better

If the quantum is 50 microseconds and a context switch costs 5 microseconds, the machine spends about 9.09 percent of its time switching under saturation. Smaller quanta reduce worst-case wait but increase accounting, cache, and TLB costs.

Watch Out

Nice values are not percentages

Nice values map to weights, and weights determine shares relative to other runnable tasks. A nice 5 task does not get 5 percent of the CPU. Against one nice 0 task, it gets about 24.65 percent because the weights are 335 and 1024.

Exercises

Exercise · Core

Problem

Three jobs arrive at time 0 with CPU bursts A=8 ms, B=4 ms, and C=1 ms. Compute average turnaround and average response time for FIFO order A, B, C and for SJF.

Exercise · Core

Problem

A round-robin scheduler has quantum 2 ms and context-switch cost 0.1 ms. Four CPU-bound tasks are always runnable. Estimate the context-switch overhead fraction. What is the maximum time from a task losing the CPU to getting it again?

Exercise · Advanced

Problem

Two CFS tasks are CPU-bound. Task A has weight 1024, and task B has weight 335. Over a 100 ms interval on one CPU, compute their ideal CPU times and their vruntime increments.

References

Canonical:

  • Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau, Operating Systems: Three Easy Pieces (2018), ch. 7-10 and 48. Covers scheduling metrics, MLFQ, lottery scheduling, and multiprocessor scheduling.
  • Abraham Silberschatz, Peter B. Galvin, and Greg Gagne, Operating System Concepts (10th ed., 2018), ch. 5. Standard treatment of CPU scheduling algorithms and real-time scheduling.
  • Andrew S. Tanenbaum and Herbert Bos, Modern Operating Systems (4th ed., 2015), §2.4. Covers process scheduling, interactive scheduling, and policy trade-offs.
  • Marshall Kirk McKusick, George V. Neville-Neil, and Robert N. M. Watson, The Design and Implementation of the FreeBSD Operating System (2nd ed., 2014), ch. 4. Describes BSD process management and scheduler design.
  • Robert Love, Linux Kernel Development (3rd ed., 2010), ch. 4. Explains Linux process scheduling and CFS concepts.
  • Jane W. S. Liu, Real-Time Systems (2000), ch. 3 and ch. 6. Formal scheduling models for periodic and deadline-constrained tasks.

Accessible:

  • Linux kernel documentation, CFS Scheduler (Documentation/scheduler/sched-design-CFS.rst). Primary open documentation for CFS virtual runtime.
  • Linux kernel documentation, Deadline Task Scheduling (Documentation/scheduler/sched-deadline.rst). Describes SCHED_DEADLINE parameters and admission control.
  • Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau, Operating Systems: Three Easy Pieces, freely available online.

Next Topics

  • /computationpath/virtual-memory
  • /computationpath/processes-and-threads
  • /computationpath/filesystems
  • /computationpath/concurrency-primitives