Why This Matters
A 1 kHz scheduler tick interrupts each CPU every 1 ms. On a 64-core machine, that is 64,000 timer interrupts per second before counting NIC receive interrupts, disk completions, page faults, or inter-processor interrupts. Each entry into the kernel saves architectural state, runs a small amount of privileged code, and eventually returns with an instruction such as iretq.
A context switch can save registers in tens to hundreds of nanoseconds, but switching to a different address space can cost microseconds when the TLB is invalidated and refilled. At 50,000 switches per second, an extra 2 microseconds per switch consumes 0.1 CPU-seconds per wall-clock second on one core.
Core Definitions
Interrupt
An interrupt is a control transfer from the currently executing instruction stream to a privileged handler selected by an interrupt vector. Hardware interrupts are usually asynchronous with respect to the running program: a timer expires, a NIC receives a packet, a disk completes DMA, or another core sends an inter-processor interrupt.
Trap or Exception
A trap is a synchronous transfer caused by the instruction being executed. Examples include page faults, divide-by-zero exceptions, breakpoints, and system calls. The faulting instruction pointer is part of the saved state, so the kernel can restart, emulate, or report the event.
Trap Frame
A trap frame is the saved machine state used to return from the kernel to the interrupted context. It contains at least the program counter, status flags, stack pointer, and enough general-purpose registers to obey the kernel's calling convention and restart the interrupted code.
Context Switch
A context switch replaces the currently running thread or process with another runnable entity. A thread switch within one process changes registers and kernel scheduling state. A process switch also changes the active address space, often affecting the TLB and cache footprint.
Interrupts, Traps, and the Vector Table
The CPU does not poll every device in a loop. Devices signal an interrupt controller, and the controller presents an interrupt vector to a CPU. On x86 systems this is handled by the local APIC and I/O APIC, while ARM servers commonly use a Generic Interrupt Controller, or GIC. The vector is an integer index into a table of kernel entry points.
Typical sources are:
- Timer interrupt, used for accounting, preemption, and timeouts.
- NIC interrupt, used to report receive queues or transmit completions.
- Disk interrupt, used to report completed block I/O.
- Inter-processor interrupt, used by one CPU to ask another CPU to reschedule, flush a TLB entry, or run a cross-core function.
A synchronous trap has a different cause chain. For a page fault, the running instruction issues a load or store, the MMU misses or rejects the translation, and the CPU enters the page fault handler. A system call uses an instruction such as syscall on x86-64 or svc on AArch64 to request kernel service.
A minimal vector table model is:
typedef void (*handler_t)(struct trap_frame *tf);
handler_t vectors[256];
void dispatch(unsigned vector, struct trap_frame *tf) {
    handler_t h = vectors[vector];
    h(tf);
}
The real hardware path also changes privilege level, changes stacks if entering from user mode, masks some interrupt classes, and records an error code for selected exceptions. The vector dispatch idea remains the same.
A small event trace shows the distinction:
t=10.000000 ms user thread executes load [rdi]
t=10.000003 ms page fault trap vector enters kernel
t=10.000900 ms handler maps page, returns to same instruction
t=11.000000 ms user thread executes add rax, rbx
t=11.000001 ms timer interrupt arrives from local APIC
t=11.000120 ms scheduler chooses another thread
The page fault is tied to the load. The timer interrupt is not tied to the add; it simply arrived at the next architectural instruction boundary.
Trap Entry and Return: A Byte-Level Walk
A trap frame is a contract between assembly entry code and C kernel code. A simplified x86-64 trap frame might save these 64-bit fields after entering from user mode:
offset bytes field
0x00 8 r15
0x08 8 r14
0x10 8 r13
0x18 8 r12
0x20 8 r11
0x28 8 r10
0x30 8 r9
0x38 8 r8
0x40 8 rdi
0x48 8 rsi
0x50 8 rbp
0x58 8 rbx
0x60 8 rdx
0x68 8 rcx
0x70 8 rax
0x78 8 vector
0x80 8 error_code
0x88 8 rip
0x90 8 cs
0x98 8 rflags
0xa0 8 rsp
0xa8 8 ss
This frame is 176 bytes. The exact order differs by OS and architecture, but the same requirements hold: the kernel must preserve enough state to resume the interrupted execution or to deliver a signal with a correct user-visible register image.
An entry stub often has the shape below. The snippet omits hardware-specific descriptor setup, saves only the caller-saved registers (a common simplification, since the C compiler preserves the callee-saved ones across the call), and uses Intel-style assembly to show the data movement.
interrupt_entry:
    push rax
    push rcx
    push rdx
    push rbx
    push rbp
    push rsi
    push rdi
    push r8
    push r9
    push r10
    push r11
    mov rdi, rsp      ; first C argument: pointer to saved registers
    call common_trap
    pop r11
    pop r10
    pop r9
    pop r8
    pop rdi
    pop rsi
    pop rbp
    pop rbx
    pop rdx
    pop rcx
    pop rax
    iretq
If the interrupted code was in user mode, the CPU also switches to a kernel stack. This avoids trusting the user stack during privileged execution. On return, iretq restores the saved instruction pointer, code segment, flags, stack pointer, and stack segment.
For a system call, the control transfer is synchronous and intentional:
    mov rax, 1          ; syscall number: write on the Linux x86-64 ABI
    mov rdi, 1          ; fd 1, stdout
    lea rsi, [rel buf]  ; buffer
    mov rdx, 12         ; length
    syscall
The kernel entry code still needs a saved user program counter and flags. Unlike a device interrupt, the syscall path usually has a stable ABI for argument registers and return values.
Splitting Interrupt Work: Top Halves, Bottom Halves, and NAPI
Interrupt handlers must run quickly because they often execute with local interrupts disabled or with constraints on sleeping. A common split is top half and bottom half.
The top half acknowledges the device and records a small amount of state. The bottom half does heavier work later, such as packet processing, waking blocked tasks, or copying data into socket buffers. Linux names several deferral mechanisms, including softirqs, tasklets in older code, workqueues, and NAPI for networking.
A network receive path can look like this:
/* Top half: runs in hard interrupt context. */
irqreturn_t nic_irq(int irq, void *dev_id) {
    struct nic *n = dev_id;
    writel(INT_RX_ACK, n->mmio + INT_STATUS); // stop interrupt storm
    napi_schedule(&n->napi);                  // defer packet work
    return IRQ_HANDLED;
}
/* Bottom half: runs later, often through NET_RX softirq. */
int nic_poll(struct napi_struct *napi, int budget) {
    struct nic *n = container_of(napi, struct nic, napi);
    int done = 0;
    while (done < budget && rx_descriptor_ready(n)) {
        build_skb_from_rx_ring(n);
        done++;
    }
    if (done < budget)
        napi_complete_done(napi, done);
    return done;
}
NAPI changes interrupt behavior under load. The first packet causes an interrupt. The driver disables or suppresses further receive interrupts for that queue and polls descriptors up to a budget. If packets keep arriving, polling continues through softirq work. If the queue drains, interrupts are re-enabled. This reduces interrupt rate when traffic is high.
A numeric example: suppose a NIC receives 800,000 packets per second. One interrupt per packet would create 800,000 hard interrupts per second. If NAPI processes 64 packets per poll, the hard interrupt rate for receive notification can fall toward 800000 / 64 = 12500 per second while the kernel still processes every packet.
Softirq execution also connects to scheduling. If bottom-half work runs too long, it competes with user threads. Linux may move excess work to a kernel thread such as ksoftirqd, which makes the cost visible to the scheduler.
Context Switching and Preemption
A voluntary context switch happens when a thread blocks or yields. Examples include read() waiting for disk, futex() waiting for a lock, or a thread calling sched_yield(). An involuntary switch happens when preemption stops a runnable thread, often after a timer tick or scheduler interrupt.
The switch has several layers:
- Save callee-saved registers and kernel stack pointer of the old thread.
- Select a runnable thread.
- Load the new thread's kernel stack pointer and registers.
- If the address space changes, install a new page-table root.
- Return toward user mode through the new thread's trap frame.
A simplified kernel switch routine resembles:
struct context {
    uint64_t rsp;
    uint64_t rbx, rbp, r12, r13, r14, r15;
};

void switch_to(struct task *prev, struct task *next);

void schedule(void) {
    struct task *prev = current;
    struct task *next = pick_next_task();
    if (prev == next)
        return;
    current = next;
    switch_to(prev, next);
}
The register save is not the whole cost. If prev and next use different page tables, the CPU's cached virtual-to-physical translations may no longer match. Without address-space tags, loading a new page-table root flushes much of the TLB. With PCID on x86 or ASIDs on many other architectures, translations can be tagged by address space, reducing flushes across switches.
Cache effects are harder to price but often larger than the assembly path. Suppose thread A uses a 32 KiB L1 data cache footprint and thread B also uses 32 KiB. With 64-byte cache lines, each footprint is 512 lines. If B replaces most of A's lines, returning to A later causes hundreds of L1 misses. At 4 cycles for an L1 hit and 40 cycles for an L3 hit, replacing 400 lines costs about 400 * (40 - 4) = 14400 extra cycles, or 4.8 microseconds at 3 GHz, before DRAM misses.
Key Result
A context switch cost is the sum of a direct kernel path and an indirect locality penalty:

C_{\text{switch}} = C_{\text{save}} + C_{\text{sched}} + C_{\text{restore}} + C_{\text{mmu}} + C_{\text{cache}}

The first three terms are usually bounded by a short instruction path. C_{\text{mmu}} depends on whether the address space changes and whether the processor tags TLB entries. C_{\text{cache}} depends on the overlap between the old and new working sets.
A useful invariant for OS work is:

\text{lost CPU fraction per core} = \text{switches per second} \times C_{\text{switch}}
If a service causes 20,000 cross-process switches per second and each switch costs 3 microseconds including TLB and cache refill, the lost fraction on one core is 20000 * 0.000003 = 0.06, or 6 percent of a core. If PCID and locality reduce the average to 700 ns, the cost falls to 1.4 percent of a core.
For timer-driven preemption, the tick rate gives a lower bound on interrupt traffic. A 250 Hz tick on 32 CPUs creates 250 * 32 = 8000 timer interrupts per second. Tickless kernels reduce periodic interrupts when CPUs are idle, but device interrupts, IPIs, page faults, and syscalls still enter the same class of kernel mechanisms.
Common Confusions
A system call is not an asynchronous interrupt
A system call is synchronous. The user instruction requests kernel entry, and the kernel returns a result through the calling convention. A NIC receive interrupt is asynchronous. The running thread did not request it and may have no relation to the network socket that will receive the packet.
Saving registers is not the full context switch cost
Counting only push and pop instructions misses TLB refill, branch predictor effects, cache replacement, scheduler accounting, and deferred interrupt work. Two threads in the same process often switch faster than two processes with disjoint address spaces and disjoint cache footprints.
Bottom halves are not free background work
Softirqs and NAPI polling run on CPUs and consume scheduler-visible time. Under high packet rates, receive processing can displace user code even if the hard interrupt rate is low.
Exercises
Problem
A CPU receives a 1 kHz scheduler tick. The hard interrupt entry, accounting, and return path take 900 ns when no context switch occurs. On a 48-core machine, how much CPU time per wall-clock second is spent only on these timer interrupt paths?
Problem
Using the simplified trap frame layout above, give the byte offset of rax, rip, and the user rsp. If the trap frame starts at address 0xffff888000010000, what are the addresses of those fields?
Problem
A server performs 12,000 voluntary context switches per second within one process and 8,000 involuntary switches per second across two processes. Same-address-space switches cost 500 ns. Cross-address-space switches cost 2.5 microseconds without PCID and 1.1 microseconds with PCID. Compute the lost CPU fraction in both cases.
References
Canonical:
- Arpaci-Dusseau and Arpaci-Dusseau, Operating Systems: Three Easy Pieces (2018), ch. 4-12, 27-32, 36-43, 48, processes, scheduling, virtual memory, concurrency, and persistence context
- Silberschatz, Galvin, and Gagne, Operating System Concepts, 10th ed. (2018), ch. 1, 5, 9, interrupts, CPU scheduling, and virtual memory
- Tanenbaum and Bos, Modern Operating Systems, 4th ed. (2014), §1.3.4 and ch. 2-3, interrupt handling, processes, scheduling, and memory management
- Bovet and Cesati, Understanding the Linux Kernel, 3rd ed. (2005), ch. 3-4, 8, process switching, interrupts, exceptions, and memory management
- Intel, Intel 64 and IA-32 Architectures Software Developer's Manual, Vol. 3A (2025), ch. 4, 6, 8, paging, interrupts, exceptions, and APIC behavior
- Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 6th ed. (2017), ch. 2 and 5, processor performance, caches, and memory hierarchy costs
Accessible:
- Linux kernel documentation, NAPI, Softirqs, and Generic IRQ Handling, practical descriptions of deferred network and interrupt work
- Robert Love, Linux Kernel Development, 3rd ed. (2010), ch. 6-8, readable coverage of interrupts, bottom halves, and scheduling
- OSTEP online book chapters on processes, scheduling, and virtual memory, open-access companion to the canonical text
Next Topics
- /computationpath/processes-and-threads
- /computationpath/cpu-scheduling
- /computationpath/virtual-memory-and-tlbs
- /computationpath/networking-in-the-kernel