CPU and Machine Model · 40 min

Instruction Set Architectures (ISA)

The contract between software and hardware. RISC vs CISC, x86-64 vs ARM64 vs RISC-V, and how ISA choices shape every layer above.

Why This Matters

One x86-64 instruction such as addq $1, 8(%rdi) is 5 bytes and can read memory, add, and write memory. A comparable RV64I sequence needs a load, an add-immediate, and a store, usually 12 bytes without compressed instructions. That difference is visible to the compiler, the decoder, the instruction cache, the out-of-order core, and the performance counter stream.

An instruction set architecture is the boundary that lets the same compiled binary run across many implementations. Intel Skylake, AMD Zen, and a small in-order x86 core all implement x86-64, but their internal pipelines differ. The ISA says what state changes after an instruction retires; the microarchitecture chooses how to get there.

Core Definitions

Definition

Instruction Set Architecture

An instruction set architecture, or ISA, is the programmer-visible contract for a processor family. It specifies registers, instruction encodings, instruction semantics, addressing modes, data types, trap behavior, privilege transitions, and the memory consistency rules that constrain concurrent execution.

Definition

Microarchitecture

A microarchitecture is one implementation of an ISA. It includes the pipeline, caches, branch predictors, reorder buffers, execution units, decoders, register renaming, and load-store queues. Two processors may share an ISA while having different latency, throughput, cache size, and speculation behavior.

Definition

Load-Store Architecture

A load-store ISA restricts arithmetic and logical instructions to registers. Memory is accessed only by explicit load and store instructions. MIPS, classic ARM, AArch64, Power, and RISC-V follow this design.

Definition

Memory Consistency Model

A memory consistency model defines which results are permitted when multiple hardware threads access shared memory. It typically allows more reorderings than source program order but far fewer than arbitrary reordering. x86-64 specifies total store order (TSO), while ARMv8-A and RISC-V allow weaker behavior unless fences or acquire-release operations are used.

The ISA Contract

An ISA fixes five pieces of state and behavior: registers, instructions, addressing, memory ordering, and traps. Registers are the named storage locations visible to machine code. X86-64 exposes 16 general-purpose 64-bit registers, named rax through r15, plus flags and vector registers. AArch64 exposes 31 general-purpose 64-bit registers, x0 through x30; register number 31 encodes either the stack pointer sp or the zero register xzr, depending on the instruction. RV64I exposes 32 integer registers, x0 through x31, where x0 always reads as zero.

Instructions define state transitions. In AT&T syntax, this x86-64 instruction adds one to a 64-bit value in memory:

addq $1, 8(%rdi)

Its bytes are:

48 83 47 08 01

The byte layout is compact. 48 is a REX prefix selecting 64-bit operand size, 83 is the opcode for an immediate arithmetic operation on a register or memory operand, 47 is the ModR/M byte selecting (%rdi) with an 8-bit displacement, 08 is the displacement, and 01 is the immediate.

The ISA semantics are a read-modify-write transition. If rdi = 0x1000 and memory at 0x1008 holds 0x0000000000000029, then after retirement memory at 0x1008 holds 0x000000000000002a. Temporary internal values do not matter unless the instruction faults.
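The transition can be sketched in a few lines of Python. This is an illustration of the architectural semantics only, not an emulator: the dictionary-as-memory model and function name are invented for this sketch, and flags are omitted.

```python
# Model of the architectural transition for `addq $1, 8(%rdi)`:
# read 8 bytes at rdi + 8, add 1 with 64-bit wraparound, write back.
MASK64 = (1 << 64) - 1

def addq_imm_mem(mem, rdi, disp, imm):
    """Apply the read-modify-write semantics of addq $imm, disp(%rdi)."""
    addr = (rdi + disp) & MASK64
    mem[addr] = (mem[addr] + imm) & MASK64   # 64-bit wraparound; flags omitted
    return mem

mem = {0x1008: 0x29}            # the worked example: rdi = 0x1000, [0x1008] = 0x29
addq_imm_mem(mem, 0x1000, 8, 1)
print(hex(mem[0x1008]))          # 0x2a
```

Only the retired state transition is modeled; any internal temporaries a real core uses are invisible at this level, which is exactly the point.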

AArch64 uses fixed 32-bit instruction words. A comparable sequence is:

ldr x1, [x0, #8]
add x1, x1, #1
str x1, [x0, #8]

Every instruction is 4 bytes. The encoding for add x1, x1, #1 is 0x91000421, stored little-endian as:

21 04 00 91
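The field packing can be reproduced with a few shifts. The sketch below handles only the 64-bit ADD (immediate) form; the field layout follows the A64 encoding described in the Arm Architecture Reference Manual.

```python
import struct

def encode_add_imm64(rd, rn, imm12, shift=0):
    """Pack A64 ADD (immediate, 64-bit): sf | op | S | 100010 | sh | imm12 | Rn | Rd."""
    assert 0 <= imm12 < 4096
    word = 1 << 31                # sf = 1: 64-bit operand size
    word |= 0b100010 << 23        # ADD-immediate opcode bits (op = 0, S = 0)
    word |= shift << 22           # sh: shift imm12 left by 12 if set
    word |= imm12 << 10
    word |= rn << 5
    word |= rd
    return word

w = encode_add_imm64(rd=1, rn=1, imm12=1)    # add x1, x1, #1
print(hex(w))                                 # 0x91000421
print(struct.pack('<I', w).hex(' '))          # 21 04 00 91 (little-endian)
```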

Traps are part of the same contract. Division by zero, an illegal opcode, a page fault, or a system call transfers control to privileged code with defined machine state. On x86-64, syscall changes privilege level, saves the return instruction pointer in rcx, saves flags in r11, and jumps to the target address held in a model-specific register (IA32_LSTAR). On RISC-V, ecall records a cause code in a control and status register and transfers to a trap vector chosen by privilege state.

RISC Design

Reduced Instruction Set Computer design came from measuring compiler output and hardware cost. The common pattern is fixed-width encodings, few addressing modes, register-register arithmetic, and a register file large enough to keep local values out of memory. MIPS and early ARM made these rules explicit. AArch64, Power, and RISC-V keep the load-store core while adding vector, atomic, floating-point, and system extensions.

Consider the expression a[i] += 3 where a is an array of 32-bit integers, x10 holds a, and x11 holds i. RV64I code might be:

slli x12, x11, 2      # byte offset = i * 4
add  x12, x10, x12    # address = a + offset
lw   x13, 0(x12)      # load a[i]
addi x13, x13, 3
sw   x13, 0(x12)      # store a[i]

The instruction lw x13, 0(x12) has encoding 0x00062683, stored as:

83 26 06 00
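The I-type packing can be checked the same way. This sketch encodes only the I-type format; the field order is imm[11:0] | rs1 | funct3 | rd | opcode, as in the RISC-V unprivileged specification.

```python
import struct

def encode_itype(opcode, funct3, rd, rs1, imm):
    """Pack a RISC-V I-type instruction: imm[11:0] | rs1 | funct3 | rd | opcode."""
    assert -2048 <= imm < 2048
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode

# lw rd, imm(rs1) uses opcode 0b0000011 and funct3 0b010
w = encode_itype(opcode=0b0000011, funct3=0b010, rd=13, rs1=12, imm=0)
print(hex(w))                            # 0x62683
print(struct.pack('<I', w).hex(' '))     # 83 26 06 00 (little-endian)
```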

The fixed 4-byte length makes the next program counter easy to compute for non-control-flow instructions. Fetching eight instructions requires 32 bytes. Decode logic does not need to discover where the next instruction begins.

The cost is code size. A load-store ISA often spends more instruction bytes on address arithmetic and explicit loads. Compressed encodings reduce this cost. RISC-V C adds 16-bit encodings for common operations. The base integer ISA remains simple enough that a tiny implementation can ignore floating point, atomics, and compressed instructions.

RISC-V names this modularity in its ISA strings. RV64GC expands to RV64I + M + A + F + D + C (current specifications also fold the Zicsr and Zifencei extensions into G). RV64I is the 64-bit base integer ISA. M adds integer multiply and divide. A adds atomic memory operations. F and D add single- and double-precision floating point. C adds compressed 16-bit instructions. A Linux-capable general-purpose RISC-V system usually needs A for atomics and F and D for common application binary interfaces.
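The expansion is mechanical enough to script. This helper is a simplification: it treats G as shorthand for IMAFD and ignores the Z-prefixed extensions.

```python
def expand_riscv_isa(name):
    """Expand an ISA string like 'RV64GC' into single-letter extensions.
    Simplified: G is treated as IMAFD; Z* extensions are not handled."""
    base, exts = name[:4], name[4:]          # e.g. 'RV64' and 'GC'
    out = []
    for ch in exts:
        out.extend('IMAFD' if ch == 'G' else ch)
    return base, out

print(expand_riscv_isa('RV64GC'))   # ('RV64', ['I', 'M', 'A', 'F', 'D', 'C'])
```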

CISC Design and Modern x86

Complex Instruction Set Computer ISAs favor dense encodings, memory operands, and instructions that perform multi-step operations. X86 inherited 8086 encodings, added 32-bit mode, then x86-64. The result is variable-length instructions from 1 to 15 bytes, prefix bytes, ModR/M and SIB bytes, displacements, immediates, and many special cases. IBM z/Architecture also uses dense variable-length encodings and rich addressing.

The density matters. This x86-64 loop increments four adjacent 64-bit counters:

addq $1, 0(%rdi)
addq $1, 8(%rdi)
addq $1, 16(%rdi)
addq $1, 24(%rdi)

With a disp8 encoding, each instruction is 5 bytes, so the loop body is 20 bytes before branch overhead (an assembler may shrink the zero-displacement form to 4 bytes, giving 19). AArch64 needs a load, an add, and a store per counter: 12 instructions and 48 bytes before branch overhead. If the loop is limited by instruction-cache footprint, the dense encoding wins.
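The arithmetic is small enough to verify directly. The sizes below follow the uniform 5-byte disp8 encoding used in the text.

```python
# Code-size arithmetic for the four-counter loop body (branch overhead excluded).
# 5 bytes per x86-64 add-to-memory with a disp8 displacement;
# 4 bytes per AArch64 instruction, 3 instructions (ldr/add/str) per counter.
counters = 4
x86_bytes = counters * 5
a64_bytes = counters * 3 * 4
print(x86_bytes, a64_bytes)   # 20 48
```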

Modern x86 cores do not execute these instructions as monolithic hardwired operations. The front end decodes x86 instructions into internal micro-operations, often called µops. A read-modify-write addq commonly becomes a load µop, an ALU µop, and a store-address or store-data µop. The back end schedules those µops through reservation stations and execution ports much like a RISC-like dataflow machine.

This split explains a common performance surprise. A short x86 instruction is not always cheap. rep movsb has a tiny encoding and complex semantics, but modern cores contain special machinery for string moves. A different complex instruction may decode to several µops and consume front-end bandwidth. Performance engineering on x86 requires both ISA knowledge and microarchitecture data, such as latency tables and measured counters.

Memory Ordering in the ISA

The ISA also constrains how memory operations from one core become visible to another. Source code order is not the same as hardware visibility order. Store buffers, invalidate queues, and speculative loads all affect what other cores observe.

The standard store-buffering litmus test is:

// Initially x = 0, y = 0

// Thread 0
x = 1;
r1 = y;

// Thread 1
y = 1;
r2 = x;

Under sequential consistency, the outcome r1 == 0 && r2 == 0 is impossible. At least one load must see the other thread's store. Under x86-64 Total Store Order, that outcome is permitted because each core may read before its older store has drained from its local store buffer to the coherent memory system. TSO still preserves store-store order and load-load order for ordinary aligned memory operations.

ARMv8-A is weaker. It allows more reorderings unless code uses barriers such as dmb ish or acquire-release instructions such as ldar and stlr. RISC-V base memory ordering is also weak, with fence instructions and the A extension's acquire-release annotations giving the programmer and compiler ways to enforce ordering. This is an ISA contract issue, not merely a compiler issue: correct lock-free code must be written against both the language memory model and the target ISA model.
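Both claims about the litmus test can be checked with a small model. The interleaving enumeration and the per-thread store buffers below sketch the abstract machines, not real hardware.

```python
from itertools import permutations

# Thread programs for the store-buffering test: each thread stores 1 to
# one variable, then loads the other.
T0 = [('st', 'x'), ('ld', 'y')]
T1 = [('st', 'y'), ('ld', 'x')]

def sc_outcomes():
    """Run every sequentially consistent interleaving that preserves each
    thread's program order, collecting the observed (r1, r2) pairs."""
    results = set()
    # Each permutation assigns the 4 global execution slots to threads 0/1.
    for slots in permutations([0, 0, 1, 1]):
        mem, regs, idx = {'x': 0, 'y': 0}, {}, [0, 0]
        for t in slots:
            op, var = (T0, T1)[t][idx[t]]
            idx[t] += 1
            if op == 'st':
                mem[var] = 1
            else:
                regs[t] = mem[var]
        results.add((regs[0], regs[1]))
    return results

print(sc_outcomes())            # (0, 0) never appears under SC

def tso_with_buffered_stores():
    """Model the TSO outcome where both loads execute while each thread's
    store still sits in its private store buffer."""
    mem = {'x': 0, 'y': 0}
    buf0, buf1 = {'x': 1}, {'y': 1}       # stores not yet drained to memory
    r1 = buf0.get('y', mem['y'])          # T0 forwards only from its own buffer
    r2 = buf1.get('x', mem['x'])
    return r1, r2

print(tso_with_buffered_stores())          # (0, 0) is permitted under TSO
```

The enumeration confirms that sequential consistency yields only (1, 1), (0, 1), and (1, 0), while the buffered-store scenario produces the (0, 0) result that TSO allows.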

Benchmarking ISA Claims

ISA comparisons are easy to overstate. The CPU time equation separates work, microarchitecture, and clock rate:

\text{CPU time} = \text{instruction count} \times \text{CPI} \times \text{cycle time}

An ISA can change instruction count and code size. A microarchitecture changes CPI and cycle time. A compiler changes all three. A benchmark result comparing x86-64 and AArch64 machines is also comparing cache hierarchy, branch predictor, memory bandwidth, process node, thermal limit, compiler flags, and operating system noise.
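A worked example makes the trade-off concrete. The instruction counts, CPI values, and clock rates below are invented for illustration, not measurements of any real machine.

```python
# CPU time = instruction count x CPI x cycle time (= count x CPI / clock rate).
def cpu_time(instructions, cpi, clock_hz):
    return instructions * cpi / clock_hz

# Machine A: fewer, denser instructions but higher CPI.
t_a = cpu_time(instructions=8e9, cpi=1.2, clock_hz=3.0e9)
# Machine B: more instructions, lower CPI, slower clock.
t_b = cpu_time(instructions=10e9, cpi=0.8, clock_hz=2.5e9)

print(f"A: {t_a:.2f} s, B: {t_b:.2f} s")   # A: 3.20 s, B: 3.20 s
```

Here the two machines tie despite a 25% difference in instruction count: neither factor alone determines the winner.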

SPEC CPU is designed to reduce some of this noise with fixed workloads and reporting rules. It still measures full systems, not pure ISAs. Geekbench mixes kernels and produces a single score that is convenient for rough comparison but poor for explaining a performance cause. For ML engineers, the most useful benchmark is often a narrowed workload with counters: retired instructions, cycles, branch misses, L1 misses, last-level cache misses, vector width used, and memory bandwidth.

Key Result

The ISA is an abstraction boundary with two invariants.

First, architectural state after each retired instruction must match the ISA semantics. Let S_i be the architectural state after instruction i in program order. For instruction semantics f_i, correct execution requires:

S_{i+1} = f_i(S_i)

Out-of-order execution, speculative loads, µops, register renaming, and branch prediction are allowed only if retired architectural state matches this sequence, except where the ISA explicitly permits relaxed memory observations between cores.

Second, performance is not an ISA scalar. For two machines A and B running equivalent programs:

\frac{T_A}{T_B} = \frac{I_A}{I_B} \times \frac{\text{CPI}_A}{\text{CPI}_B} \times \frac{\tau_A}{\tau_B}

Here I is retired instruction count and τ is cycle time. RISC code may have higher I and lower decode cost. CISC code may have lower I and higher front-end cost. Either can win on a given workload.

Common Confusions

Watch Out

Thinking CISC instructions execute directly as written

On modern x86, many architectural instructions are decoded into µops before scheduling. The ISA instruction is still the unit visible to exceptions, disassembly, and binary compatibility. The core's scheduler usually works on internal µops. Confusing these two levels leads to wrong claims about why an instruction is fast or slow.

Watch Out

Equating RISC with faster

RISC describes an ISA style, not a speed guarantee. A fixed-width load-store ISA simplifies parts of fetch and decode, but a slow cache, weak branch predictor, or narrow execution core can dominate runtime. The CPU time equation must include instruction count, CPI, and cycle time.

Watch Out

Treating benchmarks as ISA measurements

SPEC CPU, Geekbench, and application benchmarks measure full machines. If one laptop beats another on a benchmark, the cause might be cache size, memory bandwidth, turbo policy, compiler version, or vector library selection. ISA effects are present, but not isolated without controlled experiments and counters.

Exercises

Exercise · Core

Problem

Compare code size for incrementing a 64-bit integer at address p + 8. Use x86-64 addq $1, 8(%rdi) and AArch64 ldr x1, [x0, #8]; add x1, x1, #1; str x1, [x0, #8]. Assume x86 bytes are 48 83 47 08 01 and each AArch64 instruction is 4 bytes. Give the byte totals and the memory state transition when the old value is 41.

Exercise · Core

Problem

Expand RV64GC into its named RISC-V extensions. For each component, state whether it is required for integer add, integer multiply, atomic compare-and-swap style code, double-precision floating point, or 16-bit compressed encodings.

Exercise · Advanced

Problem

For the store-buffering test with initial x = 0 and y = 0, explain why r1 = 0 and r2 = 0 is forbidden under sequential consistency but permitted under x86-64 TSO.

References

Canonical:

  • Hennessy and Patterson, Computer Architecture: A Quantitative Approach (6th ed., 2017), ch. 1, quantitative design and performance equations
  • Hennessy and Patterson, Computer Architecture: A Quantitative Approach (6th ed., 2017), ch. 2, memory hierarchy interactions with instruction fetch and data access
  • Hennessy and Patterson, Computer Architecture: A Quantitative Approach (6th ed., 2017), ch. 3, instruction-level parallelism and out-of-order execution
  • Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective (3rd ed., 2016), ch. 3, x86-64 machine-level programming
  • Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective (3rd ed., 2016), ch. 4, processor architecture and sequential ISA execution
  • Intel, Intel 64 and IA-32 Architectures Software Developer's Manual, Vol. 1, ch. 3, basic execution environment
  • Arm, Arm Architecture Reference Manual for A-profile Architecture, sections on A64 instruction set and memory ordering
  • RISC-V International, The RISC-V Instruction Set Manual, Volume I: Unprivileged ISA, chapters 1-2 and extension chapters for M, A, F, D, and C

Accessible:

  • Patterson and Waterman, The RISC-V Reader
  • Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective, associated CMU lecture materials
  • Agner Fog, The microarchitecture of Intel, AMD and VIA CPUs, sections on x86 decoding and µops

Next Topics

  • /computationpath/cpu-pipelines
  • /computationpath/memory-hierarchy
  • /computationpath/out-of-order-execution
  • /computationpath/memory-consistency-models
  • /computationpath/performance-counters