Why This Matters
A single counter++ on x86-64 usually compiles to a load, an add, and a store. Two threads executing those three steps on the same address can both read 41 and both store 42. The program ran two increments, but the memory cell records one.
Locks serialize access to shared state. They also move cache lines between cores, put threads to sleep, wake them later, and create deadlock if acquired in inconsistent orders. A lock that costs 30 ns when uncontended can cost microseconds under contention once scheduler and coherence traffic enter the path.
Core Definitions
Race condition
A race condition occurs when the result of a computation depends on the timing or interleaving of concurrent operations on shared state, and at least one of those operations is a write. If two schedules allowed by the machine produce different visible results, the program has a race on that state.
Critical section
A critical section is a region of code that accesses shared mutable state and must not overlap with another conflicting critical section. For a counter, the critical section is the read-modify-write sequence, more than just the final store.
Mutual exclusion
Mutual exclusion is the property that at most one thread is inside a given critical section at a time. A lock implements mutual exclusion when every path into the critical section first acquires the lock and every path out releases it.
Atomic read-modify-write
An atomic read-modify-write instruction reads a memory location, computes a new value, writes it back, and exposes the whole operation as one indivisible event to other cores. Test-and-set, fetch-and-add, and compare-and-swap are the common forms used for locks.
A Race on counter++
Consider this C++ program fragment with two threads calling inc() once.
int counter = 41;
void inc() {
counter++;
}
A typical unoptimized x86-64 shape is:
mov eax, DWORD PTR [counter] ; load
add eax, 1 ; compute
mov DWORD PTR [counter], eax ; store
One losing interleaving is:
| Step | Thread A | Thread B | counter in memory |
|---|---|---|---|
| 1 | load 41 into eax | | 41 |
| 2 | | load 41 into eax | 41 |
| 3 | add, eax = 42 | | 41 |
| 4 | | add, eax = 42 | 41 |
| 5 | store 42 | | 42 |
| 6 | | store 42 | 42 |
The expected value after two increments is 43, but the actual value is 42. The lost update is not a compiler bug. It is a missing synchronization edge.
A lock moves the race boundary:
#include <mutex>
int counter = 41;
std::mutex m;
void inc() {
std::lock_guard<std::mutex> g(m);
counter++;
}
The lock does not make counter++ atomic by itself. It prevents overlap among threads that obey the same protocol.
Atomic Instructions Used by Locks
Atomic instructions give software a small set of indivisible transitions.
Test-and-set writes 1 and returns the old value. A simple spinlock uses value 0 for unlocked and 1 for locked.
typedef struct {
volatile int v;
} tas_lock_t;
int test_and_set(volatile int *p) {
return __sync_lock_test_and_set(p, 1);
}
void lock(tas_lock_t *l) {
while (test_and_set(&l->v) == 1) {
;
}
}
void release_lock(tas_lock_t *l) {
__sync_lock_release(&l->v);
}
If l->v is 0, the winning thread atomically changes it to 1 and enters. If l->v is 1, the loser writes 1 again and keeps spinning. That repeated write matters for cache coherence.
Fetch-and-add returns the old value and increments the memory cell. It fits ticket locks.
typedef struct {
unsigned next;
unsigned owner;
} ticket_lock_t;
void ticket_lock(ticket_lock_t *l) {
unsigned my = __sync_fetch_and_add(&l->next, 1);
while (__atomic_load_n(&l->owner, __ATOMIC_ACQUIRE) != my) {
;
}
}
void ticket_unlock(ticket_lock_t *l) {
__atomic_fetch_add(&l->owner, 1, __ATOMIC_RELEASE);
}
If next = 7 and three threads arrive, they get tickets 7, 8, and 9. Only the thread whose ticket equals owner enters. FIFO order follows from the monotonic counters.
Compare-and-swap, usually CAS, changes *p from expected to desired only if the old value equals expected.
#include <atomic>
std::atomic<int> lockword{0};
void cas_lock() {
int expected;
do {
expected = 0;
} while (!lockword.compare_exchange_weak(
expected, 1,
std::memory_order_acquire,
std::memory_order_relaxed));
}
void cas_unlock() {
lockword.store(0, std::memory_order_release);
}
CAS is also the base primitive for many lock-free structures. Its return value tells the caller whether it won without hiding the old state.
Spinlocks, Mutexes, and Futexes
A spinlock busy-waits in user space. It works well when the lock holder runs on another core and the critical section is shorter than the cost of sleeping and waking a thread.
Suppose a critical section takes 80 ns. A Linux futex sleep plus wake path can cost several microseconds even without I/O. Spinning for 80 ns wastes cycles but avoids a kernel round trip. If the holder is descheduled for 5 ms, spinning burns a full time slice and delays other work.
A practical mutex often has a two-phase design. First it tries an atomic transition in user space. If the lock stays unavailable, it asks the kernel to sleep until the word changes. Linux exposes this pattern through futex.
A minimal futex-shaped mutex looks like this. Real library mutexes handle signals, timeouts, priority inheritance variants, and more states.
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
typedef struct {
atomic_int state; /* 0 unlocked, 1 locked */
} mini_mutex_t;
static int futex_wait(atomic_int *addr, int expected) {
return syscall(SYS_futex, addr, FUTEX_WAIT, expected, 0, 0, 0);
}
static int futex_wake(atomic_int *addr) {
return syscall(SYS_futex, addr, FUTEX_WAKE, 1, 0, 0, 0);
}
void mini_lock(mini_mutex_t *m) {
int expected = 0;
for (int i = 0; i < 100; i++) {
expected = 0;
if (atomic_compare_exchange_weak(&m->state, &expected, 1)) return;
}
while (atomic_exchange(&m->state, 1) == 1) {
futex_wait(&m->state, 1);
}
}
void mini_unlock(mini_mutex_t *m) {
atomic_store(&m->state, 0);
futex_wake(&m->state);
}
The invariant is simple: the lock word is the rendezvous variable. User space changes it with atomics; the kernel blocks a thread only if the word still has the expected value. That avoids a missed wake in the common race where the release occurs just before the wait syscall.
Reader-Writer, Ticket, and MCS Locks
A reader-writer lock allows many readers or one writer. It fits read-mostly state such as a routing table snapshot. It does not fit a hot counter, because every reader still touches shared lock metadata.
One compact layout packs a writer bit and a reader count:
| Bits | Meaning |
|---|---|
| bit 31 | writer present |
| bits 0 through 30 | reader count |
Example states for a 32-bit word:
| Hex word | Meaning |
|---|---|
| 0x00000000 | unlocked |
| 0x00000003 | three readers |
| 0x80000000 | one writer |
| 0x80000002 | invalid for a strict implementation |
A reader acquires by checking the writer bit and incrementing the low bits atomically. A writer acquires by changing zero to 0x80000000 with CAS. Policy matters. Reader preference can starve writers if readers keep arriving. Writer preference can delay read bursts.
Ticket locks give FIFO order with two counters. Their main cost is that all waiters poll the same owner cache line. On release, the owner counter changes, and every waiter observes invalidation or reload traffic.
MCS locks avoid that broadcast by giving each waiting thread a node.
typedef struct mcs_node {
_Atomic(struct mcs_node*) next;
_Atomic(int) locked;
} mcs_node_t;
typedef struct {
_Atomic(mcs_node_t*) tail;
} mcs_lock_t;
Acquire appends the thread's node to a linked queue using atomic exchange on tail. If there was a predecessor, the thread stores itself into pred->next and spins on its own locked field. Release either resets tail if there is no successor or clears the successor's locked flag. Under contention, each waiter spins on a different cache line, so release hands ownership to one core.
Cache-Coherence Costs and False Sharing
Coherent caches track ownership of cache lines, not C variables. On many machines a line is 64 bytes. Two unrelated locks in one line can interfere.
Suppose lockA is at address 0x1000 and lockB is at 0x1004. With 64-byte lines, both lie in the line covering 0x1000 through 0x103f.
struct bad {
atomic_int lockA; /* offset 0 */
atomic_int lockB; /* offset 4 */
};
If core 0 repeatedly writes lockA and core 1 repeatedly writes lockB, the whole line ping-pongs between the cores even though the variables differ. This is false sharing.
A padded layout separates them:
struct padded_lock {
atomic_int v;
char pad[60];
};
struct good {
struct padded_lock lockA; /* line 0 */
struct padded_lock lockB; /* line 1 */
};
The byte layout is:
| Offset range | Field |
|---|---|
| 0..3 | lockA.v |
| 4..63 | padding |
| 64..67 | lockB.v |
| 68..127 | padding |
Test-and-set spinlocks are worse than read-spinning locks because losers keep writing the same word. A test-test-and-set lock first reads until the word looks free, then attempts the write:
void ttas_lock(atomic_int *l) {
for (;;) {
while (atomic_load_explicit(l, memory_order_relaxed) == 1) {
;
}
int expected = 0;
if (atomic_compare_exchange_strong_explicit(
l, &expected, 1,
memory_order_acquire,
memory_order_relaxed)) return;
}
}
It still contends at release time, but it reduces write invalidations while the lock is held.
Deadlock and Lock Ordering
Deadlock is permanent waiting caused by a cycle of resource dependencies. Coffman's four conditions are the standard test:
- Mutual exclusion, at least one resource has exclusive ownership
- Hold and wait, a thread holds one resource while requesting another
- No preemption, resources are released only by their holders
- Circular wait, a cycle exists in the wait-for graph
The classic two-lock bug is small.
std::mutex A, B;
void t1() {
std::lock_guard<std::mutex> g1(A);
std::lock_guard<std::mutex> g2(B);
}
void t2() {
std::lock_guard<std::mutex> g1(B);
std::lock_guard<std::mutex> g2(A);
}
If t1 holds A and waits for B, while t2 holds B and waits for A, all four conditions hold.
Lock ordering removes circular wait. Assign every lock a rank and acquire locks in increasing rank. If rank(A) = 10 and rank(B) = 20, both functions must acquire A before B. Releases may happen in reverse order for local clarity; the rank rule constrains only acquisition.
Lock-free and wait-free do not mean free from bugs. A lock-free operation guarantees system-wide progress: among competing threads, some operation completes in a finite number of steps. A wait-free operation guarantees per-thread progress: each thread completes its operation in a finite number of its own steps. Wait-free is stronger and often costs more memory or copying.
The Model
For lock-based code, three invariants do most of the work.
First, the lock state has one owner or no owner. For mutual exclusion, at every instant:

count of threads inside the critical section <= 1

and the thread inside, if any, is the lock's owner.
Second, the lock acquire must provide an acquire ordering and release must provide a release ordering. In C++ terms, writes inside the critical section before the release operation become visible to a thread after a successful lock() on the same mutex, assuming the data is accessed under that mutex.
Third, contention cost is more than just instruction count. A crude model for the time to acquire a contended lock is:

T_acquire ≈ T_atomic + N_miss × T_coherence + T_sleep/wake

where T_atomic is the cost of the atomic instruction, N_miss counts coherence misses on the lock word, T_coherence is the per-miss transfer cost, and T_sleep/wake is the kernel blocking path if taken.
For an uncontended user-space lock, T_sleep/wake = 0 and N_miss is small. For a hot lock with 16 waiters on 16 cores, N_miss can dominate. Ticket locks bound unfairness but make every waiter observe the same owner word. MCS locks trade pointer bookkeeping for local spinning.
A useful engineering rule is to classify the lock by hold time and contention. Short hold time plus low contention favors a spin path. Long hold time or possible blocking inside the critical section requires a sleeping mutex. High contention on many cores favors queueing locks or redesign of the shared state.
Common Confusions
Atomic counter increment is not a substitute for a critical section
An atomic fetch_add fixes the lost update for one integer. It does not protect a compound invariant such as size == tail - head in a queue. If an operation updates data[i], tail, and size, the whole invariant needs one synchronization design.
Volatile is not a lock
In C and C++, volatile constrains some compiler optimizations for that object. It does not create mutual exclusion, does not make x++ atomic, and does not establish the acquire-release ordering expected from a mutex.
A fair lock can still be slow
Ticket locks are fair because tickets are served in order. Under heavy contention, all waiters poll one owner field. Fairness prevents starvation, but it does not remove coherence traffic.
Deadlock can happen with only two locks
Deadlock is not a large-system-only failure. Two mutexes and two threads are enough when acquisition order differs.
Exercises
Problem
Two threads run counter++ once. The increment compiles to load, add, store. Initial counter is 10. Give an interleaving that ends with 11, and give one interleaving that ends with 12.
Problem
A 64-byte cache line starts at address 0x2000. Fields x, y, and z are 4-byte atomics at offsets 0, 4, and 64 in a struct. Which fields share a cache line? Give the byte ranges.
Problem
A ticket lock has next = 40 and owner = 38. Threads A, B, and C call lock in that order. Then two unlocks occur. List each thread's ticket and say which thread owns the lock after the two unlocks.
Problem
Locks L1, L2, and L3 have ranks 10, 20, and 30. A code path holds L2 and then tries to acquire L1. State the rule violation and rewrite the acquisition order for a path that needs all three locks.
References
Canonical:
- Arpaci-Dusseau and Arpaci-Dusseau, Operating Systems: Three Easy Pieces (2018), ch. 26-32, covers threads, locks, condition variables, semaphores, and common concurrency bugs
- Silberschatz, Galvin, and Gagne, Operating System Concepts (10th ed., 2018), ch. 6-8, covers synchronization tools, deadlocks, and CPU scheduling context
- Hennessy and Patterson, Computer Architecture: A Quantitative Approach (6th ed., 2019), ch. 5, covers cache hierarchy, coherence, and multiprocessor behavior
- Herlihy and Shavit, The Art of Multiprocessor Programming (rev. 1st ed., 2012), ch. 2, 7, 8, covers mutual exclusion, spin locks, queue locks, and progress conditions
- Kerrisk, The Linux Programming Interface (2010), ch. 29-33 and 52, covers POSIX threads, mutexes, condition variables, and futex system-call context
Accessible:
- Arpaci-Dusseau and Arpaci-Dusseau, Operating Systems: Three Easy Pieces, free online chapters on threads and locks
- Ulrich Drepper, Futexes Are Tricky (2011), practical notes on futex races and wakeups
- Paul E. McKenney, Is Parallel Programming Hard, And, If So, What Can You Do About It?, open book on locking, memory ordering, and cache effects
Next Topics
- /computationpath/condition-variables
- /computationpath/semaphores-and-monitors
- /computationpath/memory-consistency-models
- /computationpath/cache-coherence