Sockets Programming Essentials

Why This Matters

A TCP server with one blocking accept() and one blocking recv() can stop serving new clients when a single client connects and sends no bytes. The kernel is doing the right thing: it parks the thread until the requested event occurs. The program is wrong if it assumes one file descriptor means one complete request.

Most production networking bugs near the socket API are small boundary mistakes. A write of 4096 bytes can return 1460. A read can split GET /x\r\n\r\n across two calls. A restarted server can fail with EADDRINUSE because earlier connections are in TIME_WAIT. These are not rare cases; they are the API contract.

Core Definitions

Definition

socket file descriptor

A socket file descriptor is an integer handle returned by socket(domain, type, protocol). On Unix-like systems it participates in the file descriptor table, so read, write, close, fcntl, and readiness APIs can operate on it. The descriptor names an endpoint object in the kernel, not a packet or a connection by itself.

Definition

address family and socket type

AF_INET selects IPv4 addresses and AF_INET6 selects IPv6 addresses. SOCK_STREAM selects a reliable byte stream, normally TCP for Internet sockets. SOCK_DGRAM selects datagrams, normally UDP. The address family describes how addresses are represented; the socket type describes the service model.

Definition

blocking socket

A blocking socket operation parks the calling thread until it can make progress, completes with an error, or is interrupted by a signal. recv() on a blocking TCP socket with no buffered bytes waits. send() on a blocking TCP socket can wait when the kernel send buffer is full.

Definition

readiness notification

A readiness API reports that a descriptor is likely to accept an operation without blocking. select, poll, epoll, and kqueue are readiness interfaces. They do not promise that a full application message is available.

The Socket Call and Address Bytes

The call

int fd = socket(AF_INET, SOCK_STREAM, 0);

asks the kernel for an IPv4 TCP endpoint. The 0 protocol lets the kernel choose the default protocol for the family and type. For AF_INET with SOCK_STREAM, that is TCP. For AF_INET with SOCK_DGRAM, it is UDP.

A server usually fills struct sockaddr_in before bind():

struct sockaddr_in a = {0};
a.sin_family = AF_INET;
a.sin_port = htons(8080);
a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);  // 127.0.0.1

On a little-endian Linux machine, this 16-byte structure for 127.0.0.1:8080 is commonly laid out as:

offset  bytes                         meaning
0       02 00                         AF_INET, host endian 2
2       1f 90                         TCP port 8080, network byte order
4       7f 00 00 01                   IPv4 address 127.0.0.1
8       00 00 00 00 00 00 00 00       padding sin_zero

The mixed byte order is intentional. sin_family is local ABI data. sin_port and sin_addr are protocol fields and use network byte order, most significant byte first. The port calculation is $8080 = \texttt{0x1f90}$, so the bytes on the wire are 1f 90, not 90 1f.
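The layout can be checked by dumping the structure byte by byte. The following is a minimal sketch, assuming a Linux-style sockaddr_in of 16 bytes; the helper names fill_loopback and dump_sockaddr_in are illustrative and not part of any library.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build the sockaddr_in for 127.0.0.1:port. */
static struct sockaddr_in fill_loopback(uint16_t port) {
    struct sockaddr_in a;
    memset(&a, 0, sizeof a);
    a.sin_family = AF_INET;                      /* host byte order */
    a.sin_port = htons(port);                    /* network byte order */
    a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);  /* network byte order */
    return a;
}

/* Print the raw bytes of the structure, one hex pair per byte. */
static void dump_sockaddr_in(const struct sockaddr_in *a) {
    const unsigned char *p = (const unsigned char *)a;
    for (size_t i = 0; i < sizeof *a; i++)
        printf("%02x%c", p[i], i + 1 == sizeof *a ? '\n' : ' ');
}
```

Calling dump_sockaddr_in() on the result of fill_loopback(8080) should show 1f 90 at offsets 2 and 3 regardless of host endianness, because htons() already put the port in network order. Only the sin_family bytes depend on the host.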

For IPv6, use AF_INET6 and struct sockaddr_in6. In new code, prefer getaddrinfo(): it resolves host names and service names and returns a list of ready-to-use socket addresses that can cover both IPv4 and IPv6, so the caller does not hard-code structure types or sizes.
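A typical getaddrinfo() call walks the returned list and uses the first entry for which socket() succeeds. This is a minimal sketch under that assumption; open_first is a hypothetical helper name, and real code would usually go on to bind() or connect() inside the loop.

```c
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: create a stream socket for the first usable address.
 * Returns a descriptor (not yet bound or connected), or -1 on failure.
 * The chosen address is copied to *out and its length to *outlen. */
static int open_first(const char *host, const char *service,
                      struct sockaddr_storage *out, socklen_t *outlen) {
    struct addrinfo hints, *res, *ai;
    int fd = -1;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;      /* accept IPv4 or IPv6 results */
    hints.ai_socktype = SOCK_STREAM;  /* byte-stream service, normally TCP */

    if (getaddrinfo(host, service, &hints, &res) != 0) return -1;

    for (ai = res; ai != NULL; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0) continue;         /* try the next candidate */
        memcpy(out, ai->ai_addr, ai->ai_addrlen);
        *outlen = ai->ai_addrlen;
        break;
    }
    freeaddrinfo(res);
    return fd;
}
```

Because the family, type, protocol, and address length all come from the result list, the same code path serves IPv4 and IPv6 without mentioning sockaddr_in or sockaddr_in6.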

Server Lifecycle: bind, listen, accept

A TCP server goes through a fixed sequence.

int fd = socket(AF_INET, SOCK_STREAM, 0);
bind(fd, (struct sockaddr *)&addr, sizeof addr);
listen(fd, 128);
int cfd = accept(fd, (struct sockaddr *)&peer, &peerlen);

bind() assigns a local address and port to the listening socket. If the port is 0, the kernel chooses an ephemeral port. listen() marks the socket as passive and creates kernel queues for connection setup and completed connections. The backlog argument, such as 128, is not a portable exact queue length. It is a hint for how many completed connections may wait in the queue until accept() removes them, and on Linux it is additionally capped by net.core.somaxconn.

accept() returns a new connected socket. The listening descriptor remains open and continues to accept more connections. This distinction matters:

fd  = 3  listening socket, local 0.0.0.0:8080
cfd = 4  connected socket, local 127.0.0.1:8080, peer 127.0.0.1:51342
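The distinction between the listening and the connected descriptor can be observed in a single process. The sketch below assumes a working loopback interface; loopback_trio is a hypothetical demo function. It binds to port 0, reads the kernel-chosen port back with getsockname(), connects to itself, and accepts.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical demo: on success fds[0] = listener, fds[1] = client,
 * fds[2] = accepted socket. Returns the kernel-chosen port, or -1. */
static int loopback_trio(int fds[3]) {
    struct sockaddr_in a;
    socklen_t len = sizeof a;

    memset(&a, 0, sizeof a);
    a.sin_family = AF_INET;
    a.sin_port = htons(0);                       /* let the kernel pick */
    a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    fds[0] = socket(AF_INET, SOCK_STREAM, 0);
    fds[1] = socket(AF_INET, SOCK_STREAM, 0);
    fds[2] = -1;
    if (fds[0] < 0 || fds[1] < 0) goto fail;
    if (bind(fds[0], (struct sockaddr *)&a, sizeof a) < 0) goto fail;
    if (listen(fds[0], 8) < 0) goto fail;
    /* Read the address back to learn the ephemeral port. */
    if (getsockname(fds[0], (struct sockaddr *)&a, &len) < 0) goto fail;
    /* The kernel completes the handshake into the listen queue,
     * so a blocking connect() can return before accept() is called. */
    if (connect(fds[1], (struct sockaddr *)&a, sizeof a) < 0) goto fail;
    fds[2] = accept(fds[0], NULL, NULL);         /* a new descriptor */
    if (fds[2] < 0) goto fail;
    return ntohs(a.sin_port);

fail:
    for (int i = 0; i < 3; i++)
        if (fds[i] >= 0) close(fds[i]);
    return -1;
}
```

After a successful call, getpeername() on fds[2] should report the same port that getsockname() reports on fds[1], while fds[0] remains a listener with no peer at all.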

Set SO_REUSEADDR before bind() during development and for many TCP servers:

int yes = 1;
setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof yes);

This option reduces pain from TIME_WAIT sockets when a process restarts and binds the same local port. It is not a license to run two unrelated active listeners on the same local address and port. The exact behavior varies by OS, but the safe habit is stable: set it before bind().

Client Lifecycle: socket and connect

A TCP client usually needs only socket() and connect().

int fd = socket(AF_INET, SOCK_STREAM, 0);
connect(fd, (struct sockaddr *)&server, sizeof server);

If the client does not call bind(), the kernel picks a source address and ephemeral source port. A completed connection is identified by the 4-tuple:

client ip      192.0.2.10
client port    51342
server ip      198.51.100.7
server port    443

Two clients on the same host can connect to the same server port because their local ephemeral ports differ, so their 4-tuples differ.

A TCP byte stream has no message boundaries. If the client writes 5 bytes and then 7 bytes, the server might read 12 bytes once, 3 then 9 bytes, or any split that preserves byte order.
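The loss of write boundaries can be seen without a network. The sketch below is an illustration only: it assumes a Unix-domain stream socketpair() as a stand-in for TCP (both are byte streams), and two_writes_one_stream is a hypothetical name. It performs two writes and then reads until the requested number of bytes has arrived, counting recv() calls along the way.

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo: write 5 bytes, then 7 bytes, then read `want` bytes
 * (at most 12) into out. Stores the number of recv() calls in *calls.
 * Returns 0 on success, -1 on error or early EOF. */
static int two_writes_one_stream(char *out, size_t want, int *calls) {
    int sv[2];
    int rc = 0;
    size_t got = 0;

    *calls = 0;
    if (want > 12) want = 12;        /* only 12 bytes are ever written */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) return -1;

    /* Two separate writes on the sending side. */
    if (write(sv[0], "hello", 5) != 5 || write(sv[0], " world!", 7) != 7)
        rc = -1;

    /* The reader cannot assume one recv() per write; it accumulates. */
    while (rc == 0 && got < want) {
        ssize_t n = recv(sv[1], out + got, want - got, 0);
        if (n <= 0) { rc = -1; break; }
        got += (size_t)n;
        (*calls)++;
    }
    close(sv[0]);
    close(sv[1]);
    return rc;
}
```

The number of recv() calls is deliberately not fixed by the code. The only property the reader relies on is that the 12 bytes arrive in the order they were written.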

send, recv, read, and write

On Unix-like systems, read(fd, buf, n) and write(fd, buf, n) work on stream sockets. recv() and send() add socket-specific flags, such as MSG_DONTWAIT, MSG_NOSIGNAL, and MSG_PEEK.

Use loops for fixed-size records. A correct sender treats a positive short return as progress, not failure:

#include <sys/socket.h>
#include <errno.h>
#include <stddef.h>

int send_all(int fd, const void *buf, size_t n) {
    const char *p = (const char *)buf;
    while (n > 0) {
        ssize_t k = send(fd, p, n, MSG_NOSIGNAL);
        if (k > 0) { p += k; n -= (size_t)k; continue; }
        if (k < 0 && errno == EINTR) continue;
        return -1;
    }
    return 0;
}

A numeric example shows why the loop is mandatory. Suppose the user buffer has 4096 bytes. The first send() returns 1460 because the peer window and local send buffer accept one segment. The second returns 1460. The third returns 1176. The correct total is $1460 + 1460 + 1176 = 4096$. Treating the first return as completion silently drops 2636 bytes at the application layer.

Reads have a second boundary: recv() returning 0 on a TCP stream means orderly shutdown by the peer. It is not an empty message.
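The receiving counterpart of send_all can be sketched the same way. This is one possible shape, not a canonical API: recv_exact is a hypothetical helper that distinguishes "all bytes arrived", "peer closed early", and "error" with three return values.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: read exactly n bytes from a blocking stream socket.
 * Returns 1 on success, 0 if the peer shut down before n bytes arrived,
 * -1 on error (errno is left as set by recv). */
static int recv_exact(int fd, void *buf, size_t n) {
    char *p = (char *)buf;
    while (n > 0) {
        ssize_t k = recv(fd, p, n, 0);
        if (k > 0) { p += k; n -= (size_t)k; continue; }  /* short read is progress */
        if (k == 0) return 0;             /* orderly shutdown, not an empty message */
        if (errno == EINTR) continue;     /* interrupted before any data: retry */
        return -1;
    }
    return 1;
}
```

Keeping EOF separate from errors lets the caller decide whether a connection that closes between records is normal, while one that closes in the middle of a record is a protocol failure.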

Worked Example: Minimal TCP Echo Server

This server accepts one client at a time. It is intentionally blocking. The loops around recv() and send() are the part to keep.

// cc -Wall -Wextra -O2 echo.c -o echo
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <stdio.h>

static int send_all(int fd, const char *p, size_t n) {
    while (n) {
        ssize_t k = send(fd, p, n, MSG_NOSIGNAL);
        if (k > 0) { p += k; n -= (size_t)k; continue; }
        if (k < 0 && errno == EINTR) continue;
        return -1;
    }
    return 0;
}

int main(void) {
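    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0) { perror("socket"); return 1; }

    // Allow a quick restart while old connections sit in TIME_WAIT.
    int yes = 1;
    if (setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof yes) < 0)
        perror("setsockopt");  // not fatal for this demo

    struct sockaddr_in a = {0};
    a.sin_family = AF_INET;
    a.sin_port = htons(8080);
    a.sin_addr.s_addr = htonl(INADDR_ANY);

    if (bind(s, (struct sockaddr *)&a, sizeof a) < 0) { perror("bind"); return 1; }
    if (listen(s, 128) < 0) { perror("listen"); return 1; }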

    for (;;) {
        int c = accept(s, NULL, NULL);
        if (c < 0) continue;
        char buf[4096];
        for (;;) {
            ssize_t n = recv(c, buf, sizeof buf, 0);
            if (n > 0) {
                if (send_all(c, buf, (size_t)n) < 0) break;
            } else if (n == 0) {
                break;
            } else if (errno != EINTR) {
                break;
            }
        }
        close(c);
    }
}

Test with nc 127.0.0.1 8080. This program serializes clients. If one client connects and never sends data, the process waits inside recv() and does not call accept() again.

Blocking, Non-Blocking, and Readiness APIs

A non-blocking descriptor returns -1 with errno == EAGAIN or EWOULDBLOCK when an operation would block.

#include <fcntl.h>
int flags = fcntl(fd, F_GETFL, 0);
fcntl(fd, F_SETFL, flags | O_NONBLOCK);
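The same two fcntl() calls with error checks, plus a small check that an empty non-blocking socket really reports "would block", can be sketched as follows. The function names are hypothetical, and a Unix-domain socketpair() is assumed as a convenient local stream socket.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: add O_NONBLOCK while preserving other status flags. */
static int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Returns 1 if recv() on an empty non-blocking socket failed with
 * EAGAIN or EWOULDBLOCK, 0 if it did anything else, -1 on setup failure. */
static int empty_recv_would_block(void) {
    int sv[2];
    char c;
    int result = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) return -1;
    if (set_nonblocking(sv[0]) == 0) {
        ssize_t n = recv(sv[0], &c, 1, 0);   /* nothing was ever written */
        result = (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK));
    }
    close(sv[0]);
    close(sv[1]);
    return result;
}
```

Portable code checks both error names, since POSIX allows EAGAIN and EWOULDBLOCK to be distinct values even though many systems define them as the same number.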

select() uses fixed-size bit sets and is limited by FD_SETSIZE in typical builds. poll() uses an array of struct pollfd and avoids bit-set limits, but each call still passes the full array to the kernel.

Linux epoll separates registration from waiting. A common shape is:

int ep = epoll_create1(EPOLL_CLOEXEC);
struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);

for (;;) {
    struct epoll_event out[64];
    int n = epoll_wait(ep, out, 64, -1);
    for (int i = 0; i < n; i++) handle(out[i].data.fd);
}

Level-triggered epoll repeats an event while the condition remains true. If a socket has 100 bytes buffered and your handler reads 20, the next epoll_wait() can report it again.

Edge-triggered epoll reports transitions, such as “not readable” to “readable.” With EPOLLET, handlers must drain until EAGAIN:

for (;;) {
    ssize_t n = recv(fd, buf, sizeof buf, 0);
    if (n > 0) consume(buf, (size_t)n);
    else if (n == 0) { close(fd); break; }
    else if (errno == EAGAIN || errno == EWOULDBLOCK) break;
    else if (errno != EINTR) { close(fd); break; }
}
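The difference between the two trigger modes can be measured with a partial read. The following Linux-only sketch is an illustration under stated assumptions: it uses a Unix-domain socketpair() in place of TCP, and reports_after_partial_read is a hypothetical name. It buffers 100 bytes, reads 20, and counts how many of two consecutive epoll_wait() calls report the socket.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo: pass 0 for level-triggered, EPOLLET for edge-triggered.
 * Returns how many of two epoll_wait() calls reported the socket (0..2),
 * or -1 on setup failure. */
static int reports_after_partial_read(uint32_t extra_flags) {
    int sv[2], ep, first, second, reports = -1;
    char data[100] = {0}, buf[20];
    struct epoll_event ev, out;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) return -1;
    ep = epoll_create1(EPOLL_CLOEXEC);
    if (ep < 0) goto done;

    ev.events = EPOLLIN | extra_flags;
    ev.data.fd = sv[1];
    if (epoll_ctl(ep, EPOLL_CTL_ADD, sv[1], &ev) < 0) goto done;
    if (write(sv[0], data, sizeof data) != (ssize_t)sizeof data) goto done;

    first = epoll_wait(ep, &out, 1, 0);          /* 100 bytes are buffered */
    if (first != 1) goto done;
    if (recv(sv[1], buf, sizeof buf, 0) != (ssize_t)sizeof buf) goto done;
    second = epoll_wait(ep, &out, 1, 0);         /* 80 bytes remain unread */
    if (second < 0) goto done;
    reports = first + second;

done:
    if (ep >= 0) close(ep);
    close(sv[0]);
    close(sv[1]);
    return reports;
}
```

With level-triggered registration the second wait should report the socket again because data is still buffered. With EPOLLET it should not, because no new data arrived after the first report. That missing second report is why an edge-triggered handler that stops after one chunk can stall.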

BSD kqueue uses filters such as EVFILT_READ and returns events through kevent(). Windows IOCP is a completion model rather than Unix readiness. The caller posts operations and later receives completed operations. Detailed Windows async I/O is outside this page.

io_uring is Linux’s modern submission and completion ring interface. User space writes submission queue entries, calls io_uring_enter() when needed, and later reads completion queue entries. For sockets, it can submit accepts, connects, sends, receives, and poll-style waits. The mental model is different from epoll: the program usually tracks operations in flight, not only descriptors that are ready.

Key Result

A correct stream-socket program maintains two invariants.

First, application framing is separate from TCP. If the protocol says each record has a 4-byte big-endian length followed by payload, the receiver must first read exactly 4 bytes, decode length $L$, then read exactly $L$ bytes. TCP does not preserve write boundaries.

Second, readiness is not completion. If an API reports readable, one recv() may return 1 byte, 4096 bytes, EAGAIN, or 0, depending on races, buffering, and peer shutdown. The handler is correct only if it accepts all of those outcomes.

For a length-prefixed record with payload length 300, the bytes start as:

00 00 01 2c  ...300 payload bytes...

A possible receive sequence is:

recv #1: 2 bytes   00 00
recv #2: 5 bytes   01 2c 48 65 6c
recv #3: 297 bytes remaining

The parser state after the second call is “length known as 300, payload has 3 bytes.” Treating the second receive as a complete record corrupts the stream.
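The parser state described above can be held in a small structure that is fed one recv() chunk at a time. This is a sketch, not a reference implementation: frame_parser, frame_feed, and the MAX_PAYLOAD cap are all hypothetical names and limits chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_PAYLOAD 4096      /* assumed cap, chosen for this example */

/* Zero-initialize before the first call and again after each record. */
struct frame_parser {
    unsigned char hdr[4];
    size_t hdr_have;          /* header bytes collected so far, 0..4 */
    uint32_t len;             /* decoded length, valid once hdr_have == 4 */
    unsigned char payload[MAX_PAYLOAD];
    size_t payload_have;      /* payload bytes collected so far */
};

/* Feed one chunk. Returns 1 when a full record is ready, 0 if more bytes
 * are needed, -1 if the announced length exceeds MAX_PAYLOAD.
 * *used reports how many input bytes were consumed. */
static int frame_feed(struct frame_parser *fp, const unsigned char *in,
                      size_t n, size_t *used) {
    size_t i = 0;

    while (i < n && fp->hdr_have < 4) {
        fp->hdr[fp->hdr_have++] = in[i++];
        if (fp->hdr_have == 4) {
            /* Big-endian decode: most significant byte first. */
            fp->len = ((uint32_t)fp->hdr[0] << 24) | ((uint32_t)fp->hdr[1] << 16) |
                      ((uint32_t)fp->hdr[2] << 8) | (uint32_t)fp->hdr[3];
            if (fp->len > MAX_PAYLOAD) { *used = i; return -1; }
        }
    }
    if (fp->hdr_have == 4) {
        size_t need = fp->len - fp->payload_have;
        size_t take = (n - i < need) ? n - i : need;
        if (take > 0) memcpy(fp->payload + fp->payload_have, in + i, take);
        fp->payload_have += take;
        i += take;
    }
    *used = i;
    return (fp->hdr_have == 4 && fp->payload_have == fp->len) ? 1 : 0;
}
```

Fed the three chunks from the receive sequence above (2, 5, and 297 bytes), the function returns 0, 0, and then 1. If a chunk contains more bytes than the current record needs, *used is smaller than n, and the caller resets the parser and feeds the remainder as the start of the next record.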

Common Confusions

Watch Out

A successful send means the peer received the bytes

send() returning 4096 means the kernel accepted 4096 bytes into the local TCP send path. It does not mean the peer application read them. A later reset can still arrive if the peer process died or closed the connection with unread data.
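A local experiment makes the gap visible. The sketch below assumes a Unix-domain socketpair() as a stand-in for a TCP connection, and send_to_silent_peer is a hypothetical name. The peer never calls recv(), yet the first send() succeeds; only a later send(), after the peer has gone away, reports a problem.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo: returns what the first send() returned, and sets
 * *later_send_failed to 1 if a send() after the peer closed failed. */
static ssize_t send_to_silent_peer(int *later_send_failed) {
    int sv[2];
    char buf[4096];
    ssize_t first;

    *later_send_failed = 0;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) return -1;
    memset(buf, 'x', sizeof buf);

    /* The peer (sv[1]) never reads; the kernel buffers the bytes anyway. */
    first = send(sv[0], buf, sizeof buf, MSG_NOSIGNAL);

    /* The peer closes with the data still unread. */
    close(sv[1]);
    *later_send_failed = send(sv[0], buf, 1, MSG_NOSIGNAL) < 0;

    close(sv[0]);
    return first;
}
```

If the application needs to know that the peer processed a message, the protocol has to carry an explicit acknowledgement. The return value of send() cannot provide it.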

Watch Out

One write corresponds to one recv

TCP is a byte stream. Three client writes of 10 bytes can be observed as one server read of 30 bytes. One client write of 30 bytes can be observed as three reads of 10 bytes. Preserve byte order, not message boundaries.

Watch Out

Edge-triggered epoll is faster without other changes

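EPOLLET changes the obligation on the handler. If the handler reads one chunk and returns while bytes remain, the kernel is not required to report the socket again until new data arrives, so the leftover bytes can sit unread indefinitely. Use non-blocking descriptors and drain until EAGAIN.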

Exercises

Exercise (Core)

Problem

Construct the 16 bytes of a Linux sockaddr_in for IPv4 address 192.0.2.5, port 5000, and AF_INET == 2 on a little-endian host. Assume the padding bytes are zero.

Exercise (Core)

Problem

A non-blocking TCP socket is registered with level-triggered epoll for EPOLLIN. The kernel receive buffer holds 90 bytes. Your handler reads 32 bytes once and returns. What happens on the next epoll_wait() if no other thread reads from the socket?

Exercise (Advanced)

Problem

Write pseudocode for reading one length-prefixed record where the first 4 bytes are a big-endian unsigned length. The function must tolerate short reads and return failure on EOF before the full record.

References

Canonical:

Stevens, Fenner, and Rudoff, UNIX Network Programming, Volume 1 (2004), ch. 3, socket address structures and byte order
Stevens, Fenner, and Rudoff, UNIX Network Programming, Volume 1 (2004), ch. 4-6, elementary TCP socket calls and I/O models
Fall and Stevens, TCP/IP Illustrated, Volume 1 (2011), ch. 13, TCP connection management and state transitions
Fall and Stevens, TCP/IP Illustrated, Volume 1 (2011), ch. 14-15, TCP timeout, retransmission, and flow behavior
Tanenbaum and Wetherall, Computer Networks (2011), ch. 6, transport-layer services and TCP
Kerrisk, The Linux Programming Interface (2010), ch. 56, sockets API on Linux

Accessible:

Beej Jorgensen, Beej's Guide to Network Programming, sections 5-7
Linux man-pages project, socket(2), bind(2), listen(2), accept(2), connect(2), recv(2), send(2), epoll(7)
liburing project documentation, io_uring man pages and examples
