Why This Matters
A TCP server with one blocking accept() and one blocking recv() can stop serving new clients when a single client connects and sends no bytes. The kernel is doing the right thing: it parks the thread until the requested event occurs. The program is wrong if it assumes that a connected descriptor will promptly deliver one complete request.
Most production networking bugs near the socket API are small boundary mistakes. A write of 4096 bytes can return 1460. A read can split GET /x\r\n\r\n across two calls. A restarted server can fail with EADDRINUSE because earlier connections are in TIME_WAIT. These are not rare cases; they are the API contract.
Core Definitions
socket file descriptor
A socket file descriptor is an integer handle returned by socket(domain, type, protocol). On Unix-like systems it participates in the file descriptor table, so read, write, close, fcntl, and readiness APIs can operate on it. The descriptor names an endpoint object in the kernel, not a packet or a connection by itself.
address family and socket type
AF_INET selects IPv4 addresses and AF_INET6 selects IPv6 addresses. SOCK_STREAM selects a reliable byte stream, normally TCP for Internet sockets. SOCK_DGRAM selects datagrams, normally UDP. The address family describes how addresses are represented; the socket type describes the service model.
blocking socket
A blocking socket operation parks the calling thread until it can make progress, completes with an error, or is interrupted by a signal. recv() on a blocking TCP socket with no buffered bytes waits. send() on a blocking TCP socket can wait when the kernel send buffer is full.
readiness notification
A readiness API reports that a descriptor is likely to accept an operation without blocking. select, poll, epoll, and kqueue are readiness interfaces. They do not promise that a full application message is available.
The Socket Call and Address Bytes
The call
int fd = socket(AF_INET, SOCK_STREAM, 0);
asks the kernel for an IPv4 TCP endpoint. The 0 protocol lets the kernel choose the default protocol for the family and type. For AF_INET with SOCK_STREAM, that is TCP. For AF_INET with SOCK_DGRAM, it is UDP.
A server usually fills struct sockaddr_in before bind():
struct sockaddr_in a = {0};
a.sin_family = AF_INET;
a.sin_port = htons(8080);
a.sin_addr.s_addr = htonl(INADDR_LOOPBACK); // 127.0.0.1
On a little-endian Linux machine, this 16-byte structure for 127.0.0.1:8080 is commonly laid out as:
offset  bytes                    meaning
0       02 00                    sin_family = AF_INET (2), host byte order
2       1f 90                    sin_port = 8080, network byte order
4       7f 00 00 01              sin_addr = 127.0.0.1, network byte order
8       00 00 00 00 00 00 00 00  sin_zero padding
The mixed byte order is intentional. sin_family is local ABI data. sin_port and sin_addr are protocol fields and use network byte order, most significant byte first. The port calculation is $8080 = 0x1f90$, so the bytes on the wire are 1f 90, not 90 1f.
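The layout can be inspected directly by copying the structure into a byte array. This is a minimal sketch; the helper name sockaddr_in_bytes and its parameters are hypothetical, and the address argument is assumed to be in host byte order.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: build a sockaddr_in for the given IPv4 address and
 * port (both in host byte order) and copy its in-memory bytes to out.
 * out must have room for sizeof(struct sockaddr_in) bytes.
 * Returns the number of bytes copied. */
size_t sockaddr_in_bytes(uint32_t addr_host_order, uint16_t port,
                         unsigned char *out) {
    struct sockaddr_in a;
    memset(&a, 0, sizeof a);
    a.sin_family = AF_INET;                   /* local ABI field */
    a.sin_port = htons(port);                 /* network byte order */
    a.sin_addr.s_addr = htonl(addr_host_order);
    memcpy(out, &a, sizeof a);
    return sizeof a;
}
```

Calling sockaddr_in_bytes(INADDR_LOOPBACK, 8080, buf) leaves 1f 90 at offsets 2 and 3 and 7f 00 00 01 at offsets 4 through 7 regardless of host endianness, because htons() and htonl() do the conversion.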
For IPv6, use AF_INET6 and struct sockaddr_in6. In new code, getaddrinfo() is preferred because it resolves host names and service names into a list of ready-to-use address structures, covering both IPv4 and IPv6, without hard-coding structure types or sizes.
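As an illustration of family-independent address setup, the sketch below resolves a numeric host and port with getaddrinfo(). The helper name resolve_numeric is hypothetical, and the AI_NUMERICHOST and AI_NUMERICSERV flags are chosen here only so the example does not depend on DNS.

```c
#include <netdb.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: resolve a numeric host and numeric port into a
 * socket address without hard-coding the structure type.
 * AI_NUMERICHOST and AI_NUMERICSERV keep the lookup local (no DNS query).
 * On success, copies the first result into *out, stores its length in
 * *outlen, and returns the address family; returns -1 on failure. */
int resolve_numeric(const char *host, const char *port,
                    struct sockaddr_storage *out, socklen_t *outlen) {
    struct addrinfo hints, *res = NULL;
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;              /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_NUMERICHOST | AI_NUMERICSERV;
    if (getaddrinfo(host, port, &hints, &res) != 0 || res == NULL) return -1;
    int family = res->ai_family;
    memcpy(out, res->ai_addr, res->ai_addrlen);
    *outlen = res->ai_addrlen;
    freeaddrinfo(res);
    return family;
}
```

A real client would usually drop the numeric flags, walk the whole result list, and try socket() and connect() on each entry until one succeeds.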
Server Lifecycle: bind, listen, accept
A TCP server goes through a fixed sequence.
int fd = socket(AF_INET, SOCK_STREAM, 0);
bind(fd, (struct sockaddr *)&addr, sizeof addr);
listen(fd, 128);
int cfd = accept(fd, (struct sockaddr *)&peer, &peerlen);
bind() assigns a local address and port to the socket. If the port is 0, the kernel chooses an ephemeral port. listen() marks the socket as passive and creates kernel queues for connections still in the handshake and for completed connections. The backlog argument, such as 128, is not a portable exact queue length; it is a hint for how many completed connections may wait for accept(), and the kernel may adjust or cap it (Linux, for example, caps it at net.core.somaxconn).
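The port-0 behavior is easy to observe with getsockname(). The following is a minimal sketch with error checking; the helper name listen_loopback_ephemeral is hypothetical, and binding to 127.0.0.1 is an assumption made to keep the example local.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a TCP listening socket on 127.0.0.1 with a
 * kernel-chosen port. Stores the chosen port (host byte order) in *port.
 * Returns the listening descriptor, or -1 on error. */
int listen_loopback_ephemeral(unsigned short *port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    struct sockaddr_in a;
    socklen_t len = sizeof a;
    memset(&a, 0, sizeof a);
    a.sin_family = AF_INET;
    a.sin_port = htons(0);                    /* 0 = let the kernel choose */
    a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (bind(fd, (struct sockaddr *)&a, sizeof a) < 0 ||
        listen(fd, 128) < 0 ||
        getsockname(fd, (struct sockaddr *)&a, &len) < 0) {
        close(fd);
        return -1;
    }
    *port = ntohs(a.sin_port);                /* the port actually assigned */
    return fd;
}
```

This pattern is common in tests, where a fixed port such as 8080 might already be in use.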
accept() returns a new connected socket. The listening descriptor remains open and continues to accept more connections. This distinction matters:
fd = 3 listening socket, local 0.0.0.0:8080
cfd = 4 connected socket, local 127.0.0.1:8080, peer 127.0.0.1:51342
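The difference between the two descriptors can be observed from a single process. The sketch below is an illustration, not a server design: the names loopback_connect and port_of are hypothetical, and it relies on the kernel completing the loopback handshake into the listen queue before accept() is called.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical demo: open a listener on 127.0.0.1 (ephemeral port), connect
 * to it from the same process, then accept. On success fds[0] is the
 * listener, fds[1] the client, fds[2] the accepted socket.
 * Returns 0 on success, -1 on error (all descriptors closed). */
int loopback_connect(int fds[3]) {
    struct sockaddr_in a;
    socklen_t len = sizeof a;
    memset(&a, 0, sizeof a);
    a.sin_family = AF_INET;
    a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);   /* sin_port stays 0 */
    fds[0] = socket(AF_INET, SOCK_STREAM, 0);
    fds[1] = socket(AF_INET, SOCK_STREAM, 0);
    fds[2] = -1;
    if (fds[0] < 0 || fds[1] < 0) goto fail;
    if (bind(fds[0], (struct sockaddr *)&a, sizeof a) < 0) goto fail;
    if (listen(fds[0], 16) < 0) goto fail;
    if (getsockname(fds[0], (struct sockaddr *)&a, &len) < 0) goto fail;
    if (connect(fds[1], (struct sockaddr *)&a, sizeof a) < 0) goto fail;
    fds[2] = accept(fds[0], NULL, NULL);          /* new descriptor */
    if (fds[2] < 0) goto fail;
    return 0;
fail:
    for (int i = 0; i < 3; i++) if (fds[i] >= 0) close(fds[i]);
    return -1;
}

/* Port of the local (peer == 0) or remote (peer != 0) end in host byte
 * order, or -1 on error, for example when the socket is not connected. */
int port_of(int fd, int peer) {
    struct sockaddr_in a;
    socklen_t len = sizeof a;
    int rc = peer ? getpeername(fd, (struct sockaddr *)&a, &len)
                  : getsockname(fd, (struct sockaddr *)&a, &len);
    return rc < 0 ? -1 : ntohs(a.sin_port);
}
```

The accepted socket shares the listener's local port, its peer port equals the client's local port, and the listener itself has no peer, so port_of(listener, 1) fails.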
Set SO_REUSEADDR before bind() during development and for many TCP servers:
int yes = 1;
setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof yes);
This option reduces pain from TIME_WAIT sockets when a process restarts and binds the same local port. It is not a license to run two unrelated active servers on the same local address and port. The exact behavior varies by OS, but the safe habit is stable: set it before bind().
Client Lifecycle: socket and connect
A TCP client usually needs only socket() and connect().
int fd = socket(AF_INET, SOCK_STREAM, 0);
connect(fd, (struct sockaddr *)&server, sizeof server);
If the client does not call bind(), the kernel picks a source address and ephemeral source port. A completed connection is identified by the 4-tuple:
client ip 192.0.2.10
client port 51342
server ip 198.51.100.7
server port 443
Two clients on the same host can connect to the same server port because their local ephemeral ports differ. A TCP byte stream has no message boundaries. If the client writes 5 bytes and then 7 bytes, the server might read 12 bytes once, 3 then 9 bytes, or any split that preserves byte order.
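The missing message boundaries can be demonstrated without a network. The sketch below is an illustration under one stated assumption: an AF_UNIX SOCK_STREAM socketpair stands in for TCP, since both are byte streams. The function name two_writes_many_reads is hypothetical.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo: write "hello" (5 bytes) then "world!!" (7 bytes) into
 * one end of a stream socketpair, then read the other end in pieces of at
 * most chunk bytes. Copies what was received into out (capacity cap) and
 * returns the total, or -1 on error.
 * Assumption: AF_UNIX SOCK_STREAM stands in for TCP; both are byte streams. */
ssize_t two_writes_many_reads(size_t chunk, char *out, size_t cap) {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) return -1;
    if (write(sv[0], "hello", 5) != 5 || write(sv[0], "world!!", 7) != 7) {
        close(sv[0]); close(sv[1]);
        return -1;
    }
    close(sv[0]);                  /* reader sees EOF after the queued bytes */
    size_t total = 0;
    while (total < cap) {
        size_t want = cap - total < chunk ? cap - total : chunk;
        ssize_t n = read(sv[1], out + total, want);
        if (n < 0) { close(sv[1]); return -1; }
        if (n == 0) break;         /* end of stream */
        total += (size_t)n;
    }
    close(sv[1]);
    return (ssize_t)total;
}
```

With chunk set to 3, the second read returns "low", which spans the boundary between the two writes. The receiver sees 12 ordered bytes and nothing that marks where one write ended.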
send, recv, read, and write
On Unix-like systems, read(fd, buf, n) and write(fd, buf, n) work on stream sockets. recv() and send() add socket-specific flags, such as MSG_DONTWAIT, MSG_NOSIGNAL, and MSG_PEEK.
Use loops for fixed-size records. A correct sender treats a positive short return as progress, not failure:
#include <sys/socket.h>
#include <errno.h>
#include <stddef.h>
int send_all(int fd, const void *buf, size_t n) {
    const char *p = (const char *)buf;
    while (n > 0) {
        ssize_t k = send(fd, p, n, MSG_NOSIGNAL);
        if (k > 0) { p += k; n -= (size_t)k; continue; }
        if (k < 0 && errno == EINTR) continue;
        return -1;
    }
    return 0;
}
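A numeric example shows why the loop is mandatory. Suppose the user buffer has 4096 bytes. The first send() returns 1460 because the local send buffer, which drains only as fast as the peer's window allows, has room for one 1460-byte segment. The second returns 1460. The third returns 1176. The correct total is $1460 + 1460 + 1176 = 4096$. Treating the first return as completion silently drops $4096 - 1460 = 2636$ bytes at the application layer. Short counts are most common on non-blocking sockets, or on blocking sockets when a signal or send timeout interrupts a call after partial progress, but the loop is correct in every mode.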
Reads have a second boundary: recv() returning 0 on a TCP stream means orderly shutdown by the peer. It is not an empty message.
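The receive side needs the mirror image of send_all. The following is a minimal sketch for a blocking stream socket; the name recv_exact and its three-way return convention are assumptions made for this example.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: read exactly n bytes from a blocking stream socket.
 * Returns 1 on success, 0 if the peer shut down before n bytes arrived,
 * and -1 on error (errno is set by recv). */
int recv_exact(int fd, void *buf, size_t n) {
    char *p = buf;
    while (n > 0) {
        ssize_t k = recv(fd, p, n, 0);
        if (k > 0) { p += k; n -= (size_t)k; continue; }  /* partial progress */
        if (k == 0) return 0;                             /* orderly shutdown */
        if (errno == EINTR) continue;                     /* retry after signal */
        return -1;
    }
    return 1;
}
```

Keeping the 0 return distinct from -1 lets the caller tell a clean close from a failure such as ECONNRESET.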
Worked Example: Minimal TCP Echo Server
This server accepts one client at a time. It is intentionally blocking. The loops around recv() and send() are the part to keep.
// cc -Wall -Wextra -O2 echo.c -o echo
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <stdio.h>

// Send all n bytes, retrying after short counts and EINTR.
static int send_all(int fd, const char *p, size_t n) {
    while (n) {
        ssize_t k = send(fd, p, n, MSG_NOSIGNAL);
        if (k > 0) { p += k; n -= (size_t)k; continue; }
        if (k < 0 && errno == EINTR) continue;
        return -1;
    }
    return 0;
}

int main(void) {
    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0) { perror("socket"); return 1; }
    int yes = 1;
    if (setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof yes) < 0)
        perror("setsockopt");            // not fatal; bind() may still succeed
    struct sockaddr_in a = {0};
    a.sin_family = AF_INET;
    a.sin_port = htons(8080);
    a.sin_addr.s_addr = htonl(INADDR_ANY);
    if (bind(s, (struct sockaddr *)&a, sizeof a) < 0) { perror("bind"); return 1; }
    if (listen(s, 128) < 0) { perror("listen"); return 1; }
    for (;;) {
        int c = accept(s, NULL, NULL);
        if (c < 0) continue;             // e.g. EINTR or ECONNABORTED
        char buf[4096];
        for (;;) {
            ssize_t n = recv(c, buf, sizeof buf, 0);
            if (n > 0) {
                if (send_all(c, buf, (size_t)n) < 0) break;
            } else if (n == 0) {
                break;                   // peer closed the connection
            } else if (errno != EINTR) {
                break;                   // real error; EINTR just retries
            }
        }
        close(c);
    }
}
Test with nc 127.0.0.1 8080. This program serializes clients. If one client connects and never sends data, the process waits inside recv() and does not call accept() again.
Blocking, Non-Blocking, and Readiness APIs
A non-blocking descriptor returns -1 with errno == EAGAIN or EWOULDBLOCK when an operation would block.
#include <fcntl.h>
int flags = fcntl(fd, F_GETFL, 0);
fcntl(fd, F_SETFL, flags | O_NONBLOCK);
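The same two fcntl() calls can be wrapped with error checking, and the EAGAIN behavior observed directly. This is a minimal sketch; the names set_nonblocking and recv_would_block are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: add O_NONBLOCK to a descriptor, checking both fcntl
 * calls. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Hypothetical probe: returns 1 if recv() on a non-blocking fd would block
 * right now, 0 otherwise (data available, EOF, or a different error).
 * MSG_PEEK leaves any available data in the receive buffer. */
int recv_would_block(int fd) {
    char byte;
    ssize_t n = recv(fd, &byte, 1, MSG_PEEK);
    return n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK);
}
```

Both errno values are tested because POSIX allows EAGAIN and EWOULDBLOCK to be different numbers, even though they are the same on Linux.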
select() uses fixed-size bit sets and is limited by FD_SETSIZE in typical builds. poll() uses an array of struct pollfd and avoids bit-set limits, but each call still passes the full array to the kernel.
Linux epoll separates registration from waiting. A common shape is:
int ep = epoll_create1(EPOLL_CLOEXEC);
struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);
for (;;) {
    struct epoll_event out[64];
    int n = epoll_wait(ep, out, 64, -1);
    for (int i = 0; i < n; i++) handle(out[i].data.fd);
}
Level-triggered epoll repeats an event while the condition remains true. If a socket has 100 bytes buffered and your handler reads 20, the next epoll_wait() can report it again.
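The repeat behavior can be checked with a small Linux-only experiment. This is a sketch under stated assumptions: an AF_UNIX stream socketpair stands in for a TCP connection, and the function name lt_reports_again is hypothetical.

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo (Linux only): queue 100 bytes on a stream socketpair,
 * register the reading end with level-triggered EPOLLIN, read only 20
 * bytes, and poll again with a zero timeout. Returns the number of events
 * reported by the second epoll_wait(), or -1 on error. */
int lt_reports_again(void) {
    int sv[2], result = -1;
    char data[100] = {0}, buf[20];
    struct epoll_event out[4];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) return -1;
    int ep = epoll_create1(EPOLL_CLOEXEC);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = sv[1] };
    if (ep < 0) goto done;
    if (epoll_ctl(ep, EPOLL_CTL_ADD, sv[1], &ev) < 0) goto done;
    if (write(sv[0], data, sizeof data) != (ssize_t)sizeof data) goto done;
    if (epoll_wait(ep, out, 4, 0) != 1) goto done;        /* first report */
    if (read(sv[1], buf, sizeof buf) != (ssize_t)sizeof buf) goto done;
    result = epoll_wait(ep, out, 4, 0);                   /* 80 bytes remain */
done:
    if (ep >= 0) close(ep);
    close(sv[0]);
    close(sv[1]);
    return result;
}
```

Because 80 bytes are still buffered after the partial read, the second wait reports the descriptor again, so the function returns 1.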
Edge-triggered epoll reports transitions, such as “not readable” to “readable.” With EPOLLET, handlers must drain until EAGAIN:
for (;;) {
    ssize_t n = recv(fd, buf, sizeof buf, 0);
    if (n > 0) consume(buf, (size_t)n);
    else if (n == 0) { close(fd); break; }
    else if (errno == EAGAIN || errno == EWOULDBLOCK) break;
    else if (errno != EINTR) { close(fd); break; }
}
BSD kqueue uses filters such as EVFILT_READ and returns events through kevent(). Windows IOCP is a completion model rather than Unix readiness. The caller posts operations and later receives completed operations. Detailed Windows async I/O is outside this page.
io_uring is Linux’s modern submission and completion ring interface. User space writes submission queue entries, calls io_uring_enter() when needed, and later reads completion queue entries. For sockets, it can submit accepts, connects, sends, receives, and poll-style waits. The mental model is different from epoll: the program usually tracks operations in flight, not only descriptors that are ready.
Key Result
A correct stream-socket program maintains two invariants.
First, application framing is separate from TCP. If the protocol says each record has a 4-byte big-endian length followed by payload, the receiver must first read exactly 4 bytes, decode length $L$, then read exactly $L$ bytes. TCP does not preserve write boundaries.
Second, readiness is not completion. If an API reports readable, one recv() may return 1 byte, 4096 bytes, EAGAIN, or 0, depending on races, buffering, and peer shutdown. The handler is correct only if it accepts all of those outcomes.
For a length-prefixed record with payload length 300, the bytes start as:
00 00 01 2c ...300 payload bytes...
A possible receive sequence is:
recv #1: 2 bytes 00 00
recv #2: 5 bytes 01 2c 48 65 6c
recv #3: 297 bytes remaining
The parser state after the second call is “length known as 300, payload has 3 bytes.” Treating the second receive as a complete record corrupts the stream.
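The parser state described above can be kept in a small structure that is fed whatever each recv() happens to return. This is a sketch, not a full protocol implementation: the names record_parser and parser_feed and the MAX_PAYLOAD cap are assumptions introduced here, and the structure is expected to be zero-initialized before the first call.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { MAX_PAYLOAD = 4096 };   /* assumed cap for this sketch */

/* Hypothetical incremental parser for one record:
 * a 4-byte big-endian length, then that many payload bytes.
 * Zero-initialize before the first call to parser_feed(). */
struct record_parser {
    unsigned char hdr[4];
    size_t hdr_have;               /* header bytes collected so far (0..4) */
    uint32_t len;                  /* valid once hdr_have == 4 */
    unsigned char payload[MAX_PAYLOAD];
    size_t payload_have;           /* payload bytes collected so far */
};

/* Feed n received bytes. Returns 1 when the record is complete, 0 when more
 * bytes are needed, -1 if the declared length exceeds MAX_PAYLOAD.
 * *used reports how many input bytes were consumed, so bytes that belong
 * to the next record stay with the caller. */
int parser_feed(struct record_parser *p, const unsigned char *in, size_t n,
                size_t *used) {
    size_t i = 0;
    while (i < n && p->hdr_have < 4) p->hdr[p->hdr_have++] = in[i++];
    if (p->hdr_have < 4) { *used = i; return 0; }
    p->len = ((uint32_t)p->hdr[0] << 24) | ((uint32_t)p->hdr[1] << 16) |
             ((uint32_t)p->hdr[2] << 8) | (uint32_t)p->hdr[3];
    if (p->len > MAX_PAYLOAD) { *used = i; return -1; }
    size_t want = p->len - p->payload_have;
    size_t take = n - i < want ? n - i : want;
    memcpy(p->payload + p->payload_have, in + i, take);
    p->payload_have += take;
    *used = i + take;
    return p->payload_have == p->len;
}
```

Feeding the three chunks from the sequence above (2, 5, and 297 bytes) returns 0, 0, and 1; after the second call the structure holds len == 300 and payload_have == 3. Rejecting oversized lengths before copying keeps a corrupt or hostile header from overrunning the buffer.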
Common Confusions
A successful send means the peer received the bytes
send() returning 4096 means the kernel accepted 4096 bytes into the local TCP send path. It does not mean the peer application read them. A later reset can still arrive if the peer process died or closed the connection with unread data.
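A loopback experiment makes the gap visible: send() can succeed while the receiving application has not even accepted the connection. The sketch below is an illustration only; the function name send_to_silent_peer is hypothetical, and the result assumes default Linux buffer sizes, which comfortably hold 4096 bytes.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo: connect to a loopback listener that never calls
 * accept() or recv(), then send 4096 bytes. Returns what send() returned,
 * or -1 if setup failed. */
ssize_t send_to_silent_peer(void) {
    char data[4096];
    struct sockaddr_in a;
    socklen_t len = sizeof a;
    ssize_t k = -1;
    memset(data, 'x', sizeof data);
    memset(&a, 0, sizeof a);
    a.sin_family = AF_INET;
    a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);   /* port 0: kernel picks */
    int l = socket(AF_INET, SOCK_STREAM, 0);
    int c = socket(AF_INET, SOCK_STREAM, 0);
    if (l >= 0 && c >= 0 &&
        bind(l, (struct sockaddr *)&a, sizeof a) == 0 &&
        listen(l, 16) == 0 &&
        getsockname(l, (struct sockaddr *)&a, &len) == 0 &&
        connect(c, (struct sockaddr *)&a, sizeof a) == 0) {
        /* No accept(), no recv(): the kernel still takes the bytes. */
        k = send(c, data, sizeof data, MSG_NOSIGNAL);
    }
    if (c >= 0) close(c);
    if (l >= 0) close(l);
    return k;
}
```

Under those assumptions the call returns 4096 even though no application ever reads the data. Delivery to the peer application has to be confirmed by the application protocol, for example with an explicit reply.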
One write corresponds to one recv
TCP is a byte stream. Three client writes of 10 bytes can be observed as one server read of 30 bytes. One client write of 30 bytes can be observed as three reads of 10 bytes. TCP preserves byte order, not message boundaries.
Edge-triggered epoll is faster without other changes
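EPOLLET changes the obligation on the handler. If the handler reads one chunk and returns while bytes remain, the kernel is not required to report the descriptor again until new data arrives, so the leftover bytes can sit unread indefinitely. Use non-blocking descriptors and drain until EAGAIN.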
Exercises
Problem
Construct the 16 bytes of a Linux sockaddr_in for IPv4 address 192.0.2.5, port 5000, and AF_INET == 2 on a little-endian host. Assume the padding bytes are zero.
Problem
A non-blocking TCP socket is registered with level-triggered epoll for EPOLLIN. The kernel receive buffer holds 90 bytes. Your handler reads 32 bytes once and returns. What happens on the next epoll_wait() if no other thread reads from the socket?
Problem
Write pseudocode for reading one length-prefixed record where the first 4 bytes are a big-endian unsigned length. The function must tolerate short reads and return failure on EOF before the full record.
References
Canonical:
- Stevens, Fenner, and Rudoff, UNIX Network Programming, Volume 1 (2004), ch. 3, socket address structures and byte order
- Stevens, Fenner, and Rudoff, UNIX Network Programming, Volume 1 (2004), ch. 4-6, elementary TCP socket calls and I/O models
- Fall and Stevens, TCP/IP Illustrated, Volume 1 (2011), ch. 13, TCP connection management and state transitions
- Fall and Stevens, TCP/IP Illustrated, Volume 1 (2011), ch. 14-15, TCP timeout, retransmission, and data flow
- Tanenbaum and Wetherall, Computer Networks (2011), ch. 6, transport-layer services and TCP
- Kerrisk, The Linux Programming Interface (2010), ch. 56, sockets API on Linux
Accessible:
- Beej Jorgensen, Beej's Guide to Network Programming, sections 5-7
- Linux man-pages project, socket(2), bind(2), listen(2), accept(2), connect(2), recv(2), send(2), epoll(7)
- liburing project documentation, io_uring man pages and examples
Next Topics
- /computationpath/socket-server-lab
- /computationpath/tcp-state-machine
- /computationpath/event-loops-and-reactors
- /computationpath/io-uring-essentials