Journey of a Request

NOTE

This post is part of the Distributed Systems in Practice series: real-world explorations of the edge cases, ambiguity, and invisible complexity of running software across machines.


It begins innocently: a node emits an RPC to fetch a weather report. Thirty seconds elapse. Then — TimeoutException. What exactly failed?

It’s unknowable.

Perhaps the request was dropped. Perhaps the remote node crashed mid-processing. Perhaps the response returned, but was throttled by congestion or delayed in a queue. Once a request exits your system boundary, its fate is hidden — obscured by a thicket of physical infrastructure, operating systems, and protocols.

This opacity is not incidental. It is foundational to distributed systems.


The Stack Beneath

At the top of the stack, the application composes an RPC — often over gRPC, using HTTP/2 and Protocol Buffers. At Indeed, legacy protocols like BoxCar have been deprecated in favor of gRPC for its stronger tooling and standardization.
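
Concretely, that top-of-stack step looks something like the sketch below. The service name, host, and the weather_pb2 modules are hypothetical stand-ins (they would come from a .proto definition compiled by protoc); only the grpc calls themselves are real API.

```python
import grpc

# Hypothetical generated code: weather_pb2 / weather_pb2_grpc would be produced
# by protoc from a WeatherService .proto definition. Host and port are made up.
import weather_pb2
import weather_pb2_grpc

channel = grpc.insecure_channel("weather.internal:50051")
stub = weather_pb2_grpc.WeatherServiceStub(channel)

try:
    # The 30-second deadline from the opening example.
    report = stub.GetReport(weather_pb2.ReportRequest(city="Austin"), timeout=30)
except grpc.RpcError as err:
    if err.code() == grpc.StatusCode.DEADLINE_EXCEEDED:
        # All the client learns here is that 30 seconds passed. The request may
        # never have arrived, the server may have died mid-call, or the response
        # may still be stuck in a queue somewhere.
        raise
```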

This RPC is handed down to the transport layer. TCP splits it into segments and ensures their ordered delivery. IP wraps each segment into a packet. The link layer encapsulates packets into frames, adding MAC addresses and checksums. And finally, the bits traverse copper, fiber, or air.
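
The nesting itself can be made concrete with a toy illustration. The headers below are drastically simplified (real TCP, IP, and Ethernet headers carry many more fields, options, and checksums); the point is only that each layer wraps the one above it.

```python
import struct

payload = b'{"city": "Austin"}'  # the application's RPC body

# Toy "TCP" header: source port, destination port, sequence number.
tcp_segment = struct.pack("!HHI", 54321, 443, 1) + payload

# Toy "IP" header: version byte, total length, source and destination addresses.
ip_packet = struct.pack("!BH4s4s", 0x45, 11 + len(tcp_segment),
                        bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2])) + tcp_segment

# Toy "Ethernet" frame: destination MAC, source MAC, EtherType 0x0800 (IPv4).
frame = (bytes.fromhex("aabbccddeeff") + bytes.fromhex("112233445566")
         + struct.pack("!H", 0x0800) + ip_packet)

print(len(payload), len(tcp_segment), len(ip_packet), len(frame))
# Each length is strictly larger than the last: payload < segment < packet < frame.
```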

At every step, failure is possible. Bit flips. Misrouted packets. Congestion. Dropped packets.


On Loss and Delay

Cables get severed — sometimes by construction equipment, sometimes by sharks. Data center routers fail, switches brown out, BGP routes flap. DNS queries time out. Application servers crash. Sometimes requests disappear because the network stack on the receiving VM was not ready to accept new connections.

But not all failures are terminal. Many are simply delays. Congestion on the switch. Queues building in NIC buffers. CPU contention delaying thread scheduling. VMs waiting on hypervisors for compute slices. Cloud tenants saturating shared bandwidth. These are not malfunctions — they are operational realities.

Timeouts, therefore, are epistemological failures. They don’t mean the request failed. They mean the system gave up waiting.


The Mirage of Timeliness

Distributed systems operate without global clocks. Lamport showed us this decades ago. What we call “time” is an approximation bounded by delay, jitter, and drift. Latency isn’t a scalar — it’s a probability distribution.
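
One way to make that concrete: collect observed round-trip times and look at percentiles rather than a single number. The samples below are synthetic, generated purely for illustration.

```python
import random
import statistics

random.seed(7)

# Synthetic round-trip times in milliseconds: mostly fast, with a long slow tail.
samples = [random.lognormvariate(3.0, 0.6) for _ in range(10_000)]

cuts = statistics.quantiles(samples, n=100)   # 99 percentile cut points
p50, p99 = cuts[49], cuts[98]

print(f"mean={statistics.mean(samples):.1f} ms  p50={p50:.1f} ms  p99={p99:.1f} ms")
# A fixed timeout is really a bet about which percentile of this distribution
# you are willing to wait for; the mean says almost nothing about the tail.
```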

Even if a remote node is up and healthy, the path may be long or contended. Garbage collection pauses the process. TCP backpressure slows transmission. VM migration halts the guest temporarily. These are expected behaviors, not anomalies.

And yet, every additional hop, queue, or context switch becomes a potential contributor to a timeout.


The Real Cost: Partial Failure

The core difficulty of distributed systems isn’t failure. It’s partial failure.

Node A believes Node B is down. B is actually alive but temporarily unreachable. When connectivity resumes, B may act on old messages, introducing state divergence. This is the origin of split-brain, duplicate actions, and violated invariants.

Consensus protocols like Paxos and Raft were invented to tame this ambiguity — to ensure that, despite faults and partitions, the system converges. But they can’t eliminate uncertainty. Only contain it.


Designing Within Uncertainty

This is the engineer’s real challenge: how to build correctness atop uncertainty.

  • Idempotency is non-negotiable. Duplicate messages must not duplicate effects.
  • Observability must be intrinsic, not bolted on. Every request path should be traceable.
  • Retries should back off intelligently, not flood the system under duress (see the sketch after this list).
  • Degradation should be graceful — not catastrophic.
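
As a sketch of how the first and third points combine, here is one shape a retry helper can take: exponential backoff with full jitter, plus an idempotency key reused across attempts so a request that actually succeeded before the timeout is not applied twice. The send_request callable and its idempotency_key parameter are illustrative, not a specific Indeed API.

```python
import random
import time
import uuid


def call_with_retries(send_request, payload, max_attempts=5,
                      base_delay=0.2, max_delay=5.0):
    """Retry an RPC with exponential backoff and full jitter.

    `send_request` is any callable that performs the request and raises on
    failure; the idempotency key lets the server deduplicate attempts it has
    already applied, so duplicate messages do not duplicate effects.
    """
    idempotency_key = str(uuid.uuid4())  # same key for every attempt of this logical request
    for attempt in range(max_attempts):
        try:
            return send_request(payload, idempotency_key=idempotency_key)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the exponential cap,
            # so a fleet of clients does not retry in lockstep.
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
```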

Resilience isn’t a property. It’s a posture. One that assumes no component, no packet, no clock can be fully trusted.


Closing Remarks

In distributed systems, ambiguity is the rule, not the exception. Timeouts are not bugs; they are expressions of our limited epistemology — symptoms of a system designed to survive despite its inability to fully know.

Our task isn’t to eliminate failure. It’s to make failure survivable, debuggable, and — eventually — boring.