Inside Amazon’s new data center network architecture: quasi-random network topology and passive optical devices
Amazon says it has achieved a major breakthrough in data center network (DCN) architecture and has been quietly deploying the new technology in its data centers since late last year. The company detailed the design in a paper published last month titled “RNG: Flat Data Center Networks at Scale.” RNG, or “resilient network graphs,” is built around new passive optical hardware and a quasi-random topology that combines elements of traditional, structured data networks with the performance advantages of more random architectures.
The goal is to move off conventional hierarchical “fat-tree” designs toward a flatter, more mesh-like fabric that uses far fewer routers and switches, offers more parallel paths, and therefore delivers higher effective throughput at lower power and capex.
“RNG is a great fit for our core demands, but AI training data patterns are far more coordinated and centrally orchestrated, so they don’t approximate a random graph. By essentially flattening the network, we eliminated the bottlenecks that come with traditional networking designs,” Matt Rehder, vice president of AWS Network Engineering, said in an exclusive interview with WIRED. “We think we’re the only ones who have done this at scale.”
The fact that Amazon is using this in the real world is “remarkable,” said Brighten Godfrey, a computer science professor at the University of Illinois Urbana-Champaign and an expert in networking, who was not involved in Amazon’s research. Godfrey coauthored a seminal 2012 paper on random network graphs, which he says are a “mind-bending problem to solve, in general.”
Classic cloud DCNs use structured topologies (Clos/fat-tree) where paths are highly regular and layered (ToR–aggregation–core). By contrast, random-graph theory says the most efficient routing networks are flat random graphs: each node connects to a small random subset of others, creating many short, diverse paths and graceful degradation under failures. The problem has always been practical: random cabling at scale is unmanageable, and routing across a huge random graph is nontrivial.
AWS’s “quasi-random” design essentially mixes determinism with randomness: key structural elements are fixed to keep the cabling and deployment manageable, while enough randomness is retained in the interconnect pattern to get the performance and resilience benefits of random graphs. The physical enabler is a new passive optical device called a ShuffleBox that standardizes how switches connect and internally permutes links so that, when many ShuffleBoxes are wired together, the resulting global topology is quasi-random without having to hand-design every link.
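To make the idea of mixing fixed structure with randomness concrete, here is a minimal Python sketch. It is not AWS’s ShuffleBox wiring, which is not specified here; it assumes a toy model in which a fixed ring stands in for the deterministic part and a few random pairings stand in for the randomized interconnect. The function names (`build_quasi_random_graph`, `mean_hops`) and the sizes used are illustrative assumptions.

```python
import random
from collections import deque


def build_quasi_random_graph(n, extra_links, seed=0):
    """Toy model: a fixed ring plus `extra_links` random pairings of switches.

    Assumption for illustration only; this is not the ShuffleBox permutation.
    """
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    # Deterministic part: a ring keeps the graph connected and easy to cable.
    for v in range(n):
        w = (v + 1) % n
        adj[v].add(w)
        adj[w].add(v)
    # Random part: each round pairs up switches at random (one extra port each).
    for _ in range(extra_links):
        nodes = list(range(n))
        rng.shuffle(nodes)
        for a, b in zip(nodes[0::2], nodes[1::2]):
            adj[a].add(b)
            adj[b].add(a)
    return adj


def mean_hops(adj):
    """Average shortest-path length over all reachable ordered pairs (BFS)."""
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


if __name__ == "__main__":
    ring_only = build_quasi_random_graph(64, extra_links=0)
    mixed = build_quasi_random_graph(64, extra_links=3)
    print("ring only :", round(mean_hops(ring_only), 2), "hops on average")
    print("ring + random links:", round(mean_hops(mixed), 2), "hops on average")
```

In this toy model the 64-switch ring alone averages roughly 16 hops between switches. Adding a handful of random links per switch can only shorten paths, and typically shortens them considerably, which is the property the flat random-graph argument relies on.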

Image Credit: Amazon
Key architectural pieces and claimed gains:
AWS reports that RNG-based fabrics now serve as the default network architecture for most new AWS data centers, after initial deployments beginning in 2024. The company claims the design:
- Uses roughly 69% fewer routers/switches than traditional fat-tree DCNs, because the network is flatter and relies more on passive optical fanout.
- Delivers up to about 33% higher throughput, due to more independent paths and better load spreading.
- Cuts network equipment power consumption by on the order of 40%, with associated reductions in cooling and operational overhead.
On the control-plane side, AWS developed a routing scheme called Spraypoint. Instead of always following a strict shortest path from source to destination, Spraypoint first “sprays” traffic randomly to neighbors, then directs it via preselected “waypoints” using more conventional shortest-path routing. This hybrid behavior exploits the quasi-random topology to open many more independent paths than standard ECMP-style shortest-path routing would, which in turn improves utilization and resilience under congestion or failures.
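The spray-then-waypoint behavior described above can be sketched roughly as follows. This is an illustrative reading of the description, not AWS’s implementation: the six-switch `TOY_FABRIC`, the function names, and the choice of picking one waypoint uniformly at random are all assumptions.

```python
import random
from collections import deque

# Hypothetical 6-switch fabric (each switch has 3 links), for illustration only.
TOY_FABRIC = {
    0: [1, 2, 3],
    1: [0, 2, 4],
    2: [0, 1, 5],
    3: [0, 4, 5],
    4: [1, 3, 5],
    5: [2, 3, 4],
}


def shortest_path(adj, src, dst):
    """BFS shortest path from src to dst, inclusive. Assumes dst is reachable."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    path = []
    node = dst
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def spray_then_waypoint(adj, src, dst, waypoints, rng):
    """Spray to a random neighbor, then go neighbor -> waypoint -> destination.

    Both later legs use ordinary shortest-path routing. The result may revisit
    a switch; a real scheme would presumably avoid that, but the sketch does not.
    """
    first_hop = rng.choice(sorted(adj[src]))   # the random "spray" step
    waypoint = rng.choice(waypoints)           # assumed: uniform waypoint choice
    leg_to_waypoint = shortest_path(adj, first_hop, waypoint)
    leg_to_dst = shortest_path(adj, waypoint, dst)
    return [src] + leg_to_waypoint + leg_to_dst[1:]


if __name__ == "__main__":
    rng = random.Random(0)
    seen = {tuple(spray_then_waypoint(TOY_FABRIC, 0, 5, [3, 4], rng))
            for _ in range(50)}
    for path in sorted(seen):
        print(path)
```

Plain shortest-path routing would use only the two-hop routes between switches 0 and 5 in this toy fabric. The spray and waypoint steps deliberately admit longer routes as well, which is where the extra path diversity comes from.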
Strategic implications:
For AWS’s cloud and AI build-out, this is positioned as a foundational infrastructure advantage: higher bisection bandwidth and lower network energy per bit directly benefit large-scale AI training clusters, storage backends, and multi-tenant cloud workloads. Fewer active devices and more passive optics also translate into lower capex and opex at hyperscale, so AWS is framing this as both a performance and cost/sustainability play that could save billions of dollars and reduce CO₂ emissions over time.
From a networking-theory standpoint, this is notable as one of the first reported at-scale, production deployments of a flat random-graph-inspired topology in a hyperscale DCN, rather than a purely academic or lab system.
In a quasi-random topology like AWS’s RNG fabric, the impact on latency and jitter comes from three main effects: path length distribution, load spreading, and failure behavior.
Baseline latency: path lengths and device count:
In a traditional Clos/fat-tree, average latency is dominated by a fixed number of stages (ToR → agg → core → agg → ToR), so hop count is tightly controlled but you pay for many active devices. A quasi-random, flat graph replaces that rigid hierarchy with many short, irregular paths; on average, shortest paths between any two switches are similar or slightly shorter in hop count than in a fat-tree, and there are fewer active routers in the path because the architecture offloads fanout to passive optics. That tends to keep or slightly reduce median/mean latency per flow, especially under moderate load, because packets traverse fewer serialized queueing points even if the physical graph looks “messier.”
Jitter: congestion and path diversity:
Jitter is driven much more by variable queueing delay than by fixed propagation or serialization. In a quasi-random fabric with many alternate paths and a load-balancing scheme like Spraypoint (random spray + waypoint-based shortest paths), flows can be spread more evenly across the network, reducing hot spots and thus reducing the variance of queueing delay across packets. That can lower jitter compared with a Clos under the same aggregate load, because the system is less likely to funnel many flows through the same few congested uplinks or spine devices.
However, because the routing intentionally uses many different paths, per-flow packet reordering becomes more likely unless constrained by per-flow hashing or waypointing, which can show up as effective jitter at higher layers. AWS’s description of Spraypoint suggests they mitigate this by using waypoints and policy to preserve some path structure, so you get the diversity benefits without unconstrained per-packet spraying.
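One common way to limit reordering is per-flow hashing: every packet of a flow maps to the same path, while different flows still spread out. The sketch below shows that general technique under stated assumptions; it is not a description of how Spraypoint works internally, and the `pick_path` name, the five-field flow tuple, and the example paths are hypothetical.

```python
import hashlib


def pick_path(flow, candidate_paths):
    """Map a flow identifier to one candidate path, stably.

    `flow` is assumed to be a tuple such as
    (src_ip, dst_ip, src_port, dst_port, protocol). A cryptographic hash is
    used only because it is stable across runs; a switch would use a cheaper
    hardware hash.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(candidate_paths)
    return candidate_paths[index]


if __name__ == "__main__":
    # Hypothetical candidate paths between a source "s" and destination "d".
    paths = [["s", "a", "d"], ["s", "b", "d"], ["s", "c", "d"], ["s", "e", "d"]]
    flow = ("10.0.0.1", "10.0.0.2", 40000, 443, "tcp")
    # Every packet of this flow takes the same path, so it cannot be reordered
    # by path differences.
    print(pick_path(flow, paths))
```

The trade-off is that a single large flow cannot use more than one path, so per-flow consistency gives up some load balancing in exchange for in-order delivery.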
Under failure and high load:
Where quasi-random really helps latency/jitter is under failure and partial congestion. In a Clos, link or spine failures can force large sets of flows to converge on a smaller subset of remaining equal-cost paths, driving up queueing delay and jitter nonlinearly. In a resilient random-graph-style fabric, node/edge failures simply remove a few edges from a highly connected graph; there are typically many alternative short paths, so the increase in hop count and queueing pressure is smaller and more diffuse. That tends to keep tail latency and jitter (P99, P99.9) better behaved, even if median latency looks similar to a Clos at low load.
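The failure argument can be explored with a small experiment: build a highly connected graph, remove a fraction of links at random, and compare average hop counts before and after. The sketch below assumes a toy topology (a ring plus random pairings) rather than the real RNG fabric, and all names and parameters are illustrative.

```python
import random
from collections import deque


def build_links(n, matchings, rng):
    """Toy topology: ring plus random pairings, as a set of undirected links."""
    links = {frozenset((v, (v + 1) % n)) for v in range(n)}
    for _ in range(matchings):
        nodes = list(range(n))
        rng.shuffle(nodes)
        links.update(frozenset(p) for p in zip(nodes[0::2], nodes[1::2]))
    return links


def hop_stats(n, links):
    """Return (mean hops over reachable pairs, count of reachable ordered pairs)."""
    adj = {v: [] for v in range(n)}
    for link in links:
        a, b = tuple(link)
        adj[a].append(b)
        adj[b].append(a)
    total = pairs = 0
    for src in range(n):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs, pairs


def fail_links(links, fraction, rng):
    """Remove a random `fraction` of links and return the survivors."""
    ordered = sorted(tuple(sorted(link)) for link in links)
    doomed = rng.sample(ordered, int(len(ordered) * fraction))
    return links - {frozenset(link) for link in doomed}


if __name__ == "__main__":
    rng = random.Random(1)
    n = 64
    links = build_links(n, matchings=3, rng=rng)
    before, _ = hop_stats(n, links)
    after, reachable = hop_stats(n, fail_links(links, 0.05, rng))
    print(f"mean hops before failures: {before:.2f}")
    print(f"mean hops after 5% link loss: {after:.2f}")
    print(f"reachable ordered pairs: {reachable} of {n * (n - 1)}")
```

Hop count is only a proxy here. The claim about tail latency also depends on how traffic is redistributed onto the surviving links, which this sketch does not model.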
So, qualitatively: median latency is roughly comparable to a well-designed Clos, sometimes better due to fewer active stages; jitter and tail latency should improve under realistic, bursty load and failure scenarios, provided the routing stack is designed to limit packet reordering.
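The nonlinear cost of concentrating traffic on fewer devices can be illustrated with a textbook queueing approximation. The sketch below assumes each spine behaves like an M/M/1 queue, where mean delay scales as 1 / (1 - utilization), and that traffic from failed spines is shared evenly by the survivors. Both are simplifying assumptions, and the load figures are hypothetical rather than AWS measurements.

```python
def surviving_utilization(utilization, spines, failed):
    """Utilization of each surviving spine after `failed` spines drop out.

    Assumes the same total traffic is split evenly across the survivors.
    """
    if not 0 <= failed < spines:
        raise ValueError("need at least one surviving spine")
    return utilization * spines / (spines - failed)


def mm1_delay_factor(utilization):
    """Mean time in an M/M/1 queue, in units of the service time."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return 1.0 / (1.0 - utilization)


if __name__ == "__main__":
    # Hypothetical Clos layer: 8 spines, each 70% utilized before any failure.
    for failed in range(3):
        rho = surviving_utilization(0.7, spines=8, failed=failed)
        print(f"{failed} failed: utilization {rho:.2f}, "
              f"delay factor {mm1_delay_factor(rho):.1f}x")
```

With these hypothetical numbers, losing one of eight spines raises survivor utilization from 70% to 80%, and losing two raises it to about 93%. Under the M/M/1 assumption the delay factor goes from roughly 3.3 to 5 to 15, so the second failure hurts far more than the first.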
Summary and Conclusions:
Quasi-random data center topologies like AWS’s RNG fabric replace rigid Clos/fat-tree hierarchies with a flatter, graph-like network that preserves short path lengths while dramatically increasing path diversity. This tends to hold median latency roughly steady, or slightly better, by reducing the number of active, queueing devices per path and offloading fanout to passive optics. The main gains are in jitter and tail latency: spreading flows across many alternative routes makes congestion less concentrated and queueing delays less bursty, which keeps P99/P99.9 behavior more stable under failures and hot spots. That holds provided the routing layer (for example, AWS’s Spraypoint approach) constrains packet reordering through waypointing or per-flow consistency.
The conclusion is that quasi-random fabrics are less about shaving a few microseconds off baseline latency and more about delivering more predictable end-to-end performance—especially for east–west, latency-sensitive cloud and AI workloads—by trading rigid structure for statistically robust, highly connected graphs that degrade more gracefully when links, nodes, or traffic patterns become pathological.
References:
https://www.wired.com/story/amazon-aws-ceo-matt-garman-ai-agents/

