Inside Amazon’s new data center network architecture: quasi random network topology and passive optical devices

Amazon Web Services (AWS) says it has achieved a major breakthrough in data center network (DCN) architecture and has been quietly deploying the new technology in its data centers since late last year. Amazon detailed the design in a paper published May 21st titled “RNG: Flat Data Center Networks at Scale.” RNG, or “resilient network graphs,” is built around new passive optical hardware and a quasi-random topology that combines elements of traditional, structured data center networks with the performance advantages of more random architectures.

The goal is to move off conventional hierarchical “fat-tree” designs toward a flatter, more mesh-like fabric that uses far fewer routers and switches, offers more parallel paths, and therefore delivers higher effective throughput at lower power and capex.

“By essentially flattening the network, we eliminated the bottlenecks that come with traditional networking designs,” Matt Rehder, vice president of AWS Network Engineering, said in an exclusive interview with WIRED. “We think we’re the only ones who have done this at scale. RNG is a great fit for our core demands, but AI training data patterns are far more coordinated and centrally orchestrated, so they don’t approximate a random graph.”

The fact that Amazon is using this in the real world is “remarkable,” said Brighten Godfrey, a computer science professor at the University of Illinois Urbana-Champaign and an expert in networking, who was not involved in Amazon’s research. Godfrey coauthored a seminal 2012 paper on random network graphs, which he says are a “mind-bending problem to solve, in general.”

Classic cloud DCNs use structured topologies (Clos/fat-tree) where paths are highly regular and layered (Top of Rack (ToR)–aggregation–core). By contrast, random-graph theory says the most efficient routing networks are flat random graphs: each node connects to a small random subset of others, creating many short, diverse paths and graceful degradation under failures. The problem has always been practical: random cabling at scale is unmanageable, and routing across a huge random graph is nontrivial.
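
The short-path property of random graphs is easy to check numerically. The sketch below is only an illustration, not AWS's topology: it assumes a toy model in which every node picks k random peers, and the names random_graph and mean_hops are made up for this example. It measures the average hop count with a breadth-first search.

```python
import random
from collections import deque

def random_graph(n, k, seed=0):
    """Toy undirected graph: every node picks k random peers (assumed model)."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for v in range(n):
        candidates = [u for u in range(n) if u != v]
        for u in rng.sample(candidates, k):
            adj[v].add(u)
            adj[u].add(v)
    return adj

def hop_counts(adj, src):
    """Breadth-first search: hop count from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist

def mean_hops(adj):
    """Average shortest-path length over all reachable ordered pairs."""
    total = pairs = 0
    for src in adj:
        for dst, d in hop_counts(adj, src).items():
            if dst != src:
                total += d
                pairs += 1
    return total / pairs

if __name__ == "__main__":
    # Quadrupling the node count should add only a fraction of a hop.
    for n in (64, 256):
        print(n, round(mean_hops(random_graph(n, k=4)), 2))
```

In this toy model the mean hop count grows slowly as the node count increases, which is the property that makes flat random graphs attractive. The sketch says nothing about cabling or routing state, which are the practical obstacles.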

AWS’s “quasi-random” design essentially mixes determinism with randomness: key structural elements are fixed to keep the cabling and deployment manageable, while enough randomness is retained in the interconnect pattern to get the performance and resilience benefits of random graphs. The physical enabler is a new passive optical device called a ShuffleBox that standardizes how switches connect and internally permutes links so that, when many ShuffleBoxes are wired together, the resulting global topology is quasi-random without having to hand-design every link.
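
One way to picture how a passive box can turn regular cabling into an irregular topology is sketched below. This is a hypothetical model, not the actual ShuffleBox internals, which the article does not describe. It assumes each box splices front port i to back port perm[i], and that two boxes are trunked back-to-back with straight-through cables. The names make_shufflebox and wire_pair are invented for the example.

```python
import random
from collections import Counter

def make_shufflebox(num_ports, seed):
    """Assumed model: front port i is spliced to back port perm[i]."""
    perm = list(range(num_ports))
    random.Random(seed).shuffle(perm)
    return perm

def wire_pair(perm_a, perm_b, ports_per_switch):
    """Switch-level links created by two boxes joined back-to-back.

    All cabling is regular: switch s plugs into front ports
    s*p .. s*p + p - 1 of its box, and back port j of box A is trunked
    straight to back port j of box B. Only the splices inside the
    boxes are shuffled.
    """
    inverse_b = [0] * len(perm_b)
    for front, back in enumerate(perm_b):
        inverse_b[back] = front
    links = []
    for front_a, back in enumerate(perm_a):
        front_b = inverse_b[back]
        links.append((("A", front_a // ports_per_switch),
                      ("B", front_b // ports_per_switch)))
    return links

if __name__ == "__main__":
    # 4 switches per side, 4 uplinks each: 16 ports per box.
    links = wire_pair(make_shufflebox(16, seed=1),
                      make_shufflebox(16, seed=2),
                      ports_per_switch=4)
    for (side_a, sw_a), (side_b, sw_b) in links:
        print(f"{side_a}{sw_a} <-> {side_b}{sw_b}")
    print(Counter(a for a, _ in links))
```

Every switch still ends up with exactly as many links as it has uplinks, and the technician only ever plugs cables in port order. The irregularity lives entirely in the fixed permutations.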

………………………………………………………………………………………………………………………….

Key architectural pieces and claimed gains:

AWS reports that RNG-based fabrics now serve as the default network architecture for most new AWS data centers, after initial deployments beginning in 2024. The company claims the design:

Uses roughly 69% fewer routers/switches than traditional fat-tree DCNs, because the network is flatter and relies more on passive optical fanout.
Delivers up to about 33% higher throughput, due to more independent paths and better load spreading.
Cuts network equipment power consumption by on the order of 40%, with associated reductions in cooling and operational overhead.
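
To make the percentages concrete, the arithmetic below applies them to a made-up baseline. The 1,000-switch and 500 kW figures are assumptions chosen for round numbers, and the throughput-per-watt ratio assumes the throughput and power claims hold at the same time, which the reported "up to" figures do not guarantee.

```python
# Percentages are the figures AWS reports; the baseline numbers are made up.
FEWER_SWITCHES = 0.69   # roughly 69% fewer routers/switches
THROUGHPUT_GAIN = 0.33  # up to about 33% higher throughput
POWER_CUT = 0.40        # on the order of 40% less equipment power

def apply_claims(baseline_switches, baseline_power_kw):
    """Return (switch count, power in kW, relative throughput per watt)."""
    switches = baseline_switches * (1 - FEWER_SWITCHES)
    power_kw = baseline_power_kw * (1 - POWER_CUT)
    throughput_per_watt = (1 + THROUGHPUT_GAIN) / (1 - POWER_CUT)
    return round(switches), round(power_kw, 1), round(throughput_per_watt, 2)

if __name__ == "__main__":
    print(apply_claims(1000, 500.0))  # (310, 300.0, 2.22)
```

Under these assumptions, a hypothetical 1,000-switch fabric drawing 500 kW would shrink to about 310 switches and 300 kW, with roughly 2.2 times the throughput per watt.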

On the control-plane side, AWS developed a routing scheme called Spraypoint. Instead of always following a strict shortest path from source to destination, Spraypoint first “sprays” traffic randomly to neighbors, then directs it via preselected “waypoints” using more conventional shortest-path routing. This hybrid behavior exploits the quasi-random topology to open many more independent paths than standard ECMP-style shortest-path routing would, which in turn improves utilization and resilience under congestion or failures.
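
The spray-then-waypoint idea can be sketched in a few lines. This is one reading of the description above, not AWS's implementation: it assumes a single random first hop, a waypoint drawn from a fixed list, and plain breadth-first shortest paths for the two remaining legs. The toy graph (a ring plus random chords) and every function name are invented for the example.

```python
import random
from collections import deque

def ring_with_chords(n, chords, seed=0):
    """Toy quasi-random graph: a fixed ring plus random chord links."""
    rng = random.Random(seed)
    adj = {v: {(v - 1) % n, (v + 1) % n} for v in range(n)}
    for v in range(n):
        for u in rng.sample(range(n), chords):
            if u != v:
                adj[v].add(u)
                adj[u].add(v)
    return adj

def shortest_path(adj, src, dst):
    """Breadth-first shortest path as a node list, src and dst included."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        if v == dst:
            break
        for u in sorted(adj[v]):
            if u not in parent:
                parent[u] = v
                queue.append(u)
    path = [dst]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path[::-1]

def spray_then_waypoint(adj, src, dst, waypoints, rng):
    """Assumed behavior: random first hop, then waypoint, then destination."""
    first_hop = rng.choice(sorted(adj[src]))        # the "spray" step
    waypoint = rng.choice(waypoints)                # preselected waypoint
    leg1 = shortest_path(adj, first_hop, waypoint)  # conventional routing
    leg2 = shortest_path(adj, waypoint, dst)
    return [src] + leg1 + leg2[1:]

if __name__ == "__main__":
    adj = ring_with_chords(16, chords=2, seed=3)
    rng = random.Random(7)
    routes = {tuple(spray_then_waypoint(adj, 0, 9, [4, 12], rng))
              for _ in range(50)}
    print(len(routes), "distinct routes from 0 to 9")
```

Even on a 16-node toy graph, repeated calls produce several distinct routes between the same pair of nodes, where a single shortest-path lookup would return only one. The sketch makes no attempt to avoid loops back through the source, which a real protocol would presumably need to handle.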

Strategic implications:

For AWS’s cloud and AI build-out, this is positioned as a foundational infrastructure advantage: higher bisection bandwidth and lower network energy per bit directly benefit large-scale AI training clusters, storage backends, and multi-tenant cloud workloads. Fewer active devices and more passive optics also translate into lower capex and opex at hyperscale, so AWS is framing this as both a performance and cost/sustainability play that could save billions of dollars and reduce CO₂ emissions over time.

From a networking-theory standpoint, this is notable as one of the first reported at-scale, production deployments of a flat random-graph-inspired topology in a hyperscale DCN, rather than a purely academic or lab system.

In a quasi-random topology like AWS’s RNG fabric, the impact on latency and jitter comes from three main effects: path length distribution, load spreading, and failure behavior.

Baseline latency: path lengths and device count:

In a traditional Clos/fat-tree, average latency is dominated by a fixed number of stages (ToR → agg → core → agg → ToR), so hop count is tightly controlled but you pay for many active devices. A quasi-random, flat graph replaces that rigid hierarchy with many short, irregular paths; on average, shortest paths between any two switches are similar or slightly shorter in hop count than in a fat-tree, and there are fewer active routers in the path because the architecture offloads fanout to passive optics. That tends to keep or slightly reduce median/mean latency per flow, especially under moderate load, because packets traverse fewer serialized queueing points even if the physical graph looks “messier.”

Jitter: congestion and path diversity:

Jitter is driven much more by variable queueing delay than by fixed propagation or serialization. In a quasi-random fabric with many alternate paths and a load-balancing scheme like Spraypoint (random spray + waypoint-based shortest paths), flows can be spread more evenly across the network, reducing hot spots and thus reducing the variance of queueing delay across packets. That can lower jitter compared with a Clos under the same aggregate load, because the system is less likely to funnel many flows through the same few congested uplinks or spine devices.

However, because the routing intentionally uses many different paths, per-flow packet reordering becomes more likely unless constrained by per-flow hashing or waypointing, which can show up as effective jitter at higher layers. AWS’s description of Spraypoint suggests they mitigate this by using waypoints and policy to preserve some path structure, so you get the diversity benefits without unconstrained per-packet spraying.
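
A common way to get path diversity across flows while keeping each flow's packets in order is to pick the path from a hash of the flow identifier. The sketch below shows that general technique, not AWS's documented mechanism. The five-tuple layout, the waypoint names, and the function pick_waypoint are assumptions made for illustration.

```python
import hashlib

def pick_waypoint(flow, waypoints):
    """Map a flow tuple to one waypoint, identically for every packet.

    flow is assumed to be (src_ip, dst_ip, protocol, src_port, dst_port).
    SHA-256 is used only because it is stable across runs and machines.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(waypoints)
    return waypoints[index]

if __name__ == "__main__":
    waypoints = ["wp-a", "wp-b", "wp-c", "wp-d"]  # hypothetical names
    flow = ("10.0.0.1", "10.0.1.9", 6, 40000, 443)
    # Every packet of this flow maps to the same waypoint.
    print(pick_waypoint(flow, waypoints))
    # Different flows spread across the waypoint set.
    spread = {pick_waypoint(("10.0.0.1", "10.0.1.9", 6, port, 443), waypoints)
              for port in range(40000, 41000)}
    print(sorted(spread))
```

Because the choice depends only on the flow tuple, packets of one flow follow one waypoint and arrive in order, while the population of flows is still spread over all waypoints.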

Under failure and high load:

Where a quasi-random design really helps latency and jitter is under failure and partial congestion. In a Clos, link or spine failures can force large sets of flows to converge on a smaller subset of remaining equal-cost paths, driving up queueing delay and jitter nonlinearly. In a resilient random-graph-style fabric, node or edge failures simply remove a few edges from a highly connected graph; there are typically many alternative short paths, so the increase in hop count and queueing pressure is smaller and more diffuse. That tends to keep tail latency and jitter (P99, P99.9) better behaved, even if median latency looks similar to a Clos at low load.
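
The graceful-degradation argument can be probed with a small experiment: remove a handful of random links from a well-connected graph and compare the mean hop count before and after. The sketch below is a toy model under stated assumptions, not a measurement of RNG or of a Clos. The ring-plus-chords graph and all function names are invented for the example, and it looks only at hop counts, not at queueing.

```python
import random
from collections import deque

def ring_with_chords(n, chords, seed=0):
    """Toy quasi-random graph: a fixed ring plus random chord links."""
    rng = random.Random(seed)
    adj = {v: {(v - 1) % n, (v + 1) % n} for v in range(n)}
    for v in range(n):
        for u in rng.sample(range(n), chords):
            if u != v:
                adj[v].add(u)
                adj[u].add(v)
    return adj

def hop_counts(adj, src):
    """Breadth-first search: hop count from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist

def mean_hops(adj):
    """Mean shortest-path length, or infinity if the graph is partitioned."""
    n = len(adj)
    total = 0
    for src in adj:
        dist = hop_counts(adj, src)
        if len(dist) < n:
            return float("inf")
        total += sum(dist.values())
    return total / (n * (n - 1))

def fail_links(adj, count, rng):
    """Return a copy of the graph with `count` random links removed."""
    edges = sorted({(min(a, b), max(a, b)) for a in adj for b in adj[a]})
    damaged = {v: set(nb) for v, nb in adj.items()}
    for a, b in rng.sample(edges, count):
        damaged[a].discard(b)
        damaged[b].discard(a)
    return damaged

if __name__ == "__main__":
    adj = ring_with_chords(64, chords=3, seed=1)
    before = mean_hops(adj)
    after = mean_hops(fail_links(adj, 10, random.Random(2)))
    print(f"mean hops before: {before:.2f}, after 10 link failures: {after:.2f}")
```

Removing links can never shorten a shortest path, so the mean can only stay the same or rise. The interesting question is by how much, and trying different failure counts and seeds gives a feel for how diffuse the damage is in a graph with many alternative routes.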

So, qualitatively: median latency is roughly comparable to a well-designed Clos, sometimes better due to fewer active stages; jitter and tail latency should improve under realistic, bursty load and failure scenarios, provided the routing stack is designed to limit packet reordering.

Summary and Conclusions:

Quasi-random data center topologies like AWS’s RNG fabric replace rigid Clos/fat-tree hierarchies with a flatter, graph-like network that preserves short path lengths while dramatically increasing path diversity. This tends to hold median latency roughly steady or slightly better by reducing the number of active, queueing devices per path and offloading fanout to passive optics. The main improvement is in jitter and tail latency: spreading flows across many alternative routes makes congestion less concentrated, so queueing delays are less bursty and P99/P99.9 behavior stays more stable under failures and hot spots. This holds provided the routing layer (for example, AWS’s Spraypoint approach) constrains packet reordering through waypointing or per-flow consistency.

In conclusion, quasi-random fabrics are less about shaving a few microseconds off baseline latency and more about delivering more predictable end-to-end performance—especially for east–west, latency-sensitive cloud and AI workloads—by trading rigid structure for statistically robust, highly connected graphs that degrade more gracefully when links, nodes, or traffic patterns become pathological.

…………………………………………………………………………………………………………………………………………………………………….

References:

https://arxiv.org/pdf/2604.15261

https://www.wired.com/story/amazon-thinks-the-future-of-data-centers-depends-on-a-technical-problem-it-just-solved/

https://www.wired.com/story/amazon-aws-ceo-matt-garman-ai-agents/

2 thoughts on “Inside Amazon’s new data center network architecture: quasi random network topology and passive optical devices”

IEEE Member says:

June 1, 2026 at 18:18

According to their research paper, “RNG: Flat Datacenter Networks at Scale,” the infrastructure relies on three core innovations:

1. ShuffleBox Hardware: A custom, passive optical device developed by AWS. It standardizes how switches connect and automatically permutes the fiber links internally. When multiple ShuffleBoxes are connected, they create a global quasi-random topology without requiring custom manual cabling.

2. Spraypoint Routing Protocol: Traditional routing cannot handle flat, random topologies because commodity switches lack the necessary memory. AWS developed a Layer 3 routing method called Spraypoint, which “sprays” data packets randomly to neighboring routers first, then directs them via preselected “waypoints” using standard shortest-path routing. This opens nearly double the independent paths of traditional networks to bypass congestion.

3. High-Density Fiber Connectors: Custom connectors designed to handle massive optical fanouts efficiently across the data center floor.

Quantifiable Operational Gains: By flattening the network and removing traditional hierarchical layers, AWS reports massive efficiency leaps compared to traditional DCN designs.

–>It’s amazing that hyperscalers like AWS and Google design and build every aspect of their IT infrastructure- from chips, to circuit boards, racks, enclosures, cabling/connectors, etc!

Eula Griffin says:

July 14, 2026 at 04:48

This was a fascinating read. I had no idea how much thought goes into data center network design, and the explanation of the quasi-random topology and ShuffleBox made a complex topic much easier to understand. It’s impressive to see how Amazon’s different approach to data center networking can improve performance while also reducing hardware and power consumption.

IEEE ComSoc Technology Blog
