Networking chips and modules for AI data centers: InfiniBand, Ultra Ethernet, Optical Connections

A growing portion of the billions of dollars being spent on AI data centers will go to the suppliers of networking chips, lasers, and switches that integrate thousands of GPUs and conventional microprocessors into a single AI computer cluster. AI can’t advance without advanced networks, says Nvidia’s networking chief Gilad Shainer: “The network is the most important element because it determines the way the data center will behave.”

Networking chips now account for just 5% to 10% of all AI chip spending, said Broadcom CEO Hock Tan. As AI server clusters grow to 500,000 or even a million processors, Tan expects networking to become 15% to 20% of a data center’s chip budget. A data center with a million or more processors will cost $100 billion to build.

The firms building the biggest AI clusters are the hyperscalers, led by Alphabet’s Google, Amazon.com, Facebook parent Meta Platforms, and Microsoft. Not far behind are Oracle, xAI, Alibaba Group Holding, and ByteDance. Earlier this month, Bloomberg reported that capex for the four leading hyperscalers would exceed $200 billion this year, a year-over-year increase of as much as 50%. Goldman Sachs estimates that AI data center spending will rise another 35% to 40% in 2025. Morgan Stanley expects Amazon and Microsoft to lead the pack with $96.4 billion and $89.9 billion of capex respectively, while Google and Meta will follow at $62.6 billion and $52.3 billion.

AI compute server architectures began scaling in recent years for two reasons:

1. High-end processor chips from Intel neared the end of the speed gains made possible by shrinking a chip’s transistors.

2. Computer scientists at companies such as Google and OpenAI built AI models that performed amazing feats by finding connections within large volumes of training material.

As the parameters of these “Large Language Models” (LLMs) grew to millions, billions, and then trillions, the models began translating languages, doing college homework, handling customer support, and designing cancer drugs. But training an LLM is a huge task: it calculates across billions of data points, rolls those results into new calculations, and then repeats. Even with Nvidia accelerator chips to speed up those calculations, the workload has to be distributed across thousands of Nvidia processors and run for weeks.
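To make the communication pattern concrete, here is a minimal sketch of one data-parallel training step using PyTorch’s collective API. The model, optimizer, and input names are placeholders, and real LLM training layers pipeline and tensor parallelism on top of this, but the core point holds: every step ends with gradients being exchanged across every GPU, which puts the back-end network on the critical path of every iteration.

```python
# Simplified data-parallel training step (illustrative sketch, not production code).
import torch.distributed as dist

def train_step(model, optimizer, inputs, world_size):
    loss = model(inputs)                  # assume the model returns a scalar loss
    loss.backward()                       # compute gradients locally on this GPU
    for p in model.parameters():
        if p.grad is not None:
            # Sum this gradient tensor across all GPUs over the back-end network,
            # then average it; this is the traffic InfiniBand or Ethernet must carry.
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
    optimizer.step()                      # every GPU applies the same weight update
    optimizer.zero_grad()
```

In practice, frameworks overlap these all-reduces with the backward pass, but the total volume of gradient traffic is unchanged, so interconnect bandwidth directly sets how long training takes.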

To keep up with the distributed computing challenge, AI data centers all have two networks:

  1. The “front-end” network, which sends data to and receives data from external users, like the networks of every enterprise data center or cloud-computing center. It sits at the data center’s outward-facing boundary and typically includes high-end routers, web servers, DNS servers, application servers, load balancers, firewalls, and other devices that connect to the public internet, IP-MPLS VPNs, and private lines.
  2. A “back-end” network that connects every AI processor (GPUs and conventional MPUs) and memory chip with every other processor within the AI data center. “It’s just a supercomputer made of many small processors,” says Ram Velaga, Broadcom’s chief of core switching silicon. “All of these processors have to talk to each other as if they are directly connected.” AI’s back-end networks need high-bandwidth switches and network connections. Delays and congestion are expensive when each Nvidia compute node costs as much as $400,000; idle processors waste money. Back-end networks also carry huge volumes of data. When thousands of processors are exchanging results, the data crossing one of these networks in a second can equal all of the internet traffic in America, as the back-of-the-envelope sketch below illustrates.
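
A rough back-of-the-envelope estimate shows why those volumes get so large. The parameter count, GPU count, and data type below are illustrative assumptions rather than figures from the article, and the sketch assumes plain data parallelism with a ring all-reduce:

```python
# Estimate back-end traffic for one gradient all-reduce (illustrative assumptions).
def allreduce_bytes_per_gpu(num_params: float, num_gpus: int,
                            bytes_per_param: int = 2) -> float:
    """A ring all-reduce moves roughly 2*(N-1)/N times the gradient size per GPU."""
    return 2 * (num_gpus - 1) / num_gpus * num_params * bytes_per_param

num_gpus = 100_000                                  # assumed cluster size
per_gpu = allreduce_bytes_per_gpu(1e12, num_gpus)   # assumed 1-trillion-parameter model, fp16
print(f"{per_gpu / 1e12:.1f} TB per GPU per step")                  # ~4.0 TB
print(f"{per_gpu * num_gpus / 1e15:.0f} PB cluster-wide per step")  # ~400 PB
```

Even spread over a multi-second training step, that is an enormous amount of data on the wire, which is why congestion control and keeping processors busy dominate back-end network design.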

Nvidia became one of today’s largest vendors of network gear through its $6.9 billion acquisition of Israel-based Mellanox in 2020. CEO Jensen Huang and his colleagues realized early on that AI workloads would outgrow a single box, so they adopted InfiniBand, a networking technology designed for scientific supercomputers and supplied by Mellanox. InfiniBand became the de facto standard for AI back-end networks.

While most AI dollars still go to Nvidia GPU accelerator chips, back-end networks are important enough that Nvidia now has a substantial networking business. In the September quarter, those network sales grew 20%, to $3.1 billion. However, Ethernet is now challenging InfiniBand’s lock on AI networks. Fortunately for Nvidia, its Mellanox subsidiary also makes high-speed Ethernet hardware modules; xAI, for example, uses Nvidia Ethernet products in its record-size Colossus system.

While current versions of Ethernet lack InfiniBand’s tools for memory and traffic management, those capabilities are now being added in a version called Ultra Ethernet [Note 1]. Many hyperscalers think Ethernet will outperform InfiniBand as clusters scale to hundreds of thousands of processors. Another attraction is that Ethernet has many competing suppliers. “All the largest guys—with an exception of Microsoft—have moved over to Ethernet,” says an anonymous network industry executive. “And even Microsoft has said that by summer of next year, they’ll move over to Ethernet, too.”

Note 1. Primary goals and mission of the Ultra Ethernet Consortium (UEC): deliver a complete architecture that optimizes Ethernet for high-performance AI and HPC networking, exceeding the performance of today’s specialized technologies. The UEC focuses on functionality, performance, TCO, and developer and end-user friendliness, while limiting changes to those that are strictly required and maintaining Ethernet interoperability. Additional goals include improved bandwidth, latency, tail latency, and scale to match tomorrow’s workloads and compute architectures; backward compatibility with widely deployed APIs; and definition of new APIs that are better optimized for future workloads and compute architectures.

……………………………………………………………………………………………………………………………………………………………………………………………………………………………….

Ethernet back-end networks offer a big opportunity for Arista Networks, which builds switches using Broadcom chips. In the past two years, AI data centers have become an important business for Arista. AI also generates sales for Arista’s switch rivals Cisco and Juniper Networks (soon to be part of Hewlett Packard Enterprise), but those companies aren’t as established among hyperscalers. Analysts expect Arista to book more than $1 billion in AI sales next year and predict that the total market for back-end switches could reach $15 billion within a few years. Three of the five big hyperscale operators are using Arista Ethernet switches in their back-end networks, and the other two are testing them. Arista CEO Jayshree Ullal (a former SCU EECS graduate student of this author, an ex-adjunct professor) says that back-end network sales seem to pull along more orders for front-end gear, too.

The network chips used for AI switching are feats of engineering that rival AI processor chips. Cisco makes its own custom Ethernet switching chips, but some 80% of the chips used in other Ethernet switches come from Broadcom, with the rest supplied mainly by Marvell. These switch chips now move 51 terabits of data a second, as much data as a person would consume by watching video for 200 days straight. Next year, switching speeds will double.
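As a quick sanity check on that comparison, 51 terabits per second works out to roughly 200 days of streaming video if one assumes an HD stream of about 3 megabits per second (an illustrative bitrate, not a figure from the article):

```python
# Check the "200 days of video" analogy, assuming ~3 Mbit/s HD streaming.
switch_bps = 51e12                       # bits per second through one switch chip
video_bps = 3e6                          # assumed HD streaming bitrate
days_of_video = switch_bps / video_bps / 86_400
print(f"{days_of_video:.0f} days of video per second of switching")  # ~197 days
```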

The other important parts of the network are the interconnects between computing nodes and the cables that carry them. As the processor count rises, the number of connections increases at a faster rate: a 25,000-processor cluster needs 75,000 interconnects, while a million processors will need 10 million interconnects. More of those connections will be fiber optic instead of copper or coax, because copper’s reach shrinks as networks speed up. So expanding clusters have to “scale out” by linking their racks with optics. “Once you move beyond a few tens of thousand, or 100,000, processors, you cannot connect anything with copper—you have to connect them with optics,” Velaga says.
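The super-linear growth in interconnects follows from the topology. In a non-blocking folded-Clos (“fat-tree”) network, the total number of links scales roughly as the number of switching tiers times the number of endpoints, and bigger clusters need more tiers. The sketch below assumes 64-port switches and a single network plane; real deployments with multiple rails or planes per GPU need even more links, which is how a million processors can end up needing on the order of 10 million interconnects.

```python
# Rough fat-tree sizing: links ~ tiers x endpoints, and tiers grow with cluster size.
# Radix 64 and a single plane are illustrative assumptions.
def fat_tree_links(num_gpus: int, radix: int = 64) -> int:
    tiers = 2
    while 2 * (radix // 2) ** tiers < num_gpus:   # max endpoints of an n-tier fat-tree
        tiers += 1
    return tiers * num_gpus                       # roughly one link per endpoint per tier

print(fat_tree_links(25_000))      # 75,000 links (3 tiers)
print(fat_tree_links(1_000_000))   # 4,000,000 links (4 tiers, single plane)
```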

AI processing chips (GPUs) exchange data at about 10 times the rate of a general-purpose processor chip. Copper has been the preferred conduit because it’s reliable and requires no extra power. At current network speeds, copper works well at lengths of up to five meters, so hyperscalers have tried to “scale up” within copper’s reach by packing as many processors as they can into each shelf and each rack of shelves.

Back-end connections now run at 400 gigabits per second, equal to about a day and a half of video viewing. Broadcom’s Velaga says network speeds will rise to 800 gigabits in 2025 and 1.6 terabits in 2026.

Nvidia, Broadcom, and Marvell sell optical interface products, with Marvell enjoying a strong lead in 800-gigabit interconnects. A number of companies supply lasers for optical interconnects, including Coherent, Lumentum Holdings, Applied Optoelectronics, and Chinese vendors Innolight and Eoptolink. They will all battle for the AI data center over the next few years.

A 500,000-processor cluster needs at least 750 megawatts, enough to power 500,000 homes. When AI models scale to a million or more processors, they will require gigawatts of power and have to span more than one physical data center, says Velaga.
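That power figure is consistent with a simple estimate, assuming roughly 1.5 kilowatts per accelerator node (including cooling and networking overhead) and about 1.5 kilowatts of average draw per home; both per-unit numbers are assumptions for illustration, not vendor or utility data:

```python
# Back-of-the-envelope check of the 750 MW / 500,000-home comparison.
gpus = 500_000
kw_per_gpu_node = 1.5     # assumed draw per accelerator node, cooling and network included
avg_home_kw = 1.5         # assumed average household draw
cluster_mw = gpus * kw_per_gpu_node / 1_000
print(f"{cluster_mw:.0f} MW")                            # 750 MW
print(f"{cluster_mw * 1_000 / avg_home_kw:,.0f} homes")  # 500,000 homes
```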

The opportunity for optical connections reaches beyond the AI data center, because no single site can obtain enough electrical power for the largest clusters. In September, Marvell, Lumentum, and Coherent demonstrated optical links for data centers as far apart as 300 miles. Nvidia’s next-generation networks will be ready to run a single AI workload across remote locations.

Some worry that AI performance will stop improving as processor counts scale. Nvidia’s Jensen Huang dismissed those concerns on the company’s latest earnings call, saying that clusters of 100,000 processors or more will just be table stakes with Nvidia’s next generation of chips. Broadcom’s Velaga says he is grateful: “Jensen (Nvidia CEO) has created this massive opportunity for all of us.”

References:

https://www.barrons.com/articles/ai-networking-nvidia-cisco-broadcom-arista-bce88c76?mod=hp_WIND_B_1_1  (PAYWALL)

https://www.msn.com/en-us/news/technology/networking-companies-ride-the-ai-wave-it-isn-t-just-nvidia/ar-AA1wJXGa?ocid=BingNewsSerp

https://www.datacenterdynamics.com/en/news/morgan-stanley-hyperscaler-capex-to-reach-300bn-in-2025/

https://ultraethernet.org/ultra-ethernet-specification-update/

Will AI clusters be interconnected via Infiniband or Ethernet: NVIDIA doesn’t care, but Broadcom sure does!

Will billions of dollars big tech is spending on Gen AI data centers produce a decent ROI?

Canalys & Gartner: AI investments drive growth in cloud infrastructure spending

AI Echo Chamber: “Upstream AI” companies huge spending fuels profit growth for “Downstream AI” firms

AI wave stimulates big tech spending and strong profits, but for how long?

Markets and Markets: Global AI in Networks market worth $10.9 billion in 2024; projected to reach $46.8 billion by 2029

Using a distributed synchronized fabric for parallel computing workloads- Part I

 

Using a distributed synchronized fabric for parallel computing workloads- Part II

One thought on “Networking chips and modules for AI data centers: InfiniBand, Ultra Ethernet, Optical Connections”

  1. While the Ultra Ethernet Consortium (UEC) is collaborating with the IEEE 802.3 working group (WG) and has a liaison relationship with them, the “Ultra Ethernet” standard is NOT currently being standardized by the IEEE 802.3 WG. Instead, the UEC is developing its own specifications focused on optimizing Ethernet for high-performance AI and HPC workloads, building upon existing IEEE 802.3 standards at the physical layer.
    Current IEEE 802.3 standards address high-speed Ethernet options, but the UEC’s focus on “Ultra Ethernet” indicates a potential need for even higher data rates in the future, which could be addressed by future revisions of the 802.3 standard.
