Superclusters of Nvidia GPU/AI chips combined with end-to-end network platforms to create next generation data centers

Posted on November 25, 2024 by Alan Weissberger

Meta Platforms and Elon Musk’s xAI start-up are among companies building clusters of computer servers with as many as 100,000 of Nvidia’s most advanced GPU chips as the race for artificial-intelligence (AI) supremacy accelerates.

Meta Chief Executive Mark Zuckerberg said last month that his company was already training its most advanced AI models with a conglomeration of chips he called “bigger than anything I’ve seen reported for what others are doing.”
xAI built a supercomputer called Colossus—with 100,000 of Nvidia’s Hopper GPU/AI chips—in Memphis, TN in a matter of months.
OpenAI and Microsoft have been working to build up significant new computing facilities for AI. Google is building massive data centers to house chips that drive its AI strategy.

xAI built a supercomputer in Memphis that it calls Colossus, with 100,000 Nvidia AI chips. Photo: Karen Pulfer Focht/Reuters

A year ago, clusters of tens of thousands of GPU chips were seen as very large. OpenAI used around 10,000 of Nvidia’s chips to train the version of ChatGPT it launched in late 2022, UBS analysts estimate. Installing many GPUs in one location, linked together by superfast networking equipment and cables, has so far produced larger AI models at faster rates. But there are questions about whether ever-bigger super clusters will continue to translate into smarter chatbots and more convincing image-generation tools.

Nvidia Chief Executive Jensen Huang said that while the biggest clusters for training giant AI models now top out at around 100,000 of Nvidia’s current chips, “the next generation starts at around 100,000 Blackwells. And so that gives you a sense of where the industry is moving. Do we think that we need millions of GPUs? No doubt. That is a certainty now. And the question is how do we architect it from a data center perspective.”

“There is no evidence that this will scale to a million chips and a $100 billion system, but there is the observation that they have scaled extremely well all the way from just dozens of chips to 100,000,” said Dylan Patel, the chief analyst at SemiAnalysis, a market research firm.

Giant super clusters are already getting built. Musk posted last month on his social-media platform X that his 100,000-chip Colossus super cluster was “soon to become” a 200,000-chip cluster in a single building. He also posted in June that the next step would probably be a 300,000-chip cluster of Nvidia’s newest GPU chips next summer. The rise of super clusters comes as their operators prepare for Nvidia’s next-generation Blackwell chips, which are set to start shipping in the next couple of months. Blackwell chips are estimated to cost around $30,000 each, meaning a cluster of 100,000 would cost $3 billion, not counting the price of the power-generation infrastructure and IT equipment around the chips.
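
The chip-cost arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name and the list of cluster sizes are hypothetical, the $30,000 figure is the per-chip estimate quoted above, and applying that one price to every cluster size is an assumption. As noted, the totals exclude power-generation infrastructure and surrounding IT equipment.

```python
def cluster_chip_cost(num_chips: int, price_per_chip_usd: int) -> int:
    """Return the chip-only cost of a cluster in US dollars."""
    return num_chips * price_per_chip_usd


# Estimated Blackwell price per chip, as quoted in the article (an estimate, not a list price).
BLACKWELL_PRICE_USD = 30_000

# Cluster sizes mentioned in the article; using the Blackwell estimate for all of them
# is an assumption made purely for illustration.
for chips in (100_000, 200_000, 300_000, 1_000_000):
    cost = cluster_chip_cost(chips, BLACKWELL_PRICE_USD)
    print(f"{chips:>9,} chips: about ${cost / 1e9:.0f} billion in chips alone")
```

At the estimated price, chips alone would account for roughly $30 billion of a million-chip system, which suggests how much of a much larger system budget would go to power, networking and facilities.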

Those dollar figures make building up super clusters with ever more chips something of a gamble, industry insiders say, given that it isn’t clear that they will improve AI models to a degree that justifies their cost. Indeed, new engineering challenges also often arise with larger clusters:

Meta researchers said in a July paper that a cluster of more than 16,000 of Nvidia’s GPUs suffered from unexpected failures of chips and other components routinely as the company trained an advanced version of its Llama model over 54 days.
Keeping Nvidia’s chips cool is a major challenge as clusters of power-hungry chips become packed more closely together, industry executives say, part of the reason there is a shift toward liquid cooling where refrigerant is piped directly to chips to keep them from overheating.
The sheer size of the super clusters requires a stepped-up level of management of those chips when they fail. Mark Adams, chief executive of Penguin Solutions, a company that helps set up and operate computing infrastructure, said elevated complexity in running large clusters of chips inevitably throws up problems.

The continuation of the AI boom for Nvidia largely depends on how the largest clusters of GPU chips deliver a return on investment for its customers. The trend also fosters demand for Nvidia’s networking equipment, which is fast becoming a significant business. Nvidia’s networking equipment revenue in 2024 was $3.13 billion, which was a 51.8% increase from the previous year. Nvidia offers these networking platforms, most of which stem from its Mellanox acquisition:

Accelerated Ethernet Switching for AI and the Cloud

Quantum InfiniBand for AI and Scientific Computing

BlueField® Network Accelerators

………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………..

Nvidia forecasts total fiscal fourth-quarter sales of about $37.5bn, up 70%. That was above average analyst projections of $37.1bn, compiled by Bloomberg, but below some projections that were as high as $41bn. “Demand for Hopper and anticipation for Blackwell – in full production – are incredible as foundation model makers scale pretraining, post-training and inference,” Huang said. “Both Hopper and Blackwell systems have certain supply constraints, and the demand for Blackwell is expected to exceed supply for several quarters in fiscal 2026,” CFO Colette Kress said.

References:

https://www.wsj.com/tech/ai/nvidia-chips-ai-race-96d21d09?mod=tech_lead_pos5

https://www.datacenterdynamics.com/en/news/nvidias-data-center-revenue-up-112-over-last-year-as-ai-boom-continues/

https://www.nvidia.com/en-us/networking/

https://nvidianews.nvidia.com/news/nvidia-announces-financial-results-for-third-quarter-fiscal-2025

12 thoughts on “Superclusters of Nvidia GPU/AI chips combined with end-to-end network platforms to create next generation data centers”

Anonymous says:

November 25, 2024 at 19:39

Nvidia on Monday showed a new artificial intelligence model for generating music and audio that can modify voices and generate novel sounds – technology aimed at the producers of music, films and video games.
Nvidia, the world’s biggest supplier of chips and software used to create AI systems, said it does not have immediate plans to publicly release the technology, which it calls Fugatto, short for Foundational Generative Audio Transformer Opus 1.

It joins other technologies shown by startups such as Runway and larger players such as Meta Platforms that can generate audio or video from a text prompt. The version from Santa Clara, Calif.-based Nvidia generates sound effects and music from a text description, including novel sounds such as making a trumpet bark like a dog. What makes it different from other AI technologies is its ability to take in and modify existing audio, for example by taking a line played on a piano and transforming it into a line sung by a human voice, or by taking a spoken word recording and changing the accent used and the mood expressed.

“If we think about synthetic audio over the past 50 years, music sounds different now because of computers, because of synthesizers,” said Bryan Catanzaro, vice president of applied deep learning research at Nvidia. “I think that generative AI is going to bring new capabilities to music, to video games and to ordinary folks that want to create things.”

While companies such as OpenAI are negotiating with Hollywood studios over whether and how the AI could be used in the entertainment industry, the relationship between tech and Hollywood has become tense, particularly after Hollywood star Scarlett Johansson accused OpenAI of imitating her voice.

Nvidia’s new model was trained on open-source data, and the company said it is still debating whether and how to release it publicly.

“Any generative technology always carries some risks, because people might use that to generate things that we would prefer they don’t,” Catanzaro said. “We need to be careful about that, which is why we don’t have immediate plans to release this.”

OpenAI and Meta similarly have not said when they plan to release to the public their models that generate audio or video. Creators of generative AI models have yet to determine how to prevent abuse of the technology such as a user generating misinformation or infringing on copyrights by generating copyrighted characters.

https://www.reuters.com/technology/artificial-intelligence/nvidia-shows-ai-model-that-can-modify-voices-generate-novel-sounds-2024-11-25/
Network Cabling says:

November 27, 2024 at 18:25

The article highlights the rapid advancement in AI infrastructure, with companies like Meta, xAI, and OpenAI building massive superclusters of Nvidia GPUs to accelerate AI model training. These superclusters, sometimes numbering in the hundreds of thousands of GPUs, present both tremendous opportunities for AI innovation and significant challenges, including system management and power requirements. The piece underscores the importance of these superclusters in scaling AI models while also acknowledging the engineering hurdles and the high costs involved. The future of AI depends on how effectively these challenges are managed.
1. Alan Weissberger says:
  
  November 27, 2024 at 19:25
  
  Thanks for your cogent comment. 100% agree that “the future of AI depends on how effectively these challenges are managed.”
Maurizio Dècina says:

November 28, 2024 at 05:18

I would like to underline that at the heart of the Spectrum-X platform is the Spectrum-X Ethernet switching device for 100,000 ports, which supports port speeds of up to 800 Gbit/s.
Spectrum-X Ethernet networking for AI delivers high bandwidth with low latency. Features include adaptive routing with NVIDIA Direct Data Placement technology, congestion control, as well as enhanced AI fabric visibility and performance isolation.
A big step forward for the next generation of AI data centers.
1. Alan Weissberger says:
  
  November 28, 2024 at 21:45
  
  Many thanks for your incisive comment, Maurizio. I really miss seeing you and will try to visit you in Milano in Spring 2025!
Maurizio Decina says:

December 3, 2024 at 06:39

Miss you too Alan, hope to meet you again soon!
Network Cabling says:

December 11, 2024 at 18:51

This article delves into the exciting world of superclusters powered by Nvidia GPUs, highlighting the race for AI supremacy. It’s fascinating to see companies like Meta and xAI investing heavily in these massive computing systems, pushing the boundaries of AI development. However, the article also acknowledges the engineering challenges and costs associated with these superclusters, making it a thought-provoking read about the future of AI infrastructure.
Anonymous says:

December 24, 2024 at 18:15

Elon Musk’s generative AI startup xAI has raised $6 billion in a Series C funding round. Investors include A16Z, Blackrock, Fidelity Management & Research Company, Kingdom Holdings, Lightspeed, MGX, Morgan Stanley, OIA, QIA, Sequoia Capital, Valor Equity Partners and Vy Capital, amongst others. GPU companies Nvidia and AMD also participated in the round.

The company last raised $6bn in May, and this summer launched its supercomputer in Memphis with up to 100,000 Nvidia H100 GPUs. Earlier this month, the Memphis Chamber of Commerce claimed that Musk eventually planned to expand the Colossus supercomputer to some one million GPUs.

The company pitches itself as an alternative to OpenAI, which Musk is currently suing for deviating from its non-profit roots. Before he broke away from OpenAI, Musk had pitched making the company a division of for-profit Tesla.

Musk has called OpenAI’s ChatGPT too “woke” and “politically correct,” and claimed that xAI’s Grok is “maximally truth-seeking.”

The company was founded in March 2023 and has grown rapidly since, with its products integrated deeply into Musk’s Twitter/X, and trained on the social media platform’s user-generated data.

https://www.datacenterdynamics.com/en/news/xai-raises-another-6bn-including-from-nvidia-and-amd/
Alan Weissberger says:

March 13, 2025 at 19:29

Nvidia could succeed where open RAN has mostly failed. In the early days of the O-RAN Alliance, the technology was heralded as a way for operators to break big vendors’ lock on expensive RAN components. But today most agree that open RAN hasn’t done much to upend the global RAN order – Ericsson, Nokia, Huawei and Samsung still sit at the top of the market.

“The concept of open and interoperable interfaces will live on in some form of incarnation, but the original vision is no longer viable,” wrote Chetan Sharma, an independent analyst, on social media.

https://www.lightreading.com/ai-machine-learning/nvidia-the-vendor-that-must-not-be-named
Network Cabling says:

March 25, 2025 at 16:30

The article discusses the development of massive superclusters combining NVIDIA’s GPU/AI chips with comprehensive network platforms to create next-generation data centers. Companies like Meta Platforms and Elon Musk’s xAI are building data centers with clusters of up to 100,000 of NVIDIA’s advanced GPU chips to enhance AI capabilities. However, scaling to such large clusters presents challenges, including unexpected hardware failures and power and cooling requirements. The article highlights the need for innovative solutions to manage these complexities as AI infrastructure continues to expand.
Anonymous says:

July 10, 2025 at 19:59

Amazon Building Huge, Super Sized AI Data Centers
A year ago, a 1,200-acre stretch of farmland outside New Carlisle, Ind., was an empty cornfield. Now, seven Amazon data centers rise up from the rich soil, each larger than a football stadium.

Over the next several years, Amazon plans to build around 30 data centers at the site, packed with hundreds of thousands of specialized computer chips. With hundreds of thousands of miles of fiber connecting every chip and computer together, the entire complex will form one giant machine intended just for artificial intelligence.

The facility will consume 2.2 gigawatts of electricity — enough to power a million homes. Each year, it will use millions of gallons of water to keep the chips from overheating. And it was built with a single customer in mind: the A.I. start-up Anthropic, which aims to create an A.I. system that matches the human brain.

The complex — so large that it can be viewed completely only from high in the sky — is the first in a new generation of data centers being built by Amazon, and part of what the company calls Project Rainier, after the mountain that looms near its Seattle headquarters. Project Rainier will also include facilities in Mississippi and possibly other locations, like North Carolina and Pennsylvania.

Project Rainier is Amazon’s entry into a race by the technology industry to build data centers so large they would have been considered absurd just a few years ago. Meta, which owns Facebook, Instagram and WhatsApp, is building a two-gigawatt data center in Louisiana. OpenAI is erecting a 1.2-gigawatt facility in Texas and another, nearly as large, in the United Arab Emirates.

These data centers will dwarf most of today’s, which were built before OpenAI’s ChatGPT chatbot inspired the A.I. boom in 2022. The tech industry’s increasingly powerful A.I. technologies require massive networks of specialized computer chips — and hundreds of billions of dollars to build the data centers that house those chips. The result: behemoths that stretch the limits of the electrical grid and change the way the world thinks about computers.

Amazon, which has invested $8 billion in Anthropic, will rent computing power from the new facility to its start-up partner. An Anthropic co-founder, Tom Brown, who oversees the company’s work with Amazon on its chips and data centers, said having all that computing power in one spot could allow the start-up to train a single A.I. system.
https://www.nytimes.com/2025/06/24/technology/amazon-ai-data-centers.html?searchResultPosition=1
IEEE Member says:

August 2, 2025 at 20:42

Big Tech companies are engaged in an artificial-intelligence arms race, each building data centers at a blistering rate. All together, the four giants spent $95 billion on capex in the second quarter, and much more than that is in the pipeline.

The companies have been giving annual capex guidance. Microsoft, whose fiscal year ended this quarter, declined to issue a fiscal 2026 projection, but put out a big number—over $30 billion—for the current quarter’s investments. That would see expenses rise by 50% on the year, but Chief Financial Officer Amy Hood cautioned that the growth rate would moderate through the fiscal year.

By contrast, Meta isn’t slowing down. After raising its 2025 capex guidance last quarter and inching it up this past week, Chief Financial Officer Susan Li said on the earnings call that the company “expects to ramp our investments significantly in 2026.” This is exceptional because Meta is the only one from this group that doesn’t operate a cloud to rent out these AI servers; it’s all for its own use.

Amazon is also proceeding full steam ahead. It used almost every penny of its second-quarter operational cash flows for $31 billion of capex, and guided to around $60 billion for the second half, putting it on pace for a stunning $115 billion for the year. Amazon leads the pack here, but unlike the other AI contenders, its number is inflated by large retail investments for warehouses, vehicles, and robots.

Alphabet isn’t slowing down either, raising its 2025 capex guidance considerably. And though Apple spends much less—$3.5 billion in the quarter—that’s still 61% higher than last year.

https://www.barrons.com/articles/takeaways-big-tech-earnings-ai-68402f1a?mod=hp_WIND_A_3_2