Superclusters of Nvidia GPU/AI chips combined with end-to-end network platforms to create next generation data centers

Meta Platforms and Elon Musk’s xAI start-up are among companies building clusters of computer servers with as many as 100,000 of Nvidia’s most advanced GPU chips as the race for artificial-intelligence (AI) supremacy accelerates.

  • Meta Chief Executive Mark Zuckerberg said last month that his company was already training its most advanced AI models with a conglomeration of chips he called “bigger than anything I’ve seen reported for what others are doing.”
  • xAI built a supercomputer called Colossus—with 100,000 of Nvidia’s Hopper GPU/AI chips—in Memphis, TN in a matter of months.
  • OpenAI and Microsoft have been working to build up significant new computing facilities for AI. Google is building massive data centers to house chips that drive its AI strategy.

xAI built a supercomputer in Memphis that it calls Colossus, with 100,000 Nvidia AI chips. Photo: Karen Pulfer Focht/Reuters

A year ago, clusters of tens of thousands of GPU chips were seen as very large. OpenAI used around 10,000 of Nvidia’s chips to train the version of ChatGPT it launched in late 2022, UBS analysts estimate. Installing many GPUs in one location, linked together by superfast networking equipment and cables, has so far produced larger AI models at faster rates. But there are questions about whether ever-bigger super clusters will continue to translate into smarter chatbots and more convincing image-generation tools.

Nvidia Chief Executive Jensen Huang  said that while the biggest clusters for training for giant AI models now top out at around 100,000 of Nvidia’s current chips, “the next generation starts at around 100,000 Blackwells. And so that gives you a sense of where the industry is moving. Do we think that we need millions of GPUs? No doubt. That is a certainty now. And the question is how do we architect it from a data center perspective,” Huang added.

“There is no evidence that this will scale to a million chips and a $100 billion system, but there is the observation that they have scaled extremely well all the way from just dozens of chips to 100,000,” said Dylan Patel, the chief analyst at SemiAnalysis, a market research firm.

Giant super clusters are already getting built. Musk posted last month on his social-media platform X that his 100,000-chip Colossus super cluster was “soon to become” a 200,000-chip cluster in a single building. He also posted in June that the next step would probably be a 300,000-chip cluster of Nvidia’s newest GPU chips next summer.  The rise of super clusters comes as their operators prepare for Nvidia’s nexgen Blackwell chips, which are set to start shipping out in the next couple of months. Blackwell chips are estimated to cost around $30,000 each, meaning a cluster of 100,000 would cost $3 billion, not counting the price of the power-generation infrastructure and IT equipment around the chips.

Those dollar figures make building up super clusters with ever more chips something of a gamble, industry insiders say, given that it isn’t clear that they will improve AI models to a degree that justifies their cost. Indeed, new engineering challenges also often arise with larger clusters:

  • Meta researchers said in a July paper that a cluster of more than 16,000 of Nvidia’s GPUs suffered from unexpected failures of chips and other components routinely as the company trained an advanced version of its Llama model over 54 days.
  • Keeping Nvidia’s chips cool is a major challenge as clusters of power-hungry chips become packed more closely together, industry executives say, part of the reason there is a shift toward liquid cooling where refrigerant is piped directly to chips to keep them from overheating.
  • The sheer size of the super clusters requires a stepped-up level of management of those chips when they fail. Mark Adams, chief executive of Penguin Solutions, a company that helps set up and operate computing infrastructure, said elevated complexity in running large clusters of chips inevitably throws up problems.

The continuation of the AI boom for Nvidia largely depends on how the largest clusters of GPU chips deliver a return on investment for its customers. The trend also fosters demand for Nvidia’s networking equipment, which is fast becoming a significant business. Nvidia’s networking equipment revenue in 2024 was $3.13 billion, which was a 51.8% increase from the previous year.  Mostly from its Mellanox acquisition, Nvidia offers these networking platforms:

  • Accelerated Ethernet Switching for AI and the Cloud

  • Quantum InfiniBand for AI and Scientific Computing

  • Bluefield® Network Accelerators

………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………..

Nvidia forecasts total fiscal fourth-quarter sales of about $37.5bn, up 70%. That was above average analyst projections of $37.1bn, compiled by Bloomberg, but below some projections that were as high as $41bn. “Demand for Hopper and anticipation for Blackwell – in full production – are incredible as foundation model makers scale pretraining, post-training and inference, Huang said.  “Both Hopper and Blackwell systems have certain supply constraints, and the demand for Blackwell is expected to exceed supply for several quarters in fiscal 2026,” CFO Colette Kress said.

References:

https://www.wsj.com/tech/ai/nvidia-chips-ai-race-96d21d09?mod=tech_lead_pos5

https://www.datacenterdynamics.com/en/news/nvidias-data-center-revenue-up-112-over-last-year-as-ai-boom-continues/

https://www.nvidia.com/en-us/networking/

https://nvidianews.nvidia.com/news/nvidia-announces-financial-results-for-third-quarter-fiscal-2025

Leave a Reply

Your email address will not be published.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

*