Superclusters of Nvidia GPU/AI chips combined with end-to-end network platforms to create next generation data centers

Meta Platforms and Elon Musk’s xAI start-up are among companies building clusters of computer servers with as many as 100,000 of Nvidia’s most advanced GPU chips as the race for artificial-intelligence (AI) supremacy accelerates.

  • Meta Chief Executive Mark Zuckerberg said last month that his company was already training its most advanced AI models with a conglomeration of chips he called “bigger than anything I’ve seen reported for what others are doing.”
  • xAI built a supercomputer called Colossus—with 100,000 of Nvidia’s Hopper GPU/AI chips—in Memphis, TN in a matter of months.
  • OpenAI and Microsoft have been working to build up significant new computing facilities for AI. Google is building massive data centers to house chips that drive its AI strategy.

xAI built a supercomputer in Memphis that it calls Colossus, with 100,000 Nvidia AI chips. Photo: Karen Pulfer Focht/Reuters

A year ago, clusters of tens of thousands of GPU chips were seen as very large. OpenAI used around 10,000 of Nvidia’s chips to train the version of ChatGPT it launched in late 2022, UBS analysts estimate. Installing many GPUs in one location, linked together by superfast networking equipment and cables, has so far produced larger AI models at faster rates. But there are questions about whether ever-bigger super clusters will continue to translate into smarter chatbots and more convincing image-generation tools.

Nvidia Chief Executive Jensen Huang  said that while the biggest clusters for training for giant AI models now top out at around 100,000 of Nvidia’s current chips, “the next generation starts at around 100,000 Blackwells. And so that gives you a sense of where the industry is moving. Do we think that we need millions of GPUs? No doubt. That is a certainty now. And the question is how do we architect it from a data center perspective,” Huang added.

“There is no evidence that this will scale to a million chips and a $100 billion system, but there is the observation that they have scaled extremely well all the way from just dozens of chips to 100,000,” said Dylan Patel, the chief analyst at SemiAnalysis, a market research firm.

Giant super clusters are already getting built. Musk posted last month on his social-media platform X that his 100,000-chip Colossus super cluster was “soon to become” a 200,000-chip cluster in a single building. He also posted in June that the next step would probably be a 300,000-chip cluster of Nvidia’s newest GPU chips next summer.  The rise of super clusters comes as their operators prepare for Nvidia’s nexgen Blackwell chips, which are set to start shipping out in the next couple of months. Blackwell chips are estimated to cost around $30,000 each, meaning a cluster of 100,000 would cost $3 billion, not counting the price of the power-generation infrastructure and IT equipment around the chips.

Those dollar figures make building up super clusters with ever more chips something of a gamble, industry insiders say, given that it isn’t clear that they will improve AI models to a degree that justifies their cost. Indeed, new engineering challenges also often arise with larger clusters:

  • Meta researchers said in a July paper that a cluster of more than 16,000 of Nvidia’s GPUs suffered from unexpected failures of chips and other components routinely as the company trained an advanced version of its Llama model over 54 days.
  • Keeping Nvidia’s chips cool is a major challenge as clusters of power-hungry chips become packed more closely together, industry executives say, part of the reason there is a shift toward liquid cooling where refrigerant is piped directly to chips to keep them from overheating.
  • The sheer size of the super clusters requires a stepped-up level of management of those chips when they fail. Mark Adams, chief executive of Penguin Solutions, a company that helps set up and operate computing infrastructure, said elevated complexity in running large clusters of chips inevitably throws up problems.

The continuation of the AI boom for Nvidia largely depends on how the largest clusters of GPU chips deliver a return on investment for its customers. The trend also fosters demand for Nvidia’s networking equipment, which is fast becoming a significant business. Nvidia’s networking equipment revenue in 2024 was $3.13 billion, which was a 51.8% increase from the previous year.  Mostly from its Mellanox acquisition, Nvidia offers these networking platforms:

  • Accelerated Ethernet Switching for AI and the Cloud

  • Quantum InfiniBand for AI and Scientific Computing

  • Bluefield® Network Accelerators

………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………..

Nvidia forecasts total fiscal fourth-quarter sales of about $37.5bn, up 70%. That was above average analyst projections of $37.1bn, compiled by Bloomberg, but below some projections that were as high as $41bn. “Demand for Hopper and anticipation for Blackwell – in full production – are incredible as foundation model makers scale pretraining, post-training and inference, Huang said.  “Both Hopper and Blackwell systems have certain supply constraints, and the demand for Blackwell is expected to exceed supply for several quarters in fiscal 2026,” CFO Colette Kress said.

References:

https://www.wsj.com/tech/ai/nvidia-chips-ai-race-96d21d09?mod=tech_lead_pos5

https://www.datacenterdynamics.com/en/news/nvidias-data-center-revenue-up-112-over-last-year-as-ai-boom-continues/

https://www.nvidia.com/en-us/networking/

https://nvidianews.nvidia.com/news/nvidia-announces-financial-results-for-third-quarter-fiscal-2025

6 thoughts on “Superclusters of Nvidia GPU/AI chips combined with end-to-end network platforms to create next generation data centers

  1. Nvidia on Monday showed a new artificial intelligence model for generating music and audio that can modify voices and generate novel sounds – technology aimed at the producers of music, films and video games.
    Nvidia, the world’s biggest supplier of chips and software used to create AI systems, said it does not have immediate plans to publicly release the technology, which it calls Fugatto, short for Foundational Generative Audio Transformer Opus 1.

    It joins other technologies shown by startups such as Runway and larger players such as Meta Platforms that can generate audio or video from a text prompt. Santa Clara, Ca based Nvidia’s version generates sound effects and music from a text description, including novel sounds such as making a trumpet bark like a dog. What makes it different from other AI technologies is its ability to take in and modify existing audio, for example by taking a line played on a piano and transforming it into a line sung by a human voice, or by taking a spoken word recording and changing the accent used and the mood expressed.

    “If we think about synthetic audio over the past 50 years, music sounds different now because of computers, because of synthesizers,” said Bryan Catanzaro, vice president of applied deep learning research at Nvidia. “I think that generative AI is going to bring new capabilities to music, to video games and to ordinary folks that want to create things.”

    While companies such as OpenAI are negotiating with Hollywood studios over whether and how the AI could be used in the entertainment industry, the relationship between tech and Hollywood has become tense, particularly after Hollywood star Scarlett Johansson accused OpenAI of imitating her voice.

    Nvidia’s new model was trained on open-source data, and the company said it is still debating whether and how to release it publicly.

    “Any generative technology always carries some risks, because people might use that to generate things that we would prefer they don’t,” Catanzaro said. “We need to be careful about that, which is why we don’t have immediate plans to release this.”

    OpenAI and Meta similarly have not said when they plan to release to the public their models that generate audio or video. Creators of generative AI models have yet to determine how to prevent abuse of the technology such as a user generating misinformation or infringing on copyrights by generating copyrighted characters.

    https://www.reuters.com/technology/artificial-intelligence/nvidia-shows-ai-model-that-can-modify-voices-generate-novel-sounds-2024-11-25/

  2. The article highlights the rapid advancement in AI infrastructure, with companies like Meta, xAI, and OpenAI building massive superclusters of Nvidia GPUs to accelerate AI model training. These superclusters, sometimes numbering in the hundreds of thousands of GPUs, present both tremendous opportunities for AI innovation and significant challenges, including system management and power requirements. The piece underscores the importance of these superclusters in scaling AI models while also acknowledging the engineering hurdles and the high costs involved. The future of AI depends on how effectively these challenges are managed.

    1. Thanks for your cogent comment. 100% agree that “the future of AI depends on how effectively these challenges are managed.”

  3. I would like to underline that at the heart of the Spectrum-X platform is the Spectrum-X Ethernet switching device for 100.000 ports, which supports port speeds of up to 800Gbit/s.
    Spectrum-X Ethernet networking for AI delivers high bandwidth with low latency. Features include adaptive routing with NVIDIA Direct Data Placement technology, congestion control, as well as enhanced AI fabric visibility and performance isolation.
    A big step forward for the next generation of AI data centers.

    1. Many thanks for your incisive comment Maurizio. I really miss not seeing you and will try to visit you in Milano in Spring 2025!

Leave a Reply

Your email address will not be published.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

*