Superclusters of Nvidia GPU/AI chips combined with end-to-end network platforms to create next generation data centers

Meta Platforms and Elon Musk’s xAI start-up are among companies building clusters of computer servers with as many as 100,000 of Nvidia’s most advanced GPU chips as the race for artificial-intelligence (AI) supremacy accelerates.

  • Meta Chief Executive Mark Zuckerberg said last month that his company was already training its most advanced AI models with a conglomeration of chips he called “bigger than anything I’ve seen reported for what others are doing.”
  • xAI built a supercomputer called Colossus—with 100,000 of Nvidia’s Hopper GPU/AI chips—in Memphis, TN in a matter of months.
  • OpenAI and Microsoft have been working to build up significant new computing facilities for AI. Google is building massive data centers to house chips that drive its AI strategy.

xAI built a supercomputer in Memphis that it calls Colossus, with 100,000 Nvidia AI chips. Photo: Karen Pulfer Focht/Reuters

A year ago, clusters of tens of thousands of GPU chips were seen as very large. OpenAI used around 10,000 of Nvidia’s chips to train the version of ChatGPT it launched in late 2022, UBS analysts estimate. Installing many GPUs in one location, linked together by superfast networking equipment and cables, has so far produced larger AI models at faster rates. But there are questions about whether ever-bigger super clusters will continue to translate into smarter chatbots and more convincing image-generation tools.

Nvidia Chief Executive Jensen Huang said that while the biggest clusters for training giant AI models now top out at around 100,000 of Nvidia’s current chips, “the next generation starts at around 100,000 Blackwells. And so that gives you a sense of where the industry is moving. Do we think that we need millions of GPUs? No doubt. That is a certainty now. And the question is how do we architect it from a data center perspective,” Huang added.

“There is no evidence that this will scale to a million chips and a $100 billion system, but there is the observation that they have scaled extremely well all the way from just dozens of chips to 100,000,” said Dylan Patel, the chief analyst at SemiAnalysis, a market research firm.

Giant super clusters are already getting built. Musk posted last month on his social-media platform X that his 100,000-chip Colossus super cluster was “soon to become” a 200,000-chip cluster in a single building. He also posted in June that the next step would probably be a 300,000-chip cluster of Nvidia’s newest GPU chips next summer.  The rise of super clusters comes as their operators prepare for Nvidia’s next-generation Blackwell chips, which are set to start shipping in the next couple of months. Blackwell chips are estimated to cost around $30,000 each, meaning a cluster of 100,000 would cost $3 billion, not counting the price of the power-generation infrastructure and IT equipment around the chips.
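The arithmetic behind that estimate is simple to sketch. The helper below uses the article's $30,000-per-chip figure and, as the article notes, excludes networking, power and facility costs, which can rival the chip bill itself:

```python
# Back-of-envelope GPU cluster cost, using the per-chip estimate cited above.
# Figures are illustrative, not a vendor price list.

def chip_cost_usd(num_chips: int, unit_price_usd: float = 30_000) -> float:
    """Total GPU spend for a cluster, excluding all surrounding infrastructure."""
    return num_chips * unit_price_usd

print(f"100k-chip cluster: ${chip_cost_usd(100_000) / 1e9:.1f}B")  # $3.0B
print(f"200k-chip cluster: ${chip_cost_usd(200_000) / 1e9:.1f}B")  # $6.0B
```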

Those dollar figures make building up super clusters with ever more chips something of a gamble, industry insiders say, given that it isn’t clear they will improve AI models to a degree that justifies their cost. New engineering challenges also often arise with larger clusters:

  • Meta researchers said in a July paper that a cluster of more than 16,000 of Nvidia’s GPUs routinely suffered unexpected failures of chips and other components as the company trained an advanced version of its Llama model over 54 days.
  • Keeping Nvidia’s chips cool is a major challenge as clusters of power-hungry chips become packed more closely together, industry executives say, part of the reason there is a shift toward liquid cooling where refrigerant is piped directly to chips to keep them from overheating.
  • The sheer size of the super clusters requires a stepped-up level of management of those chips when they fail. Mark Adams, chief executive of Penguin Solutions, a company that helps set up and operate computing infrastructure, said elevated complexity in running large clusters of chips inevitably throws up problems.
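A rough reliability calculation shows why failures at this scale are routine rather than exceptional: expected failures grow linearly with component count, so even very reliable parts fail often in aggregate. The 2% annualized failure rate below is an assumed, illustrative figure, not a number from Meta's paper:

```python
# Expected component failures over a training run: linear in cluster size.
# The annual failure rate here is assumed for illustration only.

def expected_failures(components: int, annual_failure_rate: float, days: float) -> float:
    """Expected failure count for `components` parts over `days` of operation."""
    return components * annual_failure_rate * (days / 365.0)

# 16,000 GPUs, each with an assumed 2% annualized failure rate, over 54 days:
print(round(expected_failures(16_000, 0.02, 54), 1))  # dozens of failures per run
```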

The continuation of the AI boom for Nvidia largely depends on how the largest clusters of GPU chips deliver a return on investment for its customers. The trend also fosters demand for Nvidia’s networking equipment, which is fast becoming a significant business: Nvidia’s networking revenue in 2024 was $3.13 billion, a 51.8% increase from the previous year.  Built largely on its Mellanox acquisition, Nvidia offers these networking platforms:

  • Accelerated Ethernet Switching for AI and the Cloud

  • Quantum InfiniBand for AI and Scientific Computing

  • BlueField® Network Accelerators

………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………..

Nvidia forecasts total fiscal fourth-quarter sales of about $37.5bn, up 70%. That was above the average analyst projection of $37.1bn compiled by Bloomberg, but below some projections that were as high as $41bn. “Demand for Hopper and anticipation for Blackwell – in full production – are incredible as foundation model makers scale pretraining, post-training and inference,” Huang said.  “Both Hopper and Blackwell systems have certain supply constraints, and the demand for Blackwell is expected to exceed supply for several quarters in fiscal 2026,” CFO Colette Kress said.

References:

https://www.wsj.com/tech/ai/nvidia-chips-ai-race-96d21d09?mod=tech_lead_pos5

https://www.datacenterdynamics.com/en/news/nvidias-data-center-revenue-up-112-over-last-year-as-ai-boom-continues/

https://www.nvidia.com/en-us/networking/

https://nvidianews.nvidia.com/news/nvidia-announces-financial-results-for-third-quarter-fiscal-2025

FT: New benchmarks for Gen AI models; Neocloud groups leverage Nvidia chips to borrow >$11B

The Financial Times reports that technology companies are rushing to redesign how they test and evaluate their Gen AI models, as current AI benchmarks appear to be inadequate.  AI benchmarks are used to assess how well an AI model can generate content that is coherent, relevant, and creative. This can include generating text, images, music, or any other form of content.

OpenAI, Microsoft, Meta and Anthropic have all recently announced plans to build AI agents that can execute tasks for humans autonomously on their behalf. To do this effectively, the AI systems must be able to perform increasingly complex actions, using reasoning and planning.

Current public AI benchmarks — Hellaswag and MMLU — use multiple-choice questions to assess common sense and knowledge across various topics. However, researchers argue this method is now becoming redundant and models need more complex problems.

“We are getting to the era where a lot of the human-written tests are no longer sufficient as a good barometer for how capable the models are,” said Mark Chen, senior vice-president of research at OpenAI. “That creates a new challenge for us as a research world.”

The SWE-bench Verified benchmark was updated in August to better evaluate autonomous systems, based on feedback from companies including OpenAI. It uses real-world software problems sourced from the developer platform GitHub: the AI agent is supplied with a code repository and an engineering issue and asked to fix it. The tasks require reasoning to complete.
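The agentic-benchmark pattern described above can be sketched in miniature: hand an "agent" buggy code plus an issue description, take back a patched version, and score it by running tests the agent never sees. Everything below (the toy bug, the stub agent, the hidden tests) is invented for illustration; the real harness operates on whole GitHub repositories with a full sandbox, not a single function:

```python
# Minimal illustration of an agentic code-fixing benchmark:
# issue + buggy source in, patched source out, scored by hidden tests.

BUGGY = "def area(w, h):\n    return w + h  # bug\n"
ISSUE = "area() returns the sum of the sides instead of the product."

def fake_agent(source: str, issue: str) -> str:
    # A real agent would reason over the repository; this stub just emits the fix.
    return source.replace("w + h", "w * h")

def run_hidden_tests(source: str) -> bool:
    ns: dict = {}
    exec(source, ns)  # load the candidate patch into a scratch namespace
    return ns["area"](3, 4) == 12 and ns["area"](2, 5) == 10

patched = fake_agent(BUGGY, ISSUE)
print("resolved" if run_hidden_tests(patched) else "unresolved")  # resolved
```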

“It is a lot more challenging [with agentic systems] because you need to connect those systems to lots of extra tools,” said Jared Kaplan, chief science officer at Anthropic.

“You have to basically create a whole sandbox environment for them to play in. It is not as simple as just providing a prompt, seeing what the completion is and then evaluating that.”

Another important factor when conducting more advanced tests is to make sure the benchmark questions are kept out of the public domain, in order to ensure the models do not effectively “cheat” by generating the answers from training data rather than solving the problem.

The need for new benchmarks has also led to efforts by external organizations. In September, the start-up Scale AI announced a project called “Humanity’s Last Exam,” which crowdsourced complex questions from experts across different disciplines that required abstract reasoning to complete.

Meanwhile, the Financial Times recently reported that Wall Street’s largest financial institutions have loaned more than $11bn to “neocloud” groups, backed by their holdings of Nvidia’s AI GPU chips. These companies, which include CoreWeave, Crusoe and Lambda, provide cloud computing services to tech businesses building AI products. They have acquired tens of thousands of Nvidia’s graphics processing units (GPUs) through partnerships with the chipmaker. With capital expenditure on data centres surging in the rush to develop AI models, Nvidia’s AI GPU chips have become a precious commodity.

Nvidia’s chips have become a precious commodity in the ongoing race to develop AI models © Marlena Sloss/Bloomberg

…………………………………………………………………………………………………………………………………

The $3tn tech group’s allocation of chips to neocloud groups has given confidence to Wall Street lenders to lend billions of dollars to the companies that are then used to buy more Nvidia chips. Nvidia is itself an investor in neocloud companies that in turn are among its largest customers. Critics have questioned the ongoing value of the collateralised chips as new advanced versions come to market — or if the current high spending on AI begins to retract. “The lenders all coming in push the story that you can borrow against these chips and add to the frenzy that you need to get in now,” said Nate Koppikar, a short seller at hedge fund Orso Partners. “But chips are a depreciating, not appreciating, asset.”

References:

https://www.ft.com/content/866ad6e9-f8fe-451f-9b00-cb9f638c7c59

https://www.ft.com/content/fb996508-c4df-4fc8-b3c0-2a638bb96c19

https://www.ft.com/content/41bfacb8-4d1e-4f25-bc60-75bf557f1f21

Tata Consultancy Services: Critical role of Gen AI in 5G; 5G private networks and enterprise use cases

Reuters & Bloomberg: OpenAI to design “inference AI” chip with Broadcom and TSMC

AI adoption to accelerate growth in the $215 billion Data Center market

AI Echo Chamber: “Upstream AI” companies huge spending fuels profit growth for “Downstream AI” firms

AI winner Nvidia faces competition with new super chip delayed

 

Nvidia enters Data Center Ethernet market with its Spectrum-X networking platform

Nvidia is planning a big push into the Data Center Ethernet market. CFO Colette Kress said the Spectrum-X Ethernet-based networking solution it launched in May 2023 is “well on track to become a multi-billion-dollar product line within a year.”  The Spectrum-X platform includes Ethernet switches, optics, cables and network interface cards (NICs).  Nvidia already has a multi-billion-dollar play in this space in the form of its Ethernet NIC product.  Kress said during Nvidia’s earnings call that “hundreds of customers have already adopted the platform” and that Nvidia plans to “launch new Spectrum-X products every year to support demand for scaling compute clusters from tens of thousands of GPUs today to millions of GPUs in the near future.”

  • With Spectrum-X, Nvidia will be competing with Arista, Cisco, and Juniper at the system level along with “bare metal switches” from Taiwanese ODMs running DriveNets network cloud software.
  • With respect to high performance Ethernet switching silicon, Nvidia competitors include Broadcom, Marvell, Microchip, and Cisco (which uses Silicon One internally and also sells it on the merchant semiconductor market).


Image by Midjourney for Fierce Network

…………………………………………………………………………………………………………………………………………………………………………..

In November 2023, Nvidia said it would work with Dell Technologies, Hewlett Packard Enterprise and Lenovo to incorporate Spectrum-X capabilities into their compute servers.  Nvidia is now targeting tier-2 cloud service providers and enterprise customers looking for bundled solutions.

Dell’Oro Group VP Sameh Boujelbene told Fierce Network that “Nvidia is positioning Spectrum-X for AI back-end network deployments as an alternative fabric to InfiniBand. While InfiniBand currently dominates AI back-end networks with over 80% market share, Ethernet switches optimized for AI deployments have been gaining ground very quickly.”  Boujelbene added Nvidia’s success with Spectrum-X thus far has largely been driven “by one major 100,000-GPU cluster, along with several smaller deployments by Cloud Service Providers.”  By 2028, Boujelbene said Dell’Oro expects Ethernet switches to surpass InfiniBand for AI in the back-end network market, with revenues exceeding $10 billion.

………………………………………………………………………………………………………………………………………………………………………………

In a recent IEEE Techblog post we wrote:

While InfiniBand currently has the edge in the data center networking market, several factors point to increased Ethernet adoption for AI clusters in the future. Recent innovations are addressing Ethernet’s shortcomings compared to InfiniBand:

  • Lossless Ethernet technologies
  • RDMA over Converged Ethernet (RoCE)
  • Ultra Ethernet Consortium’s AI-focused specifications

Some real-world tests have shown Ethernet offering up to 10% improvement in job completion performance across all packet sizes compared to InfiniBand in complex AI training tasks.  By 2028, it’s estimated that: 1] 45% of generative AI workloads will run on Ethernet (up from <20% now) and 2] 30% will run on InfiniBand (up from <20% now).

………………………………………………………………………………………………………………………………………………………………………………

References:

https://www.fierce-network.com/cloud/data-center-ethernet-nvidias-next-multi-billion-dollar-business

https://www.nvidia.com/en-us/networking/spectrumx/

https://investor.nvidia.com/news/press-release-details/2023/NVIDIAs-New-Ethernet-Networking-Platform-for-AI-Available-Soon-From-Dell-Technologies-Hewlett-Packard-Enterprise-Lenovo/default.aspx

https://investor.nvidia.com/news/press-release-details/2024/NVIDIA-Announces-Financial-Results-for-Second-Quarter-Fiscal-2025/default.aspx

Will AI clusters be interconnected via Infiniband or Ethernet: NVIDIA doesn’t care, but Broadcom sure does!

Data Center Networking Market to grow at a CAGR of 6.22% during 2022-2027 to reach $35.6 billion by 2027

LightCounting: Optical Ethernet Transceiver sales will increase by 40% in 2024

Will AI clusters be interconnected via Infiniband or Ethernet: NVIDIA doesn’t care, but Broadcom sure does!

InfiniBand, which has been used extensively for HPC interconnect, currently dominates AI networking, accounting for about 90% of deployments. That is largely due to its very low latency and an architecture that reduces packet loss, which is beneficial for AI training workloads.  Packet loss slows AI training workloads, and they’re already expensive and time-consuming. This is probably why Microsoft chose to run InfiniBand when building out its data centers to support machine learning workloads.  However, InfiniBand tends to lag Ethernet in terms of top speeds. Nvidia’s latest Quantum InfiniBand switches top out at 25.6 Tb/s of switching capacity with 400 Gb/s ports. By comparison, Ethernet switching hit 51.2 Tb/s nearly two years ago and can support 800 Gb/s port speeds.
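As a sanity check on switch figures like these, aggregate switching capacity is simply port count times per-port speed. The 64-port configurations below are assumed for illustration; 51.2 Tb/s corresponds to today's top-end Ethernet silicon at 800 Gb/s per port:

```python
# Aggregate switch capacity = (port count) x (per-port speed).
# Port counts are illustrative assumptions, not a specific product's spec.

def switch_capacity_tbps(ports: int, port_speed_gbps: int) -> float:
    """Aggregate one-direction switching capacity in Tb/s."""
    return ports * port_speed_gbps / 1000.0

print(switch_capacity_tbps(64, 800))  # 51.2 Tb/s with 800G ports
print(switch_capacity_tbps(64, 400))  # 25.6 Tb/s with 400G ports
```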

While InfiniBand currently has the edge, several factors point to increased Ethernet adoption for AI clusters in the future. Recent innovations are addressing Ethernet’s shortcomings compared to InfiniBand:

  • Lossless Ethernet technologies
  • RDMA over Converged Ethernet (RoCE)
  • Ultra Ethernet Consortium’s AI-focused specifications

Some real-world tests have shown Ethernet offering up to 10% improvement in job completion performance across all packet sizes compared to InfiniBand in complex AI training tasks.  By 2028, it’s estimated that: 1] 45% of generative AI workloads will run on Ethernet (up from <20% now) and 2] 30% will run on InfiniBand (up from <20% now).

In a lively session at VMware-Broadcom’s Explore event, panelists were asked how best to network together the GPUs, and other data center infrastructure, needed to deliver AI. Broadcom’s Ram Velaga, SVP and GM of the Core Switching Group, was unequivocal: “Ethernet will be the technology to make this happen.”  In his opening remarks, Velaga asked the audience, “Think about…what is machine learning and how is that different from cloud computing?” Cloud computing, he said, is about driving utilization of CPUs; with ML, it’s the opposite.

“No one…machine learning workload can run on a single GPU…No single GPU can run an entire machine learning workload. You have to connect many GPUs together…so machine learning is a distributed computing problem. It’s actually the opposite of a cloud computing problem,” Velaga added.

Nvidia (which acquired Israeli fabless interconnect chip maker Mellanox [1.] in 2019) says, “Infiniband provides dramatic leaps in performance to achieve faster time to discovery with less cost and complexity.”  Velaga disagrees, saying “InfiniBand is expensive, fragile and predicated on the faulty assumption that the physical infrastructure is lossless.”

Note 1. Mellanox specialized in switched fabrics for enterprise data centers and high-performance computing, where high data rates and low latency are required, such as in a computer cluster.

…………………………………………………………………………………………………………………………………………..

Ethernet, on the other hand, has been the subject of ongoing innovation and advancement, Velaga said, citing the following selling points:

  • Pervasive deployment
  • Open and standards-based
  • Highest Remote Direct Memory Access (RDMA) performance for AI fabrics
  • Lowest cost compared to proprietary tech
  • Consistent across front-end, back-end, storage and management networks
  • High availability, reliability and ease of use
  • Broad silicon, hardware, software, automation, monitoring and debugging solutions from a large ecosystem

To that last point, Velaga said, “We steadfastly have been innovating in this world of Ethernet. When there’s so much competition, you have no choice but to innovate.” InfiniBand, he said, is “a road to nowhere.” It should be noted that Broadcom (which now owns VMware) is the largest supplier of Ethernet switching chips for every part of a service provider network (see diagram below). Broadcom’s Jericho3-AI silicon, which can connect up to 32,000 GPU chips together, competes head-on with InfiniBand!

Image Courtesy of Broadcom

………………………………………………………………………………………………………………………………………………………..

Conclusions:

While InfiniBand currently dominates AI networking, Ethernet is rapidly evolving to meet AI workload demands. The future will likely see a mix of both technologies, with Ethernet gaining significant ground due to its improvements, cost-effectiveness, and widespread compatibility. Organizations will need to evaluate their specific needs, considering factors like performance requirements, existing infrastructure, and long-term scalability when choosing between InfiniBand and Ethernet for AI clusters.

–> Well, it turns out that Nvidia’s Mellanox division in Israel makes BOTH InfiniBand AND Ethernet chips, so they win either way!

…………………………………………………………………………………………………………………………………………………………………………..

References:

https://www.perplexity.ai/search/will-ai-clusters-run-on-infini-uCYEbRjeR9iKAYH75gz8ZA

https://i0.wp.com/techjunction.co/wp-content/uploads/2023/10/InfiniBand-Topology.png?resize=768%2C420&ssl=1

https://www.theregister.com/2024/01/24/ai_networks_infiniband_vs_ethernet/

Broadcom on AI infrastructure networking—’Ethernet will be the technology to make this happen’

https://www.nvidia.com/en-us/networking/products/infiniband/

https://www.nvidia.com/en-us/networking/products/ethernet/

Part1: Unleashing Network Potentials: Current State and Future Possibilities with AI/ML

Using a distributed synchronized fabric for parallel computing workloads- Part II

Part-2: Unleashing Network Potentials: Current State and Future Possibilities with AI/ML


AI Echo Chamber: “Upstream AI” companies huge spending fuels profit growth for “Downstream AI” firms

According to the Wall Street Journal, the AI industry has become an “Echo Chamber,” in which huge capital spending by AI infrastructure and application providers has fueled revenue and profit growth for everyone else. Market research firm Bespoke Investment Group has recently created baskets for “downstream” and “upstream” AI companies.

  • The Downstream group involves “AI implementation,” which consists of firms that sell AI development tools, such as the large language models (LLMs) popularized by OpenAI’s ChatGPT since the end of 2022, or run products that can incorporate them. This includes Google/Alphabet, Microsoft, Amazon and Meta Platforms, along with IBM, Adobe and Salesforce.
  • Higher up the supply chain (Upstream group), are the “AI infrastructure” providers, which sell AI chips, applications, data centers and training software. The undisputed leader is Nvidia, which has seen its sales triple in a year, but it also includes other semiconductor companies, database developer Oracle and owners of data centers Equinix and Digital Realty.

The Upstream group of companies has posted profit margins that are far above what analysts expected a year ago. In the second quarter, and pending Nvidia’s results on Aug. 28th, Upstream AI members of the S&P 500 are set to have delivered a 50% annual increase in earnings. For the remainder of 2024, they will be increasingly responsible for the profit growth that Wall Street expects from the stock market—even accounting for Intel’s huge problems and restructuring.

It should be noted that the lines between the two groups can be blurry, particularly when it comes to giants such as Amazon, Microsoft and Alphabet, which provide both AI implementation (e.g., LLMs) and infrastructure: their cloud-computing businesses turned these companies into the early winners of the AI craze last year and reported breakneck growth during this latest earnings season.  A crucial point is that it is their role as ultimate developers of AI applications that has led them to make huge capital expenditures, which are responsible for the profit surge in the rest of the ecosystem.  So there is a definite trickle-down effect, where the big tech players’ AI-directed CAPEX is boosting revenue and profits for companies down the supply chain.

As the path for monetizing this technology gets longer and harder, the benefits seem to be increasingly accruing to companies higher up in the supply chain. Meta Platforms Chief Executive Mark Zuckerberg recently said the company’s coming Llama 4 language model will require 10 times as much computing power to train as its predecessor. Were it not for AI, revenues for semiconductor firms would probably have fallen during the second quarter, rather than rise 18%, according to S&P Global.

………………………………………………………………………………………………………………………………………………………..

A paper written by researchers from the likes of Cambridge and Oxford found that the large language models (LLMs) behind some of today’s most exciting AI apps may have been trained on “synthetic data,” or data generated by other AI. This revelation raises ethical and quality concerns. If an AI model is trained primarily or even partially on synthetic data, it might produce outputs lacking the richness and reliability of human-generated content. It could be a case of the blind leading the blind, with AI models reinforcing the limitations or biases inherent in the synthetic data they were trained on.

In this paper, the team coined the phrase “model collapse,” claiming that training models this way will answer user prompts with low-quality outputs. The idea of “model collapse” suggests a sort of unraveling of the machine’s learning capabilities, where it fails to produce outputs with the informative or nuanced characteristics we expect. This poses a serious question for the future of AI development. If AI is increasingly trained on synthetic data, we risk creating echo chambers of misinformation or low-quality responses, leading to less helpful and potentially even misleading systems.

……………………………………………………………………………………………………………………………………………

In a recent working paper, Massachusetts Institute of Technology (MIT) economist Daron Acemoglu argued that AI’s knack for easy tasks has led to exaggerated predictions of its power to enhance productivity in hard jobs. Also, some of the new tasks created by AI may have negative social value (such as design of algorithms for online manipulation).  Indeed, data from the Census Bureau show that only a small percentage of U.S. companies outside of the information and knowledge sectors are looking to make use of AI.

References:

https://www.wsj.com/tech/ai/the-big-risk-for-the-market-becoming-an-ai-echo-chamber-e8977de0?mod=tech_lead_pos4

https://deepgram.com/learn/the-ai-echo-chamber-model-collapse-synthetic-data-risks

https://economics.mit.edu/sites/default/files/2024-04/The%20Simple%20Macroeconomics%20of%20AI.pdf

AI wave stimulates big tech spending and strong profits, but for how long?

AI winner Nvidia faces competition with new super chip delayed

SK Telecom and Singtel partner to develop next-generation telco technologies using AI

Telecom and AI Status in the EU

Vodafone: GenAI overhyped, will spend $151M to enhance its chatbot with AI

Data infrastructure software: picks and shovels for AI; Hyperscaler CAPEX

AI winner Nvidia faces competition with new super chip delayed

The Clear AI Winner Is: Nvidia!

Strong AI spending should help Nvidia make its own ambitious numbers when it reports earnings at the end of the month (its 2Q-2024 ended July 31st). Analysts are expecting nearly $25 billion in data center revenue for the July quarter—about what that business was generating annually a year ago. But the latest results won’t quell the growing concern investors have with the pace of AI spending among the world’s largest tech giants—and how it will eventually pay off.

In March, Nvidia unveiled its Blackwell chip series, succeeding its earlier flagship AI chip, the GH200 Grace Hopper Superchip, which was designed to speed generative AI applications.  The NVIDIA GH200 NVL2 fully connects two GH200 Superchips with NVLink, delivering up to 288GB of high-bandwidth memory, 10 terabytes per second (TB/s) of memory bandwidth, and 1.2TB of fast memory. The GH200 NVL2 offers up to 3.5X more GPU memory capacity and 3X more bandwidth than the NVIDIA H100 Tensor Core GPU in a single server for compute- and memory-intensive workloads. The GH200 meanwhile combines an H100 chip [1.] with an Arm CPU and more memory.

Photo Credit: Nvidia

Note 1. The Nvidia H100 sits in a 10.5-inch graphics card, which is then bundled into a server rack alongside dozens of other H100 cards to create one massive data center computer.

This week, Nvidia informed Microsoft and another major cloud service provider of a delay in the production of its most advanced AI chip in the Blackwell series, the Information website said, citing a Microsoft employee and another person with knowledge of the matter.

…………………………………………………………………………………………………………………………………………

Nvidia Competitors Emerge – but are their chips ONLY for internal use?

In addition to AMD, Nvidia has several big tech competitors that are currently not in the merchant market semiconductor business. These include:

  • Huawei has developed the Ascend series of chips to rival Nvidia’s AI chips, with the Ascend 910B chip as its main competitor to Nvidia’s A100 GPU chip. Huawei is the second largest cloud services provider in China, just behind Alibaba and ahead of Tencent.
  • Microsoft has unveiled an AI chip called the Azure Maia AI Accelerator, optimized for artificial intelligence (AI) tasks and generative AI as well as the Azure Cobalt CPU, an Arm-based processor tailored to run general purpose compute workloads on the Microsoft Cloud.
  • Last year, Meta announced it was developing its own AI hardware. This past April, Meta announced its next generation of custom-made processor chips designed for their AI workloads. The latest version significantly improves performance compared to the last generation and helps power their ranking and recommendation ads models on Facebook and Instagram.
  • Also in April, Google revealed the details of a new version of its data center AI chips and announced an Arm-based central processor. Google’s decade-old Tensor Processing Units (TPUs) are one of the few viable alternatives to the advanced AI chips made by Nvidia, though developers can only access them through Google’s Cloud Platform and not buy them directly.

As demand for generative AI services continues to grow, it’s evident that GPU chips will be the next big battleground for AI supremacy.

References:

AI Frenzy Backgrounder; Review of AI Products and Services from Nvidia, Microsoft, Amazon, Google and Meta; Conclusions

https://www.nvidia.com/en-us/data-center/grace-hopper-superchip/

https://www.theverge.com/2024/2/1/24058186/ai-chips-meta-microsoft-google-nvidia/archives/2

https://news.microsoft.com/source/features/ai/in-house-chips-silicon-to-service-to-meet-ai-demand/

https://www.reuters.com/technology/artificial-intelligence/delay-nvidias-new-ai-chip-could-affect-microsoft-google-meta-information-says-2024-08-03/

https://www.theinformation.com/articles/nvidias-new-ai-chip-is-delayed-impacting-microsoft-google-meta

AI wave stimulates big tech spending and strong profits, but for how long?

Big tech companies have made it clear over the last week that they have no intention of slowing down their stunning levels of spending on artificial intelligence (AI), even though investors are getting worried that a big payoff is further down the line than most believe.

In the last quarter, Apple, Amazon, Meta, Microsoft and Google’s parent company Alphabet spent a combined $59 billion on capital expenses, 63% more than a year earlier and 161% more than four years ago. A large part of that was funneled into building data centers and packing them with new computer systems to build artificial intelligence. Only Apple has not dramatically increased spending, because it does not build the most advanced AI systems and is not a cloud service provider like the others.

At the beginning of this year, Meta said it would spend more than $30 billion in 2024 on new tech infrastructure. In April, it raised that figure to $35 billion. On Wednesday, it increased the figure to at least $37 billion. CEO Mark Zuckerberg said Meta would spend even more next year.  He said he would rather build too fast “rather than too late,” and risk letting his competitors get a big lead in the AI race. Meta gives away the advanced AI systems it develops, but Zuckerberg said it was still worth it. “Part of what’s important about AI is that it can be used to improve all of our products in almost every way,” he said.

………………………………………………………………………………………………………………………………………………………..

This new wave of generative AI is incredibly expensive. The systems work with vast amounts of data and require sophisticated computer chips and new data centers to develop the technology and serve it to customers. The companies are seeing some sales from their AI work, but it is barely moving the needle financially.

In recent months, several high-profile tech industry watchers, including Goldman Sachs’s head of equity research and a partner at the venture firm Sequoia Capital, have questioned when or if AI will ever produce enough benefit to bring in the sales needed to cover its staggering costs. It is not clear that AI will come close to having the same impact as the internet or mobile phones, Goldman’s Jim Covello wrote in a June report.

“What $1 trillion problem will AI solve?” he wrote. “Replacing low wage jobs with tremendously costly technology is basically the polar opposite of the prior technology transitions I’ve witnessed in my 30 years of closely following the tech industry.”

“The reality right now is that while we’re investing a significant amount in the AI space and in infrastructure, we would like to have more capacity than we already have today,” said Andy Jassy, Amazon’s chief executive. “I mean, we have a lot of demand right now.”

That means buying land, building data centers and all the computers, chips and gear that go into them. Amazon executives put a positive spin on all that spending. “We use that to drive revenue and free cash flow for the next decade and beyond,” said Brian Olsavsky, the company’s finance chief.

There are plenty of signs the boom will persist. In mid-July, Taiwan Semiconductor Manufacturing Co., which makes most of the in-demand chips designed by Nvidia (the ONLY tech company that is now making money from AI – much more below) for use in AI systems, said those chips would be in scarce supply until the end of 2025.

Mr. Zuckerberg said AI’s potential is super exciting. “It’s why there are all the jokes about how all the tech C.E.O.s get on these earnings calls and just talk about A.I. the whole time.”

……………………………………………………………………………………………………………………

Big tech profits and revenue continue to grow, but will massive spending produce a good ROI?

Last week’s Q2-2024 results:

  • Google parent Alphabet reported $24 billion net profit on $85 billion revenue.
  • Microsoft reported $22 billion net profit on $65 billion revenue.
  • Meta reported $13.5 billion net profit on $39 billion revenue.
  • Apple reported $21 billion net profit on $86 billion revenue.
  • Amazon reported $13.5 billion net profit on $148 billion revenue.
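One derived view of the Q2-2024 figures listed above (my own arithmetic, not from the article) is each company's net profit margin, which shows how differently the same headline profits translate into profitability:

```python
# Net profit margins implied by the Q2-2024 results listed above.
results = {                     # (net profit, revenue), in $ billions
    "Alphabet":  (24.0,  85.0),
    "Microsoft": (22.0,  65.0),
    "Meta":      (13.5,  39.0),
    "Apple":     (21.0,  86.0),
    "Amazon":    (13.5, 148.0),
}

for company, (profit, revenue) in results.items():
    print(f"{company:<9} net margin: {profit / revenue:6.1%}")
```

Meta and Microsoft convert well over 30% of revenue into profit, while Amazon's retail-heavy business keeps its margin under 10% despite the largest revenue.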

This chart sums it all up:

………………………………………………………………………………………………………………………………………………………..

References:

https://www.nytimes.com/2024/08/02/technology/tech-companies-ai-spending.html

https://www.wsj.com/business/telecom/amazon-apple-earnings-63314b6c?st=40v8du7p5rxq72j&reflink=desktopwebshare_permalink

https://www.axios.com/2024/08/02/google-microsoft-meta-ai-earnings

https://www.nvidia.com/en-us/data-center/grace-hopper-superchip/

AI Frenzy Backgrounder; Review of AI Products and Services from Nvidia, Microsoft, Amazon, Google and Meta; Conclusions


Amdocs and NVIDIA to Accelerate Adoption of Generative AI for $1.7 Trillion Telecom Industry

Amdocs and NVIDIA today announced they are collaborating to optimize large language models (LLMs) to speed adoption of generative AI applications and services across the $1.7 trillion telecommunications and media industries.(1)

Amdocs and NVIDIA will customize enterprise-grade LLMs running on NVIDIA accelerated computing as part of the Amdocs amAIz framework. The collaboration will empower communications service providers to efficiently deploy generative AI use cases across their businesses, from customer experiences to network provisioning.

Amdocs will use NVIDIA DGX Cloud AI supercomputing and NVIDIA AI Enterprise software to support flexible adoption strategies and help ensure service providers can simply and safely use generative AI applications.

Aligned with the Amdocs strategy of advancing generative AI use cases across the industry, the collaboration with NVIDIA builds on the previously announced Amdocs-Microsoft partnership. Service providers and media companies can adopt these applications in secure and trusted environments, including on premises and in the cloud.

With these new capabilities — including the NVIDIA NeMo framework for custom LLM development and guardrail features — service providers can benefit from enhanced performance, optimized resource utilization and flexible scalability to support emerging and future needs.

“NVIDIA and Amdocs are partnering to bring a unique platform and unmatched value proposition to customers,” said Shuky Sheffer, Amdocs Management Limited president and CEO. “By combining NVIDIA’s cutting-edge AI infrastructure, software and ecosystem and Amdocs’ industry-first amAIz AI framework, we believe that we have an unmatched offering that is both future-ready and value-additive for our customers.”

“Across a broad range of industries, enterprises are looking for the fastest, safest path to apply generative AI to boost productivity,” said Jensen Huang, founder and CEO of NVIDIA. “Our collaboration with Amdocs will help telco service providers automate personalized assistants, service ticket routing and other use cases for their billions of customers, and help the telcos analyze and optimize their operations.”

Amdocs counts more than 350 of the world’s leading telecom and media companies as customers, including 27 of the world’s top 30 service providers.(2) With more than 1.7 billion daily digital journeys, Amdocs platforms impact more than 3 billion people around the world.

NVIDIA and Amdocs are exploring a number of generative AI use cases to simplify and improve operations by providing secure, cost-effective and high-performance generative AI capabilities.

Initial use cases span customer care, including accelerating customer inquiry resolution by drawing information from across company data. On the network operations side, the companies are exploring how to proactively generate solutions that address configuration, coverage or performance issues as they arise.
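The customer-care use case described above, answering inquiries by drawing on company data, is typically built as retrieval-augmented generation: retrieve the most relevant internal documents, then feed them to an LLM as context. A minimal sketch of the retrieval step, using a toy keyword-overlap scorer in place of a real embedding model (the documents and function names here are hypothetical, not from Amdocs or NVIDIA):

```python
# Toy retrieval step for a customer-care assistant: rank support documents
# by relevance to an inquiry, then assemble a grounded prompt for an LLM.
# (Production systems use vector embeddings; keyword overlap stands in here.)

def score(query: str, doc: str) -> int:
    """Relevance = number of query words that also appear in the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k highest-scoring documents for the query."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Assemble an LLM prompt grounded in the retrieved context."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

support_docs = [
    "Roaming charges apply when your phone connects to networks abroad.",
    "To reset your router, hold the power button for ten seconds.",
    "Billing disputes must be filed within sixty days of the statement.",
]

print(build_prompt("Why is there a roaming charge on my bill?", support_docs))
```

Grounding the model in retrieved documents (rather than relying on its training data alone) is also where guardrail features like those mentioned above come in, constraining answers to approved company content.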

(1) Source: IDC, OMDIA, Factset analyses of Telecom 2022-2023 revenue.
(2) Source: OMDIA 2022 revenue estimates, excludes China.

Editor’s Note:

Generative AI uses a variety of AI models, including: 

  • Language models: These models generate human-like text; the most prominent examples are large language models (LLMs) such as OpenAI’s GPT-3. LLMs are being leveraged for a wide variety of tasks, including essay generation, code development, translation, and even understanding genetic sequences.
  • Generative adversarial networks (GANs): These models pit two neural networks against each other: a generator that produces candidate outputs and a discriminator that tries to distinguish them from real data.
  • Unimodal models: These models accept only one input format, such as text.
  • Multimodal models: These models accept multiple types of inputs and prompts. For example, GPT-4 can accept both text and images as inputs.
  • Variational autoencoders (VAEs): These deep learning architectures encode data into a compressed latent representation and decode it back, and are frequently used to build generative AI models.
  • Foundation models: These large, general-purpose models are trained on broad data and generate output from one or more inputs (prompts), often given as human-language instructions.

Other related architectures and named models include transformers, genetic algorithms, rule-based systems, and specific models such as LaMDA, LLaMA, BLOOM, BERT and RoBERTa.
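The adversarial dynamic in the GAN bullet above can be illustrated with a deliberately tiny example: a one-parameter "generator" that shifts noise toward a target distribution, and a logistic-regression "discriminator" that tries to tell generated samples from real ones. This is a toy sketch of the training dynamic, not a production GAN; all numbers are arbitrary:

```python
# Toy 1-D GAN: the generator learns a single shift parameter mu so that
# its samples fool a logistic-regression discriminator trained on real data.
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(loc=4.0, scale=0.5, size=256)  # "real" data, mean 4

mu = 0.0       # generator parameter: starts far from the real mean
w, b = 0.0, 0.0  # discriminator: p(real | x) = sigmoid(w*x + b)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 0.05
for step in range(2000):
    z = rng.normal(size=256)
    fake = 0.5 * z + mu                      # generator output

    # Discriminator update: push p(real) -> 1 on real data, -> 0 on fakes.
    p_real = sigmoid(w * real + b)
    p_fake = sigmoid(w * fake + b)
    w -= lr * (np.mean((p_real - 1) * real) + np.mean(p_fake * fake))
    b -= lr * (np.mean(p_real - 1) + np.mean(p_fake))

    # Generator update: shift mu so the discriminator labels fakes as real.
    p_fake = sigmoid(w * fake + b)
    mu -= lr * np.mean((p_fake - 1) * w)     # gradient of -log p(real|fake)

print(f"learned mu = {mu:.2f}")  # drifts toward the real mean (about 4)
```

The generator never sees the real data directly; it improves only through the discriminator's feedback, which is the core idea behind GAN training.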
…………………………………………………………………………………………………………………………………
References:

https://nvidianews.nvidia.com/news/amdocs-and-nvidia-to-accelerate-adoption-of-generative-ai-for-1-7-trillion-telecom-industry

https://www.nvidia.com/en-us/glossary/data-science/generative-ai/

https://blogs.nvidia.com/blog/2023/01/26/what-are-large-language-models-used-for/
