Nvidia enters Data Center Ethernet market with its Spectrum-X networking platform
Nvidia is planning a big push into the Data Center Ethernet market. CFO Colette Kress said the Spectrum-X Ethernet-based networking solution it launched in May 2023 is “well on track to begin a multi-billion-dollar product line within a year.” The Spectrum-X platform includes Ethernet switches, optics, cables and network interface cards (NICs). Nvidia already has a multi-billion-dollar business in this space in the form of its Ethernet NIC product line. Kress said during Nvidia’s earnings call that “hundreds of customers have already adopted the platform,” and that Nvidia plans to “launch new Spectrum-X products every year to support demand for scaling compute clusters from tens of thousands of GPUs today to millions of GPUs in the near future.”
- With Spectrum-X, Nvidia will be competing at the system level with Arista, Cisco and Juniper, as well as with “bare metal” switches from Taiwanese ODMs running DriveNets Network Cloud software.
- With respect to high-performance Ethernet switching silicon, Nvidia’s competitors include Broadcom, Marvell, Microchip, and Cisco (which uses its Silicon One chips internally and also sells them on the merchant semiconductor market).
Image by Midjourney for Fierce Network
…………………………………………………………………………………………………………………………………………………………………………..
In November 2023, Nvidia said it would work with Dell Technologies, Hewlett Packard Enterprise and Lenovo to incorporate Spectrum-X capabilities into their compute servers. Nvidia is now targeting tier-2 cloud service providers and enterprise customers looking for bundled solutions.
Dell’Oro Group VP Sameh Boujelbene told Fierce Network that “Nvidia is positioning Spectrum-X for AI back-end network deployments as an alternative fabric to InfiniBand. While InfiniBand currently dominates AI back-end networks with over 80% market share, Ethernet switches optimized for AI deployments have been gaining ground very quickly.” Boujelbene added that Nvidia’s success with Spectrum-X thus far has largely been driven “by one major 100,000-GPU cluster, along with several smaller deployments by Cloud Service Providers.” By 2028, Boujelbene said, Dell’Oro expects Ethernet switches to surpass InfiniBand in the AI back-end network market, with revenues exceeding $10 billion.
………………………………………………………………………………………………………………………………………………………………………………
In a recent IEEE Techblog post we wrote:
While InfiniBand currently has the edge in the data center networking market, several factors point to increased Ethernet adoption for AI clusters in the future. Recent innovations are addressing Ethernet’s shortcomings compared to InfiniBand:
- Lossless Ethernet technologies
- RDMA over Converged Ethernet (RoCE)
- Ultra Ethernet Consortium’s AI-focused specifications
Some real-world tests have shown Ethernet offering up to a 10% improvement in job completion performance across all packet sizes compared to InfiniBand in complex AI training tasks. By 2028, it’s estimated that (1) 45% of generative AI workloads will run on Ethernet (up from under 20% now) and (2) 30% will run on InfiniBand (also up from under 20% now).
………………………………………………………………………………………………………………………………………………………………………………
References:
https://www.fierce-network.com/cloud/data-center-ethernet-nvidias-next-multi-billion-dollar-business
https://www.nvidia.com/en-us/networking/spectrumx/
Will AI clusters be interconnected via Infiniband or Ethernet: NVIDIA doesn’t care, but Broadcom sure does!
Data Center Networking Market to grow at a CAGR of 6.22% during 2022-2027 to reach $35.6 billion by 2027
LightCounting: Optical Ethernet Transceiver sales will increase by 40% in 2024
Will AI clusters be interconnected via Infiniband or Ethernet: NVIDIA doesn’t care, but Broadcom sure does!
InfiniBand, which has been used extensively for HPC interconnect, currently dominates AI networking, accounting for about 90% of deployments. That is largely due to its very low latency and an architecture that reduces packet loss, which is beneficial for AI training workloads: packet loss slows those workloads, and they’re already expensive and time-consuming. This is probably why Microsoft chose InfiniBand when building out its data centers to support machine learning workloads. However, InfiniBand tends to lag Ethernet in terms of top speeds. Nvidia’s very latest Quantum InfiniBand switch tops out at 51.2 Tb/s with 400 Gb/s ports. By comparison, Ethernet switching hit 51.2 Tb/s nearly two years ago and can support 800 Gb/s port speeds.
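To make the packet-loss point concrete, here is a toy Python model (our illustration, not a benchmark: it assumes each lost packet costs one retransmission plus a fixed recovery stall, and every parameter below is made up for illustration) of how even small loss rates stretch the network phase of a training step:

```python
# Toy model: how packet loss stretches the network phase of a training step.
# Assumes each lost packet costs one retransmission plus a fixed recovery
# stall; all parameters are illustrative, not measured.

def network_phase_seconds(bytes_per_step, link_gbps, loss_rate,
                          mtu=4096, stall_us=50.0):
    """Estimated time to move one step's gradient traffic over a lossy link."""
    packets = bytes_per_step / mtu
    base = bytes_per_step * 8 / (link_gbps * 1e9)    # loss-free transfer time
    lost = packets * loss_rate                       # expected lost packets
    per_loss = mtu * 8 / (link_gbps * 1e9) + stall_us * 1e-6
    return base + lost * per_loss

step_bytes = 10 * 2**30   # 10 GiB of gradient traffic per step (illustrative)
for loss in (0.0, 1e-5, 1e-4, 1e-3):
    t = network_phase_seconds(step_bytes, link_gbps=400, loss_rate=loss)
    print(f"loss={loss:.0e}  network phase ~ {t * 1e3:.1f} ms")
```

Under these made-up numbers, a 0.1% loss rate inflates the network phase by roughly 60%, which is the intuition behind both InfiniBand’s lossless design and the lossless-Ethernet work described below.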
While InfiniBand currently has the edge, several factors point to increased Ethernet adoption for AI clusters in the future. Recent innovations are addressing Ethernet’s shortcomings compared to InfiniBand:
- Lossless Ethernet technologies
- RDMA over Converged Ethernet (RoCE)
- Ultra Ethernet Consortium’s AI-focused specifications
Some real-world tests have shown Ethernet offering up to a 10% improvement in job completion performance across all packet sizes compared to InfiniBand in complex AI training tasks. By 2028, it’s estimated that (1) 45% of generative AI workloads will run on Ethernet (up from under 20% now) and (2) 30% will run on InfiniBand (also up from under 20% now).
In a lively session at Broadcom’s VMware Explore event, panelists were asked how best to network together the GPUs, and other data center infrastructure, needed to deliver AI. Broadcom’s Ram Velaga, SVP and GM of the Core Switching Group, was unequivocal: “Ethernet will be the technology to make this happen.” In his opening remarks, Velaga asked the audience, “Think about…what is machine learning and how is that different from cloud computing?” Cloud computing, he said, is about driving utilization of CPUs; with ML, it’s the opposite.
“No one…machine learning workload can run on a single GPU…No single GPU can run an entire machine learning workload. You have to connect many GPUs together…so machine learning is a distributed computing problem. It’s actually the opposite of a cloud computing problem,” Velaga added.
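Velaga’s “distributed computing problem” is visible in code: in data-parallel training, every GPU computes gradients locally and then joins a network-wide all-reduce on every step. Below is a minimal sketch using PyTorch’s torch.distributed (our sketch, assuming a host with multiple NVIDIA GPUs and the NCCL backend, which runs over either InfiniBand or RoCE/Ethernet):

```python
# Minimal sketch of the collective traffic behind data-parallel training.
# Launch with: torchrun --nproc_per_node=<num_gpus> allreduce_sketch.py
import os

import torch
import torch.distributed as dist

def main():
    # NCCL carries this traffic over InfiniBand or RoCE/Ethernet.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Stand-in for the gradient shard this GPU just computed.
    grad = torch.randn(1024, device="cuda")

    # Every rank contributes and receives the sum: this per-step collective
    # is exactly the traffic the InfiniBand-vs-Ethernet debate is about.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    grad /= dist.get_world_size()  # sum -> mean

    if dist.get_rank() == 0:
        print("averaged gradient norm:", grad.norm().item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Because this all-reduce sits on the critical path of every training step, a stall anywhere in the fabric stalls every GPU in the job, which is why fabric choice matters far more here than in ordinary cloud workloads.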
Nvidia (which acquired Israeli fabless interconnect chip maker Mellanox [1.] in 2019) says, “InfiniBand provides dramatic leaps in performance to achieve faster time to discovery with less cost and complexity.” Velaga disagrees, saying “InfiniBand is expensive, fragile and predicated on the faulty assumption that the physical infrastructure is lossless.”
Note 1. Mellanox specialized in switched fabrics for enterprise data centers and high-performance computing, where high data rates and low latency are required, such as in a computer cluster.
…………………………………………………………………………………………………………………………………………..
Ethernet, on the other hand, has been the subject of ongoing innovation and advancement since its inception. Velaga cited the following selling points:
- Pervasive deployment
- Open and standards-based
- Highest Remote Direct Memory Access (RDMA) performance for AI fabrics
- Lowest cost compared to proprietary tech
- Consistent across front-end, back-end, storage and management networks
- High availability, reliability and ease of use
- Broad silicon, hardware, software, automation, monitoring and debugging solutions from a large ecosystem
To that last point, Velaga said, “We steadfastly have been innovating in this world of Ethernet. When there’s so much competition, you have no choice but to innovate.” InfiniBand, he said, is “a road to nowhere.” It should be noted that Broadcom (which now owns VMware) is the largest supplier of Ethernet switching chips for every part of a service provider network (see diagram below). Broadcom’s Jericho3-AI silicon, which can connect up to 32,000 GPU chips together, competes head-on with InfiniBand!
Image Courtesy of Broadcom
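For a rough sense of what a 32,000-GPU fabric implies, here is a back-of-envelope Clos calculation (our sketch only: Broadcom’s actual Jericho3-AI design is a scheduled, cell-based fabric, not the plain two-tier leaf/spine assumed here):

```python
# Back-of-envelope: GPUs attachable by a non-blocking two-tier leaf/spine
# fabric built from k-port switches. Half of each leaf's ports face GPUs,
# half face spines; a spine's k ports allow up to k leaves.
def two_tier_hosts(k: int) -> int:
    down = k // 2              # leaf ports facing GPUs
    max_leaves = k             # one spine port per leaf
    return down * max_leaves   # = k**2 // 2

for k in (64, 128, 256):
    print(f"{k}-port switches -> up to {two_tier_hosts(k):,} GPUs")
```

With 256 effective ports per switch (e.g., a 51.2 Tb/s ASIC broken out as 256 x 200 GbE), two tiers reach 32,768 endpoints, the same order of magnitude as the 32,000-GPU figure above.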
………………………………………………………………………………………………………………………………………………………..
Conclusions:
While InfiniBand currently dominates AI networking, Ethernet is rapidly evolving to meet AI workload demands. The future will likely see a mix of both technologies, with Ethernet gaining significant ground due to its improvements, cost-effectiveness, and widespread compatibility. Organizations will need to evaluate their specific needs, considering factors like performance requirements, existing infrastructure, and long-term scalability when choosing between InfiniBand and Ethernet for AI clusters.
→ Well, it turns out that Nvidia’s Mellanox division in Israel makes BOTH InfiniBand AND Ethernet chips, so they win either way!
…………………………………………………………………………………………………………………………………………………………………………..
References:
https://www.perplexity.ai/search/will-ai-clusters-run-on-infini-uCYEbRjeR9iKAYH75gz8ZA
https://www.theregister.com/2024/01/24/ai_networks_infiniband_vs_ethernet/
Broadcom on AI infrastructure networking—’Ethernet will be the technology to make this happen’
https://www.nvidia.com/en-us/networking/products/infiniband/
https://www.nvidia.com/en-us/networking/products/ethernet/
Part1: Unleashing Network Potentials: Current State and Future Possibilities with AI/ML
Using a distributed synchronized fabric for parallel computing workloads- Part II
Part-2: Unleashing Network Potentials: Current State and Future Possibilities with AI/ML
AI Echo Chamber: “Upstream AI” companies huge spending fuels profit growth for “Downstream AI” firms
According to the Wall Street Journal, the AI industry has become an “Echo Chamber,” where huge capital spending by the AI infrastructure and application providers has fueled revenue and profit growth for everyone else. Market research firm Bespoke Investment Group has recently created baskets for “downstream” and “upstream” AI companies.
- The Downstream group involves “AI implementation” and consists of firms that sell AI development tools, such as the large language models (LLMs) popularized by OpenAI’s ChatGPT since the end of 2022, or run products that can incorporate them. It includes Google/Alphabet, Microsoft, Amazon and Meta Platforms, along with IBM, Adobe and Salesforce.
- Higher up the supply chain (the Upstream group) are the “AI infrastructure” providers, which sell AI chips, applications, data centers and training software. The undisputed leader is Nvidia, which has seen its sales triple in a year, but the group also includes other semiconductor companies, database developer Oracle and data center owners Equinix and Digital Realty.
The Upstream group of companies has posted profit margins far above what analysts expected a year ago. In the second quarter, pending Nvidia’s results on Aug. 28th, Upstream AI members of the S&P 500 are set to have delivered a 50% annual increase in earnings. For the remainder of 2024, they will be increasingly responsible for the profit growth that Wall Street expects from the stock market, even accounting for Intel’s huge problems and restructuring.
It should be noted that the lines between the two groups can be blurry, particularly for giants such as Amazon, Microsoft and Alphabet, which provide both AI implementation (e.g., LLMs) and infrastructure: their cloud-computing businesses turned these companies into the early winners of the AI craze last year and reported breakneck growth during this latest earnings season. A crucial point is that it is their role as ultimate developers of AI applications that has led them to make huge capital expenditures, which are responsible for the profit surge in the rest of the ecosystem. So there is a definite trickle-down effect, where the big tech players’ AI-directed CAPEX is boosting revenue and profits for the companies down the supply chain.
As the path to monetizing this technology gets longer and harder, the benefits seem to be increasingly accruing to companies higher up in the supply chain. Meta Platforms Chief Executive Mark Zuckerberg recently said the company’s coming Llama 4 language model will require 10 times as much computing power to train as its predecessor. Were it not for AI, revenues for semiconductor firms would probably have fallen during the second quarter, rather than rising 18%, according to S&P Global.
………………………………………………………………………………………………………………………………………………………..
A paper written by researchers from the likes of Cambridge and Oxford found that the large language models (LLMs) behind some of today’s most exciting AI apps may have been trained on “synthetic data,” i.e., data generated by other AI models. This revelation raises ethical and quality concerns. If an AI model is trained primarily or even partially on synthetic data, it might produce outputs lacking the richness and reliability of human-generated content. It could be a case of the blind leading the blind, with AI models reinforcing the limitations or biases inherent in the synthetic data they were trained on.
In this paper, the team coined the phrase “model collapse,” claiming that models trained this way will answer user prompts with low-quality outputs. The idea of “model collapse” suggests a sort of unraveling of the machine’s learning capabilities, in which it fails to produce outputs with the informative or nuanced characteristics we expect. This poses a serious question for the future of AI development. If AI is increasingly trained on synthetic data, we risk creating echo chambers of misinformation or low-quality responses, leading to less helpful and potentially even misleading systems.
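The mechanism is easy to demonstrate with a toy experiment (our illustration, far simpler than the paper’s LLM experiments: a Gaussian fit stands in for the model, and each generation trains only on samples drawn from the previous generation’s fit):

```python
# Toy "model collapse": fit a Gaussian to data, sample synthetic data from
# the fit, refit, and repeat. The fitted spread drifts toward zero, the
# analogue of generated text losing its richness over generations.
import numpy as np

rng = np.random.default_rng(0)
n, gens, trials = 100, 200, 200

final_stds = []
for _ in range(trials):
    data = rng.normal(0.0, 1.0, size=n)       # generation 0: "real" data
    for _ in range(gens):
        mu, sigma = data.mean(), data.std()   # "train" on current data
        data = rng.normal(mu, sigma, size=n)  # next gen sees synthetic only
    final_stds.append(data.std())

print(f"median std after {gens} synthetic generations: "
      f"{np.median(final_stds):.2f} (generation 0 had std 1.00)")
```

Each refit slightly underestimates the spread on average, and the errors compound: diversity lost in one generation is never recovered, which is the core of the model-collapse argument.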
……………………………………………………………………………………………………………………………………………
In a recent working paper, Massachusetts Institute of Technology (MIT) economist Daron Acemoglu argued that AI’s knack for easy tasks has led to exaggerated predictions of its power to enhance productivity in hard jobs. Also, some of the new tasks created by AI may have negative social value (such as design of algorithms for online manipulation). Indeed, data from the Census Bureau show that only a small percentage of U.S. companies outside of the information and knowledge sectors are looking to make use of AI.
References:
https://deepgram.com/learn/the-ai-echo-chamber-model-collapse-synthetic-data-risks
https://economics.mit.edu/sites/default/files/2024-04/The%20Simple%20Macroeconomics%20of%20AI.pdf
AI wave stimulates big tech spending and strong profits, but for how long?
AI winner Nvidia faces competition with new super chip delayed
SK Telecom and Singtel partner to develop next-generation telco technologies using AI
Telecom and AI Status in the EU
Vodafone: GenAI overhyped, will spend $151M to enhance its chatbot with AI
Data infrastructure software: picks and shovels for AI; Hyperscaler CAPEX
AI winner Nvidia faces competition with new super chip delayed
The Clear AI Winner Is: Nvidia!
Strong AI spending should help Nvidia make its own ambitious numbers when it reports earnings at the end of the month (its 2Q-2024 ended July 31st). Analysts are expecting nearly $25 billion in data center revenue for the July quarter—about what that business was generating annually a year ago. But the latest results won’t quell the growing concern investors have with the pace of AI spending among the world’s largest tech giants—and how it will eventually pay off.
In March, Nvidia unveiled its Blackwell chip series, succeeding its earlier flagship AI chip, the GH200 Grace Hopper Superchip, which combines an H100 GPU [1.] with an Arm CPU and additional memory and was designed to speed generative AI applications. The NVIDIA GH200 NVL2 fully connects two GH200 Superchips with NVLink, delivering up to 288GB of high-bandwidth memory, 10 terabytes per second (TB/s) of memory bandwidth, and 1.2TB of fast memory. The GH200 NVL2 offers up to 3.5X more GPU memory capacity and 3X more bandwidth than the NVIDIA H100 Tensor Core GPU in a single server for compute- and memory-intensive workloads.
Photo Credit: Nvidia
Note 1. The Nvidia H100 sits on a 10.5-inch graphics card, which is then bundled into a server rack alongside dozens of other H100 cards to create one massive data center computer.
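As a quick sanity check on the quoted multiples (our arithmetic, using the H100 SXM’s public specs of 80GB of HBM3 and roughly 3.35 TB/s of memory bandwidth):

```python
# If the GH200 NVL2's 288 GB and 10 TB/s are "3.5X more capacity and 3X more
# bandwidth" than an H100, the implied H100 figures should match its specs.
nvl2_mem_gb, nvl2_bw_tbs = 288, 10.0
print(f"implied H100 memory:    {nvl2_mem_gb / 3.5:.0f} GB    (actual: 80 GB)")
print(f"implied H100 bandwidth: {nvl2_bw_tbs / 3:.2f} TB/s (actual: ~3.35 TB/s)")
```

The implied figures (~82GB and ~3.33 TB/s) line up with the H100’s published specs, so the marketing multiples are internally consistent.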
This week, Nvidia informed Microsoft and another major cloud service provider of a delay in the production of its most advanced AI chip in the Blackwell series, The Information reported, citing a Microsoft employee and another person with knowledge of the matter.
…………………………………………………………………………………………………………………………………………
Nvidia Competitors Emerge – but are their chips ONLY for internal use?
In addition to AMD, Nvidia has several big tech competitors whose chips are currently not sold on the merchant semiconductor market. These include:
- Huawei has developed the Ascend series of chips to rival Nvidia’s AI chips, with the Ascend 910B chip as its main competitor to Nvidia’s A100 GPU chip. Huawei is the second largest cloud services provider in China, just behind Alibaba and ahead of Tencent.
- Microsoft has unveiled an AI chip called the Azure Maia AI Accelerator, optimized for artificial intelligence (AI) tasks and generative AI, as well as the Azure Cobalt CPU, an Arm-based processor tailored to run general-purpose compute workloads on the Microsoft Cloud.
- Last year, Meta announced it was developing its own AI hardware. This past April, it announced the next generation of its custom-made processor chips, designed for Meta’s AI workloads. The latest version significantly improves performance over the previous generation and helps power the ranking and recommendation ads models on Facebook and Instagram.
- Also in April, Google revealed the details of a new version of its data center AI chips and announced an Arm-based central processor. Google’s decade-old Tensor Processing Units (TPUs) are one of the few viable alternatives to the advanced AI chips made by Nvidia, though developers can only access them through Google Cloud Platform and cannot buy them directly.
As demand for generative AI services continues to grow, it’s evident that GPU chips will be the next big battleground for AI supremacy.
References:
AI Frenzy Backgrounder; Review of AI Products and Services from Nvidia, Microsoft, Amazon, Google and Meta; Conclusions
https://www.nvidia.com/en-us/data-center/grace-hopper-superchip/
https://www.theverge.com/2024/2/1/24058186/ai-chips-meta-microsoft-google-nvidia/archives/2
https://news.microsoft.com/source/features/ai/in-house-chips-silicon-to-service-to-meet-ai-demand/
AI wave stimulates big tech spending and strong profits, but for how long?
Big tech companies have made it clear over the last week that they have no intention of slowing down their stunning levels of spending on artificial intelligence (AI), even though investors are getting worried that a big payoff is further down the line than most believe.
In the last quarter, Apple, Amazon, Meta, Microsoft and Google’s parent company Alphabet spent a combined $59 billion on capital expenses, 63% more than a year earlier and 161% more than four years ago. A large part of that was funneled into building data centers and packing them with new computer systems to build artificial intelligence. Only Apple has not dramatically increased spending, because it does not build the most advanced AI systems and is not a cloud service provider like the others.
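Those growth rates pin down what the same five companies were spending before the AI build-out (our arithmetic from the figures above):

```python
# Implied prior capex if $59B is 63% above a year ago and 161% above 2020.
q2_2024 = 59e9
print(f"implied Q2-2023 capex: ${q2_2024 / 1.63 / 1e9:.1f}B")  # "63% more"
print(f"implied Q2-2020 capex: ${q2_2024 / 2.61 / 1e9:.1f}B")  # "161% more"
```

That works out to roughly $36 billion a year earlier and $23 billion four years ago, so the group’s quarterly capex is now about two-and-a-half times its 2020 level.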
At the beginning of this year, Meta said it would spend more than $30 billion in 2024 on new tech infrastructure. In April, it raised that to $35 billion. On Wednesday, it increased the figure to at least $37 billion, and CEO Mark Zuckerberg said Meta would spend even more next year. He said he’d rather build too fast “rather than too late” than allow his competitors to get a big lead in the AI race. Meta gives away the advanced AI systems it develops, but Zuckerberg said it is still worth it: “Part of what’s important about AI is that it can be used to improve all of our products in almost every way,” he said.
………………………………………………………………………………………………………………………………………………………..
This new wave of generative AI is incredibly expensive. The systems work with vast amounts of data and require sophisticated computer chips and new data centers to develop the technology and serve it to customers. The companies are seeing some sales from their AI work, but it is barely moving the needle financially.
In recent months, several high-profile tech industry watchers, including Goldman Sachs’s head of equity research and a partner at the venture firm Sequoia Capital, have questioned when, or if, AI will ever produce enough benefit to bring in the sales needed to cover its staggering costs. It is not clear that AI will come close to having the same impact as the internet or mobile phones, Goldman’s Jim Covello wrote in a June report.
“What $1 trillion problem will AI solve?” he wrote. “Replacing low wage jobs with tremendously costly technology is basically the polar opposite of the prior technology transitions I’ve witnessed in my 30 years of closely following the tech industry.”
“The reality right now is that while we’re investing a significant amount in the AI space and in infrastructure, we would like to have more capacity than we already have today,” said Andy Jassy, Amazon’s chief executive. “I mean, we have a lot of demand right now.”
That means buying land, building data centers and all the computers, chips and gear that go into them. Amazon executives put a positive spin on all that spending. “We use that to drive revenue and free cash flow for the next decade and beyond,” said Brian Olsavsky, the company’s finance chief.
There are plenty of signs the boom will persist. In mid-July, Taiwan Semiconductor Manufacturing Company, which makes most of the in-demand chips designed by Nvidia (the ONLY tech company that is now making money from AI – much more below) that are used in AI systems, said those chips would be in scarce supply until the end of 2025.
Zuckerberg said AI’s potential is super exciting: “It’s why there are all the jokes about how all the tech CEOs get on these earnings calls and just talk about AI the whole time.”
……………………………………………………………………………………………………………………
Big tech profits and revenue continue to grow, but will massive spending produce a good ROI?
Last week’s Q2-2024 results (with implied net margins computed after the list):
- Google parent Alphabet reported $24 billion net profit on $85 billion revenue.
- Microsoft reported $22 billion net profit on $65 billion revenue.
- Meta reported $13.5 billion net profit on $39 billion revenue.
- Apple reported $21 billion net profit on $86 billion revenue.
- Amazon reported $13.5 billion net profit on $148 billion revenue.
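A one-liner turns those reports into net margins (computed from the figures above):

```python
# Net margin = net profit / revenue, from the Q2-2024 list above ($B).
results = {
    "Alphabet":  (24.0,  85.0),
    "Microsoft": (22.0,  65.0),
    "Meta":      (13.5,  39.0),
    "Apple":     (21.0,  86.0),
    "Amazon":    (13.5, 148.0),
}
for company, (profit, revenue) in results.items():
    print(f"{company:<10} net margin: {profit / revenue:.0%}")
```

The spread is telling: Microsoft and Meta run net margins in the mid-30s, while Amazon’s roughly 9% reflects its low-margin retail business sitting alongside AWS.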
………………………………………………………………………………………………………………………………………………………..
References:
https://www.nytimes.com/2024/08/02/technology/tech-companies-ai-spending.html
https://www.axios.com/2024/08/02/google-microsoft-meta-ai-earnings
https://www.nvidia.com/en-us/data-center/grace-hopper-superchip/
AI Frenzy Backgrounder; Review of AI Products and Services from Nvidia, Microsoft, Amazon, Google and Meta; Conclusions
Amdocs and NVIDIA to Accelerate Adoption of Generative AI for $1.7 Trillion Telecom Industry
Amdocs and NVIDIA today announced they are collaborating to optimize large language models (LLMs) to speed adoption of generative AI applications and services across the $1.7 trillion telecommunications and media industries.(1)
Amdocs and NVIDIA will customize enterprise-grade LLMs running on NVIDIA accelerated computing as part of the Amdocs amAIz framework. The collaboration will empower communications service providers to efficiently deploy generative AI use cases across their businesses, from customer experiences to network provisioning.
Amdocs will use NVIDIA DGX Cloud AI supercomputing and NVIDIA AI Enterprise software to support flexible adoption strategies and help ensure service providers can simply and safely use generative AI applications.
Aligned with the Amdocs strategy of advancing generative AI use cases across the industry, the collaboration with NVIDIA builds on the previously announced Amdocs-Microsoft partnership. Service providers and media companies can adopt these applications in secure and trusted environments, including on premises and in the cloud.
With these new capabilities — including the NVIDIA NeMo framework for custom LLM development and guardrail features — service providers can benefit from enhanced performance, optimized resource utilization and flexible scalability to support emerging and future needs.
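For a flavor of what “guardrail features” look like in practice, here is a minimal sketch using NVIDIA’s open-source NeMo Guardrails library (our sketch only; Amdocs’ actual amAIz integration is not public, and the config directory and query below are hypothetical):

```python
# Minimal NeMo Guardrails sketch (pip install nemoguardrails).
# Assumes ./config holds a rails definition (model choice, flows, and the
# policies that keep the bot on approved telco topics), per the library docs.
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")  # guardrail rules live here
rails = LLMRails(config)

# A telco-flavored customer-care query; the rails screen the conversation
# before and after the underlying LLM is called.
response = rails.generate(messages=[{
    "role": "user",
    "content": "Why did my last invoice go up, and which plan should I switch to?",
}])
print(response["content"])
```

The point of the rails layer is that a service provider can bound what a generative assistant may say (topics, tone, data access) without retraining the underlying model.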
“NVIDIA and Amdocs are partnering to bring a unique platform and unmatched value proposition to customers,” said Shuky Sheffer, Amdocs Management Limited president and CEO. “By combining NVIDIA’s cutting-edge AI infrastructure, software and ecosystem and Amdocs’ industry-first amAIz AI framework, we believe that we have an unmatched offering that is both future-ready and value-additive for our customers.”
“Across a broad range of industries, enterprises are looking for the fastest, safest path to apply generative AI to boost productivity,” said Jensen Huang, founder and CEO of NVIDIA. “Our collaboration with Amdocs will help telco service providers automate personalized assistants, service ticket routing and other use cases for their billions of customers, and help the telcos analyze and optimize their operations.”
Amdocs counts more than 350 of the world’s leading telecom and media companies as customers, including 27 of the world’s top 30 service providers.(2) With more than 1.7 billion daily digital journeys, Amdocs platforms impact more than 3 billion people around the world.
NVIDIA and Amdocs are exploring a number of generative AI use cases to simplify and improve operations by providing secure, cost-effective and high-performance generative AI capabilities.
Initial use cases span customer care, including accelerating customer inquiry resolution by drawing information from across company data. On the network operations side, the companies are exploring how to proactively generate solutions that aid configuration, coverage or performance issues as they arise.
(1) Source: IDC, OMDIA, Factset analyses of Telecom 2022-2023 revenue.
(2) Source: OMDIA 2022 revenue estimates, excludes China.
Editor’s Note:
- Language models: These models, like OpenAI’s GPT-3, generate human-like text. The most popular language-based generative models are called large language models (LLMs).
- Large language models are being leveraged for a wide variety of tasks, including essay generation, code development, translation, and even understanding genetic sequences.
- Generative adversarial networks (GANs): These models use two neural networks: a generator, which produces candidate outputs, and a discriminator, which learns to distinguish generated outputs from real data.
- Unimodal models: These models only accept one data input format.
- Multimodal models: These models accept multiple types of inputs and prompts. For example, GPT-4 can accept both text and images as inputs (see the sketch after this list).
- Variational autoencoders (VAEs): These deep learning architectures are frequently used to build generative AI models.
- Foundation models: These models generate output from one or more inputs (prompts) given in the form of human-language instructions, and can be adapted to a wide range of downstream tasks.
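The unimodal/multimodal distinction is easiest to see at the API level. Below is a hedged sketch using the OpenAI Python SDK (assumes pip install openai and an OPENAI_API_KEY; the model name and image URL are illustrative placeholders):

```python
# Same chat endpoint, two kinds of prompts: text-only vs. text + image.
from openai import OpenAI

client = OpenAI()

# Unimodal: text in, text out.
text_only = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize RoCE in one sentence."}],
)

# Multimodal: mixed content parts in a single user message.
with_image = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this network diagram show?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/diagram.png"}},
        ],
    }],
)

print(text_only.choices[0].message.content)
print(with_image.choices[0].message.content)
```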
https://www.nvidia.com/en-us/glossary/data-science/generative-ai/
https://blogs.nvidia.com/blog/2023/01/26/what-are-large-language-models-used-for/
Cloud Service Providers struggle with Generative AI; Users face vendor lock-in; “The hype is here, the revenue is not”
Global Telco AI Alliance to progress generative AI for telcos
Bain & Co, McKinsey & Co, AWS suggest how telcos can use and adapt Generative AI
Generative AI Unicorns Rule the Startup Roost; OpenAI in the Spotlight
Generative AI in telecom; ChatGPT as a manager? ChatGPT vs Google Search
Generative AI could put telecom jobs in jeopardy; compelling AI in telecom use cases