Using a distributed synchronized fabric for parallel computing workloads- Part II
by Run Almog Head of Product Strategy, Drivenets (edited by Alan J Weissberger)
In Part I of this article, we covered the attributes of AI/HPC workloads and their impact on the requirements of the network that serves these applications. This concluding Part II focuses on an open standard solution that addresses these needs and enables these mega-sized applications to run larger workloads without compromising on network attributes. Various solutions are described and contrasted, along with a perspective from silicon vendors.
Networking for HPC/AI:
A networking solution serving HPC/AI workloads needs to carry certain attributes, starting with the scale of the network, which can reach thousands of high-speed endpoints that all run the same application in a synchronized manner. This requires the network to run like a scheduled fabric that offers full bandwidth between any group of endpoints at any given time.
Distributed Disaggregated Chassis (DDC):
DDC is an architecture that was originally defined by AT&T and contributed to the Open Compute Project (OCP) as an open architecture in September 2019. DDC defines the components and internal connectivity of a network element that is purposed to serve as a carrier grade network router. As opposed to the monolithic chassis-based router, the DDC defines every component of the router as a standalone device.
- The line card of the chassis is defined as a distributed chassis packet-forwarder (DCP)
- The fabric card of the chassis is defined as a distributed chassis fabric (DCF)
- The routing stack of the chassis is defined as a distributed chassis controller (DCC)
- The management card of the chassis is defined as a distributed chassis manager (DCM)
- All devices are physically connected to the DCM via standard 10GbE interfaces to establish a control and a management plane.
- All DCP are connected to all DCF via 400G fabric interfaces in a Clos-3 topology to establish a scheduled and non-blocking data plane between all network ports in the DDC.
- DCP hosts both fabric ports for connecting to DCF and network ports for connecting to other network devices using standard Ethernet/IP protocols while DCF does not host any network ports.
- The DCC is in fact a server and is used to run the main base operating system (BaseOS) that defines the functionality of the DDC
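The port arithmetic implied by this structure can be sketched in a few lines of Python. This is only an illustrative sizing model: the per-device port counts and the class and method names are hypothetical and are not taken from the OCP DDC specification. It assumes every DCP splits its fabric ports evenly across all DCFs, and that the cluster is non-blocking when a DCP's fabric-side bandwidth is at least its network-side bandwidth.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DdcCluster:
    """Hypothetical sizing model for a DDC cluster (all values are assumptions)."""
    num_dcp: int                  # packet forwarders (the "line cards")
    num_dcf: int                  # fabric devices (the "fabric cards")
    dcp_network_ports: int        # user-facing Ethernet ports per DCP
    dcp_fabric_ports: int         # fabric-facing ports per DCP
    dcf_fabric_ports: int         # fabric ports per DCF
    network_port_gbps: int = 400  # assumed network port speed
    fabric_port_gbps: int = 400   # fabric port speed, 400G as in the text

    def links_per_dcp_dcf_pair(self) -> int:
        # Every DCP connects to every DCF, so fabric ports are split evenly.
        return self.dcp_fabric_ports // self.num_dcf

    def is_wired_consistently(self) -> bool:
        # Each DCF must be able to terminate its share of links from every DCP.
        per_pair = self.links_per_dcp_dcf_pair()
        return per_pair >= 1 and per_pair * self.num_dcp <= self.dcf_fabric_ports

    def is_non_blocking(self) -> bool:
        # Fabric-side bandwidth of a DCP must cover its network-side bandwidth.
        used_fabric_ports = self.links_per_dcp_dcf_pair() * self.num_dcf
        fabric_bw = used_fabric_ports * self.fabric_port_gbps
        network_bw = self.dcp_network_ports * self.network_port_gbps
        return fabric_bw >= network_bw

    def total_network_ports(self) -> int:
        return self.num_dcp * self.dcp_network_ports


# Illustrative numbers only.
cluster = DdcCluster(num_dcp=8, num_dcf=4, dcp_network_ports=16,
                     dcp_fabric_ports=16, dcf_fabric_ports=32)
print(cluster.total_network_ports(), cluster.is_non_blocking())
```

Adding DCPs grows the number of network ports until the DCF port count is exhausted, at which point more (or larger) DCFs are needed. That is the sense in which the cluster can grow beyond what a single chassis enclosure holds.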
Advantages of the DDC are the following:
- Its capacity, since there is no metal chassis enclosure that needs to hold all these components in a single machine. This allows building a wider Clos-3 topology that expands beyond the boundaries of a single rack, making it possible for thousands of interfaces to coexist on the same network element (router).
- It is an open standard definition, which makes it possible for multiple vendors to implement the components. This makes it easier for the operator (Telco) to establish a multi-source procurement methodology and stay in control of price and supply chain within its network as it evolves.
- It is a distributed array of components, each of which can exist as a standalone device as well as act as part of the DDC. This gives a very high level of resiliency to services running over a DDC-based router vs. services running over a chassis-based router.
AT&T announced they use DDC clusters to run their core MPLS in a DriveNets based implementation and as standalone edge and peering IP networks while other operators worldwide are also using DDC for such functionality.
Figure 1: High level connectivity structure of a DDC
LC is defined as DCP above, Fabric module is defined as DCF above, RP is defined as DCC above, Ethernet SW is defined as DCM above
Source: OCP DDC specification
DDC implements the concept of disaggregation. Decoupling the control plane from the data plane enables sourcing the software and hardware from different vendors and assembling them back into a unified network element when deployed. This concept is rather new, but it already had many successful deployments before it was used as part of DDC.
Disaggregation in Data Centers:
Detaching the data plane from the control plane has seen major adoption in data center networks in recent years. Sourcing the software (control plane) from one vendor while the hardware (data plane) is sourced from a different vendor mandates that the interfaces between the software and hardware be very precise and well defined. This has given rise to a few components which were developed by certain vendors and contributed to the community, allowing the concept of disaggregation to go beyond the boundaries of implementation in specific customers' networks.
Such components include the open network install environment (ONIE), which enables mounting of the software image onto a platform (typically a single chip 1RU/2RU device), as well as the switch abstraction interface (SAI), which enables the software to directly access the application specific integrated circuit (ASIC) and operate directly on the data plane at line rate speeds.
Two examples of implementing disaggregation networking in data centers are:
- Microsoft, which developed its network operating system (NOS) software SONiC as one that runs on SAI and later contributed its source code to the networking community via OCP and the Linux Foundation.
- Meta, which has defined devices called “Wedge” that are purpose-built to host various NOS versions via standard interfaces.
These two examples of hyperscale companies are indicative of the engineering effort required to develop such interfaces and functions. The fact that such components have been made open is what enabled other, smaller consumers to enjoy the benefits of disaggregation without the need to maintain large engineering groups.
The data center networking world today has a healthy ecosystem with hardware (ASIC and system) vendors as well as software (NOS and tools) which make a valid and widely used alternative to the traditional monolithic model of vertically integrated systems.
There are two main reasons for deploying a disaggregated networking solution. The first is the clear financial advantage of buying white box equipment vs. branded devices, which carry a premium price. The second is flexibility: such a solution gives the customer better control over the network and how it is run, and gives network administrators a lot of room to innovate and adapt the network to their unique and changing needs.
The image below reflects a partial list of the potential vendors supplying components within the OCP networking community. The full OCP Membership directory is available at the OCP website.
Between DC and Telco Networking:
Data center networks are built to serve connectivity towards multiple servers which contain data or answer user queries. The size of the data, as well as the number of queries towards it, grows constantly as humanity increases its consumption of communication services. Traffic in and out of these servers is divided into north/south, which indicates traffic coming into and going out of the data center, and east/west, which indicates traffic that runs inside the data center between different servers.
As a general pattern, north/south traffic represents most of the traffic flows within the network, while east/west traffic represents most of the bandwidth being consumed. This is not an accurate description of data center traffic, but it is accurate enough to explain the way data center networks are built and operated.
A data center switch connects to servers with a high-capacity link. This tier#1 switch is commonly known as a top of rack (ToR) switch and is a high capacity, non-blocking, low latency switch with some minimal routing capabilities.
- The ToR is then connected to a Tier#2 switch that enables it to connect to other ToR in the data center.
- The Tier#2 switches are connected to Tier#3 to further grow the connectivity.
- Traffic volumes are mainly east/west and best kept within the same Tier of the network to avoid scaling the routing tables.
- In theory, a Tier#4/5/6 of this network can exist, but this is not common.
- The higher Tier of the data center network is also connected to routers which interface the data center to the outside world (primarily the Internet) and these routers are a different design of a router than the tiers of switching devices mentioned earlier.
- These externally facing routers are commonly connected in a dual homed logic to create a level of redundancy for traffic to come in and out of the datacenter. Further functions on the ingress and egress of traffic towards data centers are also firewalled, load-balanced, address translated, etc. which are functions that are sometimes carried by the router and can also be carried by dedicated appliances.
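The way each added tier grows the reach of such a network can be sketched with the textbook folded-Clos (fat-tree) formula. This is an idealized model, not a description of any specific deployment. It assumes identical switches of a given radix, no oversubscription (half of each switch's ports face down and half face up, except at the top tier), and one link per server. The function name is hypothetical.

```python
def max_servers(radix: int, tiers: int) -> int:
    """Upper bound on servers in a non-blocking folded Clos of identical switches.

    With k-port switches, each added tier multiplies the reach by k/2:
    1 tier -> k, 2 tiers -> k^2/2, 3 tiers -> k^3/4, i.e. 2 * (k/2)^tiers.
    """
    if radix < 2 or radix % 2 != 0:
        raise ValueError("radix must be an even number of ports, at least 2")
    if tiers < 1:
        raise ValueError("at least one tier is required")
    return 2 * (radix // 2) ** tiers


for n in (1, 2, 3):
    print(n, max_servers(radix=64, tiers=n))
```

Real data center designs usually oversubscribe the uplinks, so actual server counts per tier differ. The point is only that reach grows geometrically with the number of tiers, which is why more than three tiers are rarely needed.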
As data center density grew to allow a better service level to consumers, the amount of traffic running between data center instances also grew and data center interconnect (DCI) traffic became predominant. A DCI router on the ingress/egress point of a data center instance is now common practice, and these devices typically connect over larger distances of fiber (tens to hundreds of km), either towards other DCI routers or to Telco routers that form the infrastructure of the Internet.
While data center network devices shine in their high capacity and low latency, and are built from the ASIC level up through the NOS they run to optimize these attributes, they fall short in routing scale and in the distance they can span to their neighboring routers. Telco routers, however, are built to host enough routes to “host” the Internet (a ballpark figure used in the industry is 1M routes according to CIDR) and have a different buffer structure (both size and allocation) to enable long haul connectivity. A telco router has a superset of capabilities vs. a data center switch and is priced differently, due to the hardware it uses as well as the higher software complexity it requires, which acts as a filter that narrows down the number of vendors that provide such solutions.
Attributes of an AI Cluster:
As described in the previous article, HPC/AI workloads demand certain attributes from the network. Size, low latency, lossless behavior, high bandwidth and scale are all mandatory requirements, and some of the available solutions are described in the next paragraphs.
Chassis Based Solutions:
This solution derives from Telco networking.
Chassis-based routers are built as a black box with all internal connectivity concealed from the user. The architecture used to implement the chassis often uses line cards and fabric cards in a Clos-3 topology, as described earlier to depict the structure of the DDC. As a result, the chassis behavior is predictable and reliable. It is in fact a lossless fabric wrapped in sheet metal with only its network interfaces facing the user. The caveat of a chassis in this case is its size. While a well-orchestrated fabric is a great fit for the network needs of AI workloads, its limited capacity of a few hundred ports to connect to servers makes this solution fit only very small deployments.
When chassis are used at a scale larger than the number of ports in a single chassis, a Clos of chassis is required (this is in fact a non-balanced Clos-8 topology), and this breaks the fabric behavior of this model.
Standalone Ethernet Solutions:
This solution derives from data center networking.
As described previously in this paper, data center solutions are fast and can carry high bandwidth of traffic. They are, however, based on standalone single chip devices connected in a multi-tiered topology, typically a Clos-5 or Clos-7. As long as traffic runs only within the same device in this topology, the behavior of traffic flows will be close to uniform. With the number of interfaces per such device limited to the number of servers physically located in one rack, a single ToR device cannot satisfy the requirements of a large infrastructure. Expanding the network to higher tiers also means that traffic patterns begin to alter, and application run-to-completion time is impacted. Furthermore, add-on mechanisms are mounted onto the network to turn the lossy network into a lossless one. Another attribute of AI workloads is the uniformity of the traffic flows from the perspective of the packet header. This means that the different packets of the same flow will be identified by the data plane as the same traffic and carried over the exact same path regardless of the network’s congestion situation, leaving parts of the Clos topology poorly utilized while other parts can be overloaded to the point of traffic loss.
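The path-selection effect described above can be illustrated with a small Python sketch. It is a simplified model under stated assumptions: paths are chosen by hashing the flow's header fields (here with CRC32, picked only for determinism; real switches use their own hash functions), all flows carry the same number of packets, and the addresses and port numbers are made up. Per-packet spraying is shown purely as a contrast and is not claimed to be how any particular product works.

```python
import zlib

# Hypothetical 5-tuple: (src ip, dst ip, src port, dst port, protocol)
Flow = tuple


def ecmp_path(flow: Flow, num_paths: int) -> int:
    """Header-hash path selection: every packet of a flow gets the same path."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % num_paths


def load_with_flow_hashing(flows, packets_per_flow: int, num_paths: int) -> dict:
    load = {path: 0 for path in range(num_paths)}
    for flow in flows:
        load[ecmp_path(flow, num_paths)] += packets_per_flow
    return load


def load_with_packet_spraying(flows, packets_per_flow: int, num_paths: int) -> dict:
    """Contrast case: packets are spread round-robin, ignoring the header."""
    load = {path: 0 for path in range(num_paths)}
    for i in range(len(flows) * packets_per_flow):
        load[i % num_paths] += 1
    return load


# Four large flows over eight equal-cost paths (illustrative values).
flows = [(f"10.0.0.{i}", f"10.0.1.{i}", 50000, 4791, "udp") for i in range(4)]
hashed = load_with_flow_hashing(flows, packets_per_flow=1000, num_paths=8)
sprayed = load_with_packet_spraying(flows, packets_per_flow=1000, num_paths=8)
print("hashed :", hashed)
print("sprayed:", sprayed)
```

With only four flows, header hashing can occupy at most four of the eight paths, and hash collisions can concentrate even more traffic on fewer links. Spraying evens out the load, but on its own it can deliver packets out of order, which the receiving side then has to handle.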
Proprietary Locked Solutions:
Additional solutions in this field are implemented as a dedicated interconnect for a specific array of servers. This is more common in the scientific domain of heavy compute workloads, such as research labs, national institutes, and universities. As proprietary solutions, they force the customer into one interconnect provider that serves the entire server array starting from the server itself and ending on all other servers in the array.
The nature of this industry is such that a one-time budget is allocated to build a “super-computer”, which means that the resulting compute array is not expected to grow further but only to be replaced or surpassed by a newer model. This makes the vendor lock-in of choosing a proprietary interconnect solution more tolerable.
On the plus side of such solutions, they perform very well, and you can find examples on the top of the world’s strongest supercomputers list which use solutions from HPE (Slingshot), Intel (Omni-Path), Nvidia (InfiniBand) and more.
Perspective from Silicon Vendors:
DSF-like solutions were presented at the OCP Global Summit in October 2022 as part of the networking project discussions. Both Broadcom and Cisco (separately) have made claims of superior silicon implementation with improved power consumption or a superior implementation of a Virtual Output Queueing (VOQ) mechanism.
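For readers unfamiliar with VOQ, the general idea can be shown with a toy Python sketch. It does not represent either vendor's implementation. It assumes a single ingress port, one packet sent per round, and one output that is congested for the whole run. Packets are represented only by the name of their destination output, and the function names are hypothetical.

```python
from collections import deque


def drain_single_fifo(packets, blocked_output, rounds):
    """One FIFO per input: a head packet for a blocked output stalls the rest."""
    queue = deque(packets)
    sent = []
    for _ in range(rounds):
        if queue and queue[0] != blocked_output:
            sent.append(queue.popleft())
    return sent


def drain_voq(packets, blocked_output, rounds):
    """One virtual queue per output: only traffic for the blocked output waits."""
    voqs = {}
    for output in packets:
        voqs.setdefault(output, deque()).append(output)
    sent = []
    for _ in range(rounds):
        for output, queue in voqs.items():
            if queue and output != blocked_output:
                sent.append(queue.popleft())
                break  # one packet per round
    return sent


arrivals = ["A", "B", "C", "B"]  # destination output of each packet
print(drain_single_fifo(arrivals, blocked_output="A", rounds=4))
print(drain_voq(arrivals, blocked_output="A", rounds=4))
```

In the single-FIFO case the packet for the congested output A sits at the head of the queue and nothing else gets through (head-of-line blocking). With per-output queues the packets for B and C are still delivered. Real VOQ systems additionally let the egress side schedule or grant transmissions, which this sketch leaves out.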
There are differences between AI and HPC workloads and the required network for each.
While the HPC market finds proprietary implementations of interconnect solutions acceptable for building secluded supercomputers for specific uses, the AI market requires solutions that allow more flexibility in their deployment and vendor selection. This boils down to Ethernet based solutions of various types.
Chassis and standalone Ethernet based solutions provide reasonable solutions up to the scale of a single machine but fail to efficiently scale beyond a single interconnect machine and keep the required performance to satisfy the running workloads.
A distributed fabric solution presents a standard solution that matches the forecasted industry need in terms of both scale and performance. Different silicon implementations that can construct a DSF are available. They differ slightly, but all show substantial benefits vs. chassis or standard Ethernet solutions.
This paper does not cover the different silicon types implementing the DSF architecture but only the alignment of DSF attributes to the requirements from interconnect solutions built to run AI workloads and the advantages of DSF vs. other solutions which are predominant in this space.
Allied Market Research: Global AI in telecom market forecast to reach $38.8 billion by 2031 with CAGR of 41.4% (from 2022 to 2031)
Artificial Intelligence (AI) in telecom uses software and algorithms to estimate human perception in order to analyze big data, such as data consumption, call records, and application usage, to improve the customer experience. AI also helps telecommunication operators detect flaws in the network, improve network security, optimize the network and offer virtual assistance. Moreover, AI enables the telecom industry to extract insights from its vast data sets, making it easier to manage the daily business, resolve issues more efficiently and provide improved customer service and satisfaction.
The growing adoption of AI solutions in various telecom applications is driving market growth. The rising number of AI-enabled smartphones, which offer features such as image recognition, robust security, voice recognition and many more compared to traditional phones, is boosting the growth of the AI in telecommunication market. Furthermore, AI provides a simpler and easier interface for complex telecom processes and services. In addition, growing Over-The-Top (OTT) services, such as video streaming, have transformed the dissemination and consumption of audio and video content. With more consumers turning to OTT services, consumer demand for bandwidth has grown considerably. Carrying such ever-growing traffic from OTT services leads to high operational expenditure (OpEx) for the telecommunication industry. Hence, AI helps the telecom industry reduce operational costs by minimizing the human intervention needed for network configuration and maintenance.
However, the major restraint of the AI in telecommunication market is the incompatibility between telecommunication systems and AI technology. On the other hand, the increasing penetration of AI-enabled smartphones and the advent of 5G technology in smartphones are expected to provide major growth opportunities for the market. Advancements such as 5G technology in mobile and the rising need to monitor content on the telecommunication network to eliminate human error are also driving the growth of the market. For instance, the Chinese government is trying to improve its network and telecommunication services; hence China Telecom Corporation has started a new 5G base station in Lanzhou city. Therefore, these factors are expected to provide numerous opportunities for the expansion of the AI in telecommunication market during the forecast period.
Allied Market Research published a report, titled, “AI in Telecommunication Market by Component (Solution, Service), by Deployment Model (On-Premise, Cloud), by Technology (Machine Learning, Natural Language Processing (NLP), Data Analytics, Others), by Application (Customer Analytics, Network Security, Network Optimization, Self-Diagnostics, Virtual Assistance, Others): Global Opportunity Analysis and Industry Forecast, 2021-2031.”
According to the report, the global AI in telecommunication industry generated $1.2 billion in 2021, and is estimated to reach $38.8 billion by 2031, witnessing a CAGR of 41.4% from 2022 to 2031. The report offers a detailed analysis of changing market trends, top segments, key investment pockets, value chain, regional landscape, and competitive scenario.
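As a rough sanity check of how these figures relate, the standard compound growth formulas can be applied in a short Python sketch. Note the assumption: the report quotes its CAGR over 2022 to 2031 while the starting figure is 2021 revenue, so compounding over the ten years from 2021 to 2031 is a simplification made here and will not reproduce the report's numbers exactly. The function names are illustrative.

```python
def project(value: float, annual_growth_rate: float, years: int) -> float:
    """Compound a starting value forward at a constant annual growth rate."""
    return value * (1 + annual_growth_rate) ** years


def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by a start value and an end value."""
    return (end_value / start_value) ** (1 / years) - 1


# $1.2 billion in 2021 compounded at 41.4% per year for 10 years (assumed span).
projected_2031 = project(1.2, 0.414, years=10)
implied_rate = cagr(1.2, 38.8, years=10)
print(f"projected 2031 market: ${projected_2031:.1f} billion")
print(f"implied CAGR 2021-2031: {implied_rate:.1%}")
```

Under this assumption the projection lands at roughly $38 billion and the implied rate at a little over 41%, which is in line with the headline figures.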
Drivers, Restraints, and Opportunities:
The growing adoption of AI solutions in various telecom applications, the ability of AI to provide a simpler and easier interface in telecommunication and reduce the human intervention needed for network configuration and maintenance, and the growing demand for high bandwidth as more consumers turn to OTT services drive the growth of the global AI in telecommunication market. However, the incompatibility between telecommunication systems and AI technology hampers the global market growth. On the other hand, the increasing penetration of AI-enabled smartphones in the telecommunication industry and the advent of 5G technology in smartphones are likely to create potential opportunities for growth of the global market in the coming years.
- The global artificial intelligence in telecommunication market saw a stable growth during the COVID-19 pandemic, owing to the increasing digital penetration and rise in automation.
- Moreover, the pandemic led the telecommunications infrastructure to keep businesses, governments, and communities connected and operational. The social and financial disruption caused by the pandemic forced people to depend on technology such as AI for information and remote working.
- AI also helped the telecom industry to reinvent customer relationships by identifying personalized needs and engaging with customers through hyper-personalized one-to-one contacts. It also helped configure fixed-line and mobile-network bundles that combine VPN, teleconferencing, and productivity apps.
The solution segment to dominate in terms of revenue during the forecast period:
Based on component, the solution segment was the largest market in 2021, contributing to more than two-thirds of the global AI in telecommunication market, and is expected to maintain its leadership status during the forecast period. This is due to the adoption of solutions by various end users for the automated processes. On the other hand, the service segment is projected to witness the fastest CAGR of 44.9% from 2022 to 2031, due to surge in the adoption of managed and professional services.
The on-premise segment to garner the largest revenue during the forecast period:
Based on deployment model, the on-premise segment held the largest market share of nearly three-fifths of the global AI in telecommunication market in 2021 and is expected to maintain its dominance during the forecast period. This is because it provides added security of data. The cloud segment, however, is projected to witness the largest CAGR of 43.8% from 2022 to 2031, as cloud provides flexibility, scalability, complete visibility, and efficiency to all processes.
The machine learning segment to exhibit a progressive revenue growth during the forecast period:
Based on technology, the machine learning segment held the largest market share of more than two-fifths of the global AI in telecommunication market in 2021, and would maintain its dominance during the forecast period. This is because machine learning algorithms are designed to keep improving accuracy and efficiency. The data analytics segment, however, is projected to witness the largest CAGR of 46.1% from 2022 to 2031, as it helps telecom companies to increase profitability by optimizing network usage and services.
Asia-Pacific to maintain its leadership in terms of revenue by 2031:
Based on region, North America was the largest market in 2021, capturing more than one-third of the global AI in telecommunication market. The growth in the region can be attributed to the infrastructure development and technology adoption in countries like the U.S. and Canada. However, the market in Asia-Pacific is expected to lead in terms of revenue and manifest the fastest CAGR of 45.7% during the forecast period, owing to the growing digital and economic transformation of the region.
Leading Market Players:
- Intel Corporation
- Nuance Communications, Inc.
- Infosys Limited
- ZTE Corporation
- IBM Corporation
- Google LLC
- Salesforce, Inc.
- Cisco Systems, Inc.
The report analyzes these key players of the global AI in telecommunication market. These players have adopted various strategies such as expansion, new product launches, partnerships, and others to increase their market penetration and strengthen their position in the industry. The report is helpful in determining the business performance, operating segments, product portfolio, and developments by every market player.
IHS Markit: Telecom Revenue +1.1%; CAPEX -1.8% in 2017
Despite unabated exponential growth in network usage, global telecom revenue is on track to grow just 1.1 percent in 2017 over the prior year, according to a new report by business information provider IHS Markit.
Global economic growth prospects, meanwhile, are looking up. IHS Markit macroeconomic indicators point to moderate global economic growth of 3.2 percent for 2017, up from 2.5 percent in 2016, and world real gross domestic product (GDP) is projected to increase 3.2 percent in 2018 and 3.1 percent in 2019.
“Although the telecom sector has been resilient, revenue growth in developed and developing economies has slowed dramatically due to saturation and fierce competition,” said Stéphane Téral, executive director of research and analysis and advisor at IHS Markit. “At this point, every region is showing revenue growth in the low single digits when not declining, and there is no direct positive correlation between slow economic expansion and anemic telecom revenue growth or decline as seen year after year in Europe, for instance.”
China alone is tamping down global telecom capex in 2017:
IHS Markit forecasts a 1.8 percent year-over-year decline in global telecom capital expenditures (capex) in 2017, mainly a result of a 13 percent year-over-year falloff in Chinese telecom capex. Asia Pacific outspends every other region in the world on telecom equipment.
“Call it precision investment, strategically focused investment or tactical investment, but all three of China’s service providers — China Mobile, China Unicom and China Telecom — scaled back their 2017 spending plans, and the end result is another double-digit drop in China’s telecom capex bucket, with mobile infrastructure hit the hardest,” Téral said. “Bringing down capital intensity to reasonable levels of 15 to 20 percent is the chief goal of these operators.”
The virtualization trend:
A transformation is underway in service provider networks, epitomized by software-defined networking (SDN) and network functions virtualization (NFV), which involve the automation of processes such as customer interaction, as well as the addition of more telemetry and analytics with feedback loops into network operations, operations and business support systems, and service assurance.
“Many service providers have deployed new architectural options — including content delivery networks, distributed broadband network gateways, distributed mini data centers in smart central offices, and video optimization,” said Michael Howard, executive director of research and analysis for carrier networks at IHS Markit. “Nearly all operators are madly learning how to use SDN and NFV, and the growing deployments today bring us to declare 2017 as The Year of SDN and NFV.”
Data is the new oil, and AI is the engine:
Big data is becoming more manageable, and operators are leveraging subscriber and network intelligence to support the automation and optimization of their networks using SDN, NFV and initial forays into using analytics, including artificial intelligence (AI) and machine learning (ML).
“Forward-thinking operators are experimenting with how to use anonymized subscriber data and analytics to create targeted services and broker this information to third parties such as retailers and internet content providers like Google,” Téral said. “No matter their size, market or current level of digitization, service providers need to rethink their roles in the new age of information and reset the strategies needed to capitalize on this opportunity.”
Note 1. The Telecom Trends & Drivers Market Report is published twice annually by IHS-Markit to provide analysis of global and regional market trends and conditions affecting service providers, subscribers, and the global economy. These roughly 40-page reports assess the state of the telecom industry, telling the story of what’s going on now and what we expect in the near and long term, illustrated with charts, graphs, tables, and written analysis. These critical analysis reports are a foundation piece for all market forecasts.
The reports include top takeaways on the economic health of the global telecom/datacom space; regional and global trends, drivers, and analysis for the service provider network sector in the context of the overall economy; financial analysis of the world’s top 10 service providers (revenue growth, capital intensities, free cash flow, debt level); regional enterprise and carrier spending trends; top-level service provider and subscriber forecasts; macroeconomic drivers; and key economic statistics (e.g., unemployment, OECD indicators, GDP growth). The reports are informed by all of IHS Technology research, from market share and forecasts to surveys with telecom service providers and small, medium, and large businesses.
The chart below from Bharti Airtel (India’s largest telecom company) shows that telecom industry revenue has declined in 2017 Q2, Q3, and Q4 with only Q1 showing positive growth.
Optical Network Equipment Vendors:
In a service provider survey report on optical networking and equipment vendors, IHS-Markit found Ciena, Huawei and Nokia to be the three most popular optical networking equipment vendors. The report also highlighted that Data Center Interconnection (DCI) is a huge growth opportunity.
IHS-Markit predicts DCI will be a significant driver for the optical equipment market, surging from 19 percent of overall equipment sales at mid-2017 to nearly 30 percent by 2021.
Ciena was deemed the top DCI vendor by 39 percent of those surveyed by IHS-Markit. Cisco, Coriant, and Infinera each garnered 36 percent of the votes. Last year Ciena reportedly won a DCI deal from rival ADVA Optical, which had a negative impact on ADVA’s operational results.
Ciena also topped the list of top (optical) transport software-defined networking (SDN) vendors, with 46 percent of those surveyed citing the company as a leader in the segment. Adams noted that while this market was still in its early days, Ciena’s continued integration of its Blue Planet software platform with its optical equipment products was driving differentiation in the market.
Cisco attracted the second most votes in terms of transport SDN leadership, followed by Nokia and Infinera.