The always excellent Hot Interconnects 2015 conference was held in Santa Clara, CA, August 26-28, 2015. This article summarizes presentations and a panel session relevant to Data Center and Wide Area Networking.
Facebook Panel Participation & Intra-DC invited talk by Katharine Schmidtke:
A 90-minute panel session on “HPC vs. Data Center Networks” raised more questions than it answered. While comprehensively covering that panel is beyond the scope of this article, we highlight a few takeaways along with the comments and observations made by Facebook’s Katharine Schmidtke, PhD.
- According to Mellanox and Intel panelists, InfiniBand is used to interconnect equipment in HPC environments, but ALSO in large DC networks where extremely low latency is required. We had thought that 100% of DCs used 1G/10G/40G/100G Ethernet to connect compute servers to switches and switches to each other. That might be closer to 90 or 95%, with InfiniBand and proprietary connections making up the rest.
- Another takeaway was that ~80 to 90% of cloud DC traffic is now East-West (server to server within the DC, via a switch/router) instead of North-South (traffic entering or leaving the DC, between external clients and servers) as it had been for many years.
- Katharine Schmidtke, PhD talked about Facebook’s intra-DC optical network strategy. Katharine is responsible for Optical Technology strategy at Facebook. [She received a PhD in non-linear optics from Southampton University in the UK and did postdoctoral work at Stanford University.]
- There are multiple FB DCs within each region.
- Approximately 83% of active daily FB users reside outside the US and Canada.
- Connections between DCs are called Data Center Interconnects (DCIs). There’s more traffic within a FB DC than in a DCI.
- Fabric, first revealed last November, is the next-generation Facebook DC network. It’s a single high-performance network, instead of a hierarchically oversubscribed system of clusters.
- Wedge, also introduced in 2014, is a Top of Rack (ToR) switch with 16 to 32 40G Ethernet ports. It was described as the first building block for FB’s disaggregated switching technology. Its design was the first “open hardware switch” spec contribution to the Open Compute Project (OCP) at their 2015 annual meeting. Facebook also announced at that same OCP meeting that it’s opening its central library of FBOSS, the software behind Wedge.
- Katharine said FB was in the process of moving from Multi-Mode Fiber (MMF) to Single Mode Fiber (SMF) for use within its DC networks, even though SMF has been used almost exclusively for telco networks with much larger reach/distance requirements. She said CWDM4 over duplex SMF was being implemented in FB’s DC networks (more details in the next section).
- In answer to a question, Katharine said FB had no need for (photonic) optical switching.
Facebook Network Architecture & Impact on Interconnects:
FB’s newest DC, which went on-line Nov 14, 2014, is in Altoona, IA, which is just north of Interstate Highway 80. It’s a huge nondescript building which is 476K square feet in area. It’s cooled using outside air, uses 100% renewable energy and is very energy-efficient in terms of overall power consumption (more on “power as pain point” below). Connectivity between DC switches is via 40G Ethernet over MMF in the “data hall.”
Fabric (see above description) has been deployed in the Altoona DC. Because it “dis-aggregates” (i.e. breaks down) functional blocks into smaller modules or components, Fabric results in MORE INTERCONNECTS than in previous DC architectures.
As noted in the previous section, FB has DCs in five (soon to be seven) geographic regions, with multiple DCs per region.
100G Ethernet switching, using QSFP28 (Quad Small Form-factor Pluggable; see Note 1) optical transceivers, will be deployed in 2016, according to Katharine. The regions or DCs to be upgraded to 100G speeds were not disclosed.
Note 1. The QSFP28 form factor has the same footprint as the 40G QSFP+. The “Q” stands for “Quad.” Just as the 40G QSFP+ is implemented using four 10-Gbps lanes or paths, the 100G QSFP28 is implemented with four 25-Gbps lanes or paths.
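The lane arithmetic in Note 1 can be checked with a few lines of Python. This is only an illustrative sketch: the `LANES` table and the `module_rate_gbps` helper are hypothetical names, and the lane counts and per-lane rates are simply the ones stated in the note.

```python
# Nominal lane layout per form factor, as described in Note 1.
# The names below are illustrative, not taken from any vendor library.
LANES = {
    "QSFP+": {"lanes": 4, "gbps_per_lane": 10},
    "QSFP28": {"lanes": 4, "gbps_per_lane": 25},
}

def module_rate_gbps(form_factor: str) -> int:
    """Aggregate nominal rate = number of lanes x per-lane rate."""
    spec = LANES[form_factor]
    return spec["lanes"] * spec["gbps_per_lane"]

if __name__ == "__main__":
    for name in LANES:
        print(f"{name}: {module_rate_gbps(name)} Gbps")
```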
Cost-efficient SMF optics is expected to drive the price down to $1/Gbit/sec very soon. SMF was said to be “future proofing” FB’s intra-DC network (see Note 2), in terms of both future cost and ease of installation. The company only needs a maximum reach of 500m within any given DC, even though SMF is spec’d at 2km. Besides reach, FB relaxed other optical module requirements like temperature and lifetime/reliability. A “very rapid innovation cycle” is expected, Katharine said.
Note 2. Facebook’s decision to use SMF was the result of an internal optical interconnects study. The FB study considered multiple options to deliver greater bandwidth at the lowest possible cost for its rapidly growing DCs. The 100G SMF spec is primarily for telcos as it supports both 10km and 2km distances between optical transceivers. That’s certainly greater reach than needed within any given DC. FB will use the 2km variant of the SMF spec, but only up to 500m. “If you are at the edge of optical technology, relaxing just a little brings down your cost considerably,” Dr. Schmidtke said.
A graph presented by Dr. Schmidtke, and shown in EE Times, illustrates that SMF cost is expected to drop sharply from 2016 to 2022. Facebook intends to move the optical industry to new cost points using SMF with compatible optical transceivers within its DCs. The SMF can also be depreciated over many years, Katharine said.
FB’s deployed optical transceivers will support Coarse Wavelength Division Multiplexing 4 (CWDM4) Multi-Source Agreement over duplex SMF. CWDM4 is a spec for 4 x 25G Ethernet modules and is supported by vendors such as Avago, Finisar, JDSU, Oclaro and Sumitomo Electric.
CWDM4 over duplex SMF was positioned by Katharine as “a new design and business approach” that drives innovation, not iteration. “Networking at scale drives high volume, 100s of thousands of fast (optical) transceivers per DC,” she said.
Other interesting points in answer to audience questions:
- Patch panels (which interconnect the fibers) make up a large part of intra-DC optical network system cost. For more on this topic, here’s a useful guide to fiber optics and premises cabling.
- Power consumed in switches and servers can’t keep scaling up with bandwidth consumption. For example, if you double the bandwidth, you CAN’T double the power consumed! Therefore, it’s critically important to hold the power footprint constant as the bandwidth is increased.
- More power is consumed by the Ethernet switch chip than by an optical transceiver module.
- Supplying large amounts of power into a mega DC is the main pain point for the DC owner (in addition to the cost of electricity/power there are significant cooling costs as well).
- FB is planning to move fast to 100G (in 2016) and to 400G Ethernet networks beyond that time-frame. There may be a “stop over” at 200G before 400G is ready for commercial deployment, Katharine said in answer to a question from this author.
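The power constraint raised in the bullets above can be restated as energy per bit: if the power budget is held constant while bandwidth doubles, the energy spent per bit must halve. The short Python sketch below illustrates that relationship. The 100 W budget and the function name are hypothetical values chosen for illustration; they are not Facebook data.

```python
def picojoules_per_bit(power_watts: float, bandwidth_gbps: float) -> float:
    """Energy per bit in pJ: watts / (bits per second), scaled to picojoules."""
    bits_per_second = bandwidth_gbps * 1e9
    joules_per_bit = power_watts / bits_per_second
    return joules_per_bit * 1e12

# Hypothetical power budget, held constant across Ethernet generations.
POWER_BUDGET_W = 100.0

if __name__ == "__main__":
    for gbps in (40, 100, 200, 400):
        pj = picojoules_per_bit(POWER_BUDGET_W, gbps)
        print(f"{gbps}G at {POWER_BUDGET_W:.0f} W -> {pj:.1f} pJ/bit")
```

Under a fixed budget, each doubling of bandwidth (for example 100G to 200G, or 200G to 400G) requires the switch and optics to become twice as efficient per bit.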
Recent Advances in Machine Learning and their Application to Networking, David Meyer of Brocade:
This excellent keynote speech by David Meyer, CTO & Chief Scientist at Brocade, was very refreshing. It demonstrated that real research is being done by a Silicon Valley company other than Google!
Machine learning currently spans a wide variety of applications, including perceptual tasks such as image search, object and scene recognition and captioning, voice and natural language (speech) recognition and generation, self-driving cars and automated assistants such as Siri, as well as various engineering, financial, medical and scientific applications. However, almost none of this applied research has spilled over into the networking space. David believes there’s a huge opportunity there, especially in predicting incipient network node/link failures. He also talked about Machine Learning (ML) tools for DevOps/network operations (see below).
- OpenConfig (started by Google) aims to specify a vendor neutral/independent configuration management system. That management system has a big ML component from a telemetry configuration model.
- OPNFV consortium is specifying Operating System components to realize a Network Function Virtualization (NFV) system. There’s a Predictor module that includes an intelligence training system.
- One can envision a network as a huge collection of sensors that form a multi-dimensional vector space. The data collected is ideal for analysis/learning via deep neural networks.
- There are predictive and reactive roles for ML in network management and control.
- “We are at the beginning of a network intelligence revolution,” David said.
- ML tools for DevOps: domain knowledge is needed from an analytics platform, which should include a recommendation system.
- Application profiling was cited as an example of how to build tools for a DevOps environment: 1) Predict congestion for a given application. 2) Correlate with queue length to avoid dropped packets. 3) Detect anomalies, i.e. patterns that don’t conform to expected behavior (if that behavior can be defined).
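The anomaly detection step in the application-profiling example can be sketched with a simple rolling z-score over queue-length samples. This is a deliberately simple stand-in, not the method presented in the keynote (which pointed toward deep neural networks). The function name, window size, threshold and sample data are all hypothetical.

```python
from collections import deque
from statistics import mean, pstdev

def detect_anomalies(samples, window=5, threshold=3.0):
    """Return indices whose value deviates from the mean of the previous
    `window` samples by more than `threshold` standard deviations."""
    history = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(samples):
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            # Guard against a perfectly flat window (sigma == 0).
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(i)
        history.append(value)
    return anomalies

if __name__ == "__main__":
    # Made-up queue lengths (in packets) with one obvious burst.
    queue_lengths = [10, 12, 11, 13, 12, 11, 95, 12, 10, 11]
    print(detect_anomalies(queue_lengths))
```

A flagged index could then be correlated with per-application traffic to decide whether to reroute or rate-limit before packets are dropped. Note that the burst itself inflates the window statistics for the next few samples, which is one reason a production system would want something more robust than this sketch.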
Future of ML – What’s Next:
- Deep neural nets that learn computation functions.
- More emphasis on control: analyzing sophisticated time series.
- Long range dependencies via reinforcement learning.
- Will apply to compute, storage, network, sensors, and energy management.
- Huge application in networking will be predictive failure analysis (and re-route BEFORE the failure actually occurs).
Software Defined WANs, a tutorial by Inder Monga of ESnet & Srini Seetharaman of Infinera:
This was a terrific “tag team” lecture/discussion by Inder & Srini who alternated describing each slide/diagram. We present selected highlights below.
Inder summarized many fundamental problems in all facets of WANs:
- Agility requirements are not met for WAN provisioning (sometimes takes days or weeks to provision a new circuit or IP-MPLS VPN)
- Traditional wide-area networking is inflexible, opaque and expensive
- WAN resources are not efficiently utilized (over-provisioning prevails)
- Interoperability issues across vendors, layers and domains reduce the chance of automation
- Hard to support new value propositions, like: Route selection at enterprises, Dynamic peering at exchanges, Auto bandwidth and bandwidth calendaring, Mapping elephant (very large) data flows to different Flexi-Grid channels
Srini commented that the Network Virtualization (NV)/overlay model has more market traction than the pure SDN/OpenFlow model.
Overlay networks run as independent virtual networks on top of a (real) physical network infrastructure. These virtual network overlays allow cloud service and DC providers to provision and orchestrate networks alongside other virtual resources (like compute servers). They also offer a new path to converged networks and programmability. However, network overlays shouldn’t be confused with “pure SDN” which doesn’t permit overlays or network virtualization. [We’ve previously described both of these “SDN” approaches in multiple articles at viodi.com and techblog.comsoc.org]
Several vendors provide NV software on compute servers running in DCs (e.g. VMware, Nuage Networks, Juniper, etc). They support VXLAN for tunneling L2 frames within a DC network (in lieu of VLANs) and then map VXLAN frames to IP-MPLS packets for inter-DC transport. However, none of those NV software vendors interoperate with other vendors on an end-to-end basis. That confirms again that at least the NV version of SDN is not really “open,” as the same vendor’s NV software must be used on the compute servers.
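To make the VXLAN tunneling step concrete, the Python sketch below builds and parses the 8-byte VXLAN header defined in RFC 7348: one flags byte with the I bit set (0x08), three reserved bytes, a 24-bit VXLAN Network Identifier (VNI), and one more reserved byte. The function names and the sample VNI are illustrative and do not come from any vendor's NV software. A real encapsulation would also wrap the result in outer UDP (destination port 4789), IP and Ethernet headers, which are omitted here.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag: the VNI field is valid (RFC 7348)

def build_vxlan_header(vni: int) -> bytes:
    """Pack the 8-byte VXLAN header for a 24-bit VNI."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # First 32-bit word: flags byte followed by 24 reserved bits.
    # Second 32-bit word: 24-bit VNI followed by 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)

def parse_vni(header: bytes) -> int:
    """Extract the VNI from an 8-byte VXLAN header."""
    word0, word1 = struct.unpack("!II", header[:8])
    if not (word0 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag not set")
    return word1 >> 8

def encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """Prepend the VXLAN header to an inner L2 frame (outer headers omitted)."""
    return build_vxlan_header(vni) + inner_frame

if __name__ == "__main__":
    packet = encapsulate(b"\x00" * 14, vni=5001)  # dummy 14-byte Ethernet header
    print(packet[:8].hex(), parse_vni(packet))
```

The 24-bit VNI allows roughly 16 million isolated segments, compared with 4094 usable IDs in the 12-bit VLAN tag, which is a main reason overlays use VXLAN in lieu of VLANs.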
Gartner Group finds that SDN in general (including all the myriad versions, twists and tweaks) is approaching the bottom of the “trough of disillusionment” after falling hard from the peak of inflated expectations that was built up due to all the hype and BS. This is illustrated in the graph below:
It’s interesting to note that SD-WANs, which have a much broader connotation than SDN for WANs, continue to ramp up the innovation trigger curve. They’ve yet to reach their peak of excitement and/or hype. White box switches, which we think are the future of true open networking, are on the downward path towards disillusionment, according to Gartner.
We totally disagree as we see years of tremendous potential ahead for open networking software running on bare metal switches (made by ODMs in China and Taiwan).
In closing, we note that National Research & Education Networks (NRENs) have deployed an East-West interface for multi-domain SDN – something we’ve screamed was missing from ONF specified SDN specs for a long time! Please refer to Dan Pitt’s remarks on that topic during my interview with him at the 2015 Open Networking Summit.
The NREN East-West/multi-domain interface is evidently based on a Network Services Interface (NSI) spec from the Open Grid Forum.
The OGF-NSI document’s Introduction states:
“NSI is designed to support the creation of circuits (called Connections in NSI) that transit several networks managed by different providers. Traditional models of circuit services and control planes adopt a single very tightly defined data plane technology, and then hard code these service attributes into the control plane protocols. Multi-domain services need to be employed over heterogeneous data plane technologies.”
Kudos to Inder and Srini for looking through all the marketing hype and identifying WAN problems, along with some potential solutions that new software might provide. The one that I’m most enthusiastic about is the OpenConfig project (described above in the Machine Learning section) for vendor neutral configuration. Its purpose and functions are described in this tutorial article.