The rollout of 200 Gbps networking began in earnest this week with Mellanox’s unveiling of its initial HDR InfiniBand portfolio of Quantum switches, ConnectX-6 adapters, and LinkX cables. Although none of the products will be available until 2017, the imminent move to 200 Gbps is set to leapfrog the competition, in particular, Intel, with its current 100 Gbps Omni-Path technology.
The doubling of the bandwidth from the current EDR product set will be a welcome boost for many HPC workloads that are currently bound by network speeds. Among these are applications in cosmology, physics simulations, homeland security, and brain modeling. In fact, some of these users are already looking forward to even higher speeds than HDR, according to Mellanox marketing VP Gilad Shainer. “We’re already starting to work on 400 gigabits per second,” he says.
Shainer also pointed to machine learning as an area that will benefit from the HDR bandwidth, given its demand for high levels of node-to-node communication. Baidu, Flickr, Facebook, Netflix and others have already installed InfiniBand in most, if not all, of their machine learning clusters, according to Shainer. “InfiniBand will become, as we see today, the de facto solution in those areas,” he maintains.
Looking at the raw numbers, the new hardware is impressive. The HDR InfiniBand switch, known as Quantum, will provide 40 ports of 200 Gbps or 80 ports of 100 Gbps. At the higher speed, it will support up to 390 million messages/second per port. That’s more than two times the capacity of the current EDR switch, which provides 36 ports at 100 Gbps. Latency is on the order of 90 nanoseconds, which is marginally better than the 100-110 nanoseconds of the current Intel Omni-Path switches.
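The "more than two times" comparison can be checked with some quick arithmetic. The snippet below is only a rough sketch: it treats capacity as port count times per-port speed in one direction, which is a simplifying assumption rather than a vendor-defined metric, and the function name is made up for illustration.

```python
def aggregate_gbps(ports: int, gbps_per_port: int) -> int:
    """One-direction aggregate bandwidth: ports times per-port speed (a simplification)."""
    return ports * gbps_per_port

quantum_hdr = aggregate_gbps(40, 200)   # Quantum in 40 x 200 Gbps mode
quantum_split = aggregate_gbps(80, 100) # Quantum in 80 x 100 Gbps mode
edr = aggregate_gbps(36, 100)           # current EDR switch

ratio = quantum_hdr / edr

# Per-port message rate quoted in the article, summed over all 40 HDR ports.
total_messages_per_sec = 390_000_000 * 40

print(f"Quantum: {quantum_hdr} Gbps, EDR: {edr} Gbps, ratio: {ratio:.2f}x")
print(f"Aggregate message rate: {total_messages_per_sec / 1e9:.1f} billion/s")
```

Under that assumption, both Quantum port configurations add up to the same 8,000 Gbps, against 3,600 Gbps for the EDR switch.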
If you use the 100 Gbps capability in the Quantum switch, you can hook together up to 128,000 nodes in a fat-tree topology with just three levels of switches. The bottom line is that you can reduce the number of switches and cables by half if you take advantage of the 80-port set-up, saving money in both up-front and operational cost. And if you need the higher speed for your application, the advantage of faster throughput can have its own value.
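The 128,000-node figure follows from the standard sizing rule for a non-blocking fat tree built from identical switches: with radix k and n levels, the maximum number of end nodes is 2 × (k/2)^n. The sketch below applies that textbook formula; the function name is illustrative, and real deployments may use different blocking ratios or topologies.

```python
def max_fat_tree_nodes(radix: int, levels: int) -> int:
    """Maximum end nodes in a non-blocking fat tree of identical switches.

    Each switch below the top level splits its ports evenly between
    downlinks and uplinks, giving 2 * (radix / 2) ** levels end nodes.
    """
    if radix % 2 != 0:
        raise ValueError("radix must be even to split ports into up/down halves")
    return 2 * (radix // 2) ** levels

configs = [
    ("Quantum, 80 ports at 100 Gbps", 80),
    ("Quantum, 40 ports at 200 Gbps", 40),
    ("EDR switch, 36 ports at 100 Gbps", 36),
]

for label, radix in configs:
    print(f"{label}: up to {max_fat_tree_nodes(radix, 3):,} nodes in three levels")
```

By the same formula, the 40-port mode tops out at 16,000 nodes and a 36-port EDR switch at 11,664, which is why the higher radix matters so much for large clusters.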
The new adapter product is ConnectX-6, and unlike the switch discussed above, it supports 200 Gbps for both InfiniBand and Ethernet (so presumably a 200 Gbps Ethernet switch will be forthcoming, but let’s not get ahead of ourselves). The adapter will support 200 million messages per second, while boasting an end-to-end latency of just 0.6 microseconds.
The tricky part of the faster adapter speed is the limitation of the PCI Express bus. Although ConnectX-6 supports both PCIe Gen 3 and Gen 4, at this point the only chipmaker that has announced support for Gen 4 is IBM, in its Power9 processors. PCIe Gen 4 will support 200 Gbps with a typical 16-lane (x16) setup, but PCIe Gen 3 will need a 32-lane (x32) interface, which is almost unheard of. So Intel Xeon-based servers will need that extra-wide PCIe interface, at least until the chipmaker moves up to PCIe Gen 4. The same goes for AMD and the Zen-based servers that appear in 2017. Presumably Mellanox and its OEM partners are working this out.
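The lane counts come straight from the PCIe signaling rates. Gen 3 runs at 8 GT/s per lane and Gen 4 at 16 GT/s, both with 128b/130b encoding. The following sketch computes raw one-direction link bandwidth from those published rates; it ignores packet and protocol overhead, so usable throughput would be somewhat lower, and the helper names are illustrative.

```python
# Per-lane signaling rate in GT/s for each PCIe generation.
GT_PER_LANE = {3: 8.0, 4: 16.0}
# Gen 3 and Gen 4 both use 128b/130b line encoding.
ENCODING_EFFICIENCY = 128 / 130

def pcie_gbps(gen: int, lanes: int) -> float:
    """Raw one-direction link bandwidth in Gbps, before protocol overhead."""
    return GT_PER_LANE[gen] * ENCODING_EFFICIENCY * lanes

def min_lanes(gen: int, target_gbps: float, widths=(1, 2, 4, 8, 16, 32)):
    """Smallest standard link width that meets the target, or None."""
    for width in widths:
        if pcie_gbps(gen, width) >= target_gbps:
            return width
    return None

for gen in (3, 4):
    lanes = min_lanes(gen, 200)
    print(f"Gen {gen}: x16 = {pcie_gbps(gen, 16):.0f} Gbps, "
          f"needs x{lanes} for 200 Gbps")
```

A Gen 3 x16 link provides roughly 126 Gbps, enough for a 100 Gbps adapter but well short of 200 Gbps, while Gen 4 x16 and Gen 3 x32 both land at roughly 252 Gbps.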
That’s not going to be an issue in the upcoming Sierra and Summit supercomputers that are being installed by the Department of Energy toward the end of 2017 through the beginning of 2018. Those are Power9/Volta GPU-based systems and are now expected to use Mellanox’s HDR InfiniBand to hook the nodes together. Given the computational density of those nodes, it’s a good thing they will have 200 Gbps bandwidth to feed those processors.
Besides the doubling in speed and the higher-radix switch, the new switch and adapter hardware also support in-network computing, the application offload capability Mellanox started adding in its 100 Gbps products. In fact, the company thinks this is a bigger deal than any temporary bandwidth advantage it may enjoy over its competition, since this kind of capability is unique to Mellanox and can speed up application performance by a factor of 10 or more.
The general idea is to run as much of the generic data processing as possible in the adapters and switches. That encompasses low-level code like MPI collectives and tag matching, background checkpointing, and encryption protocols, among others. “We’re moving around 60 to 70 percent of MPI to the network,” says Shainer. “In the future, we’ll probably see 100 percent of MPI moved to the network.”
The rationale is that if you lighten the load on the server’s processors, they can devote all their cycles to the application. At the same time, throughput can be made more efficient, since the data can be processed locally in the network, without having to send it all the way to the server end-point.
Product availability for the switches, adapters, and cables is slated for 2017, although Mellanox is not saying anything more specific than that. If you want to see the technology in action sooner, and you’re in Salt Lake City next week, stop by the SC16 conference. Company reps will be demonstrating the HDR gear in their booth.