By: Michael Feldman
Just as Intel’s newly minted Omni-Path interconnect is challenging InfiniBand for supremacy in the HPC datacenter, Mellanox is ramping up its 100G product line. The latest addition is ConnectX-5, the company’s newest adapter, which will offer MPI offloading under the banner of co-design. The new network device will come in InfiniBand and Ethernet flavors and will also include additional capabilities and performance tweaks on top of 100G speeds.
Besides the network offload goodies, which we’ll get to in a moment, the ConnectX-5 adapter also offers some additional features, namely support for PCIe 4.0 (coming in 2017), an integrated PCIe switch to attach extra devices (like, say, flash memory), advanced dynamic routing, and more flexible topology design. The adapter also includes an NVMe over Fabrics (NVMf) capability that allows flash appliances to be connected remotely and accessed via RDMA. Overall, ConnectX-5 packs a bunch of interesting new features into an existing product set.
Typically, Mellanox will introduce a new network adapter platform every couple of years. Usually this means a jump in bandwidth coinciding with a new InfiniBand generation. But occasionally the company will just refresh the hardware, as it did with ConnectX-2, the company’s second QDR InfiniBand NIC. This year’s launch of ConnectX-5 looks to be along those lines, namely an adapter upgrade of EDR and 100G Ethernet before the move to the next InfiniBand speed bump: HDR. In case you’ve forgotten, HDR will double EDR bandwidth, promising 200 gigabits/second, server-to-server.
HDR InfiniBand is expected to arrive sometime in late 2017, with 200G Ethernet to follow shortly thereafter. But the release of ConnectX-5 suggests that the launch of 200G products may slide well into 2018. Here it’s also worth noting that Summit and Sierra, two of the Department of Energy’s CORAL pre-exascale supercomputers scheduled for installation in 2017, will be equipped with Mellanox InfiniBand and are being designed with EDR technology, not HDR.
On a side note, ConnectX-5 will provide 200G throughput when systems supporting PCIe 4.0 become available next year. Whether this means there is HDR InfiniBand and 200G Ethernet technology lurking in the adapter’s ASIC, or the additional speed is the result of extra links sitting on the device, is not clear. In either case, supporting 200G should extend the useful lifetime of these cards significantly.
From a competitive standpoint, ConnectX-5 is poised to go head-to-head against Intel’s Omni-Path Architecture (OPA) adapter products, which started shipping earlier this year. Like EDR InfiniBand and 100G Ethernet, the first-generation OPA adapters will support 100 gigabits/second of bandwidth between servers. Both vendors are also providing latencies in the 100ns range (for both adapters and switches, port-to-port), give or take a few tens of nanoseconds.
Such parity didn’t prevent both companies from making cross-claims of performance advantages under certain circumstances, but such numbers often don’t hold up once you get into the real world of applications running inside actual HPC machinery. So for the near-term, the competition between the two technologies will be based on price, feature set, and other ancillary capabilities.
Starting with SwitchIB-2 and continuing with ConnectX-5, Mellanox is taking a different tack, not just in technology, but in the marketing message. The new approach is geared toward optimizing overall system throughput, rather than focusing on the raw speed of the network hardware. Co-design, a paradigm that aims to increase application performance via hardware-software synergies, and which is being tapped for exascale supercomputer designs, is the key enabler of this strategy.
For networks, co-design centers on addressing the issue of data movement between servers, which is the most persistent bottleneck in the modern datacenter. The problem predates the arrival of “big data” and will likely persist long after that phrase goes out of fashion. The most favored architectural solution to this dilemma involves moving the processor closer to the data.
Mellanox’s particular take on this is to relocate as much of the data processing portion of the application as possible from the processor to the network itself, either the switch or the adapter card, leaving the CPU (or GPU or FPGA) to do actual computation. The company has established a series of APIs that allow applications to offload data aggregation and data reduction work to the switch box or the adapter card, whichever makes the most sense in a particular data flow. So while data is winding its way through the network, it can be transformed without having to send it to the CPU. This not only reduces the computational load on the processor, but in many cases, actually reduces the amount of data that has to be moved around the system.
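To see why in-network reduction cuts data movement as well as CPU load, consider a back-of-the-envelope sketch. The code below is illustrative only, not Mellanox’s actual API: it just compares the bytes converging on a root node in a naive reduction against a scheme where switches combine vectors as they pass through, so each link carries a single (already reduced) vector no matter how many nodes contribute.

```python
# Illustrative sketch (not Mellanox's API): traffic in a naive
# reduction versus an in-network reduction.

def naive_reduce_traffic(num_nodes: int, vector_bytes: int) -> int:
    """Bytes arriving at the root when every node ships its full
    vector across the network to be summed on a host CPU."""
    return (num_nodes - 1) * vector_bytes

def in_network_reduce_traffic(vector_bytes: int) -> int:
    """Bytes on any single link when switches sum vectors in flight:
    each upstream link carries exactly one combined vector."""
    return vector_bytes

if __name__ == "__main__":
    nodes, vec = 1024, 4096  # hypothetical cluster and message size
    print(naive_reduce_traffic(nodes, vec))   # 4190208 bytes at the root
    print(in_network_reduce_traffic(vec))     # 4096 bytes per link
```

The gap widens linearly with node count, which is why the approach matters most on large clusters.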
Basically, this relieves the CPU from certain mundane tasks, allowing it to devote more cycles to the more computationally-demanding algorithms in the application. According to Gilad Shainer, VP of marketing at Mellanox, that’s the approach taken in SwitchIB-2, and now ConnectX-5. “The network will become a coprocessor within a system,” he explained.
For HPC applications, Mellanox has made it possible to move most of the execution of MPI collectives and tag matching logic onto the switch or adapter card. This could indeed have a significant impact on message throughput, considering that MPI is the communication protocol used by almost every HPC cluster on the planet. At this point, Shainer thinks maybe 60 to 70 percent of MPI can now execute on their gear. “Moving forward, we’ll see the entire MPI stack being run on the network,” he said.
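Tag matching, one of the pieces ConnectX-5 can offload, is conceptually simple but burns CPU cycles when done in software at high message rates. The sketch below is a hedged illustration of the MPI-style matching rules, not Mellanox’s implementation: incoming messages are matched against a queue of posted receives by (source, tag), honoring MPI’s wildcard semantics.

```python
# Hedged sketch of MPI-style tag matching, the kind of logic
# ConnectX-5 can execute on the adapter. Names are illustrative.

ANY_SOURCE = -1   # stand-ins for MPI_ANY_SOURCE / MPI_ANY_TAG
ANY_TAG = -1

def match_receive(posted, source, tag):
    """Return (and remove) the first posted receive that matches an
    incoming message from `source` with `tag`, else None."""
    for i, (want_src, want_tag) in enumerate(posted):
        if (want_src in (ANY_SOURCE, source)
                and want_tag in (ANY_TAG, tag)):
            return posted.pop(i)
    return None  # unexpected message: queued until a receive is posted

posted = [(3, 7), (ANY_SOURCE, 9)]
print(match_receive(posted, 5, 9))   # (-1, 9): wildcard source matches
print(match_receive(posted, 3, 7))   # (3, 7): exact match
print(match_receive(posted, 1, 2))   # None: no posted receive matches
```

Doing this scan in adapter hardware frees the host CPU from walking the posted-receive queue for every arriving message.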
On a related note, ConnectX-5 delivers up to 200 million messages per second, a significant increase from ConnectX-4’s 150 million messages per second. Mellanox claims this is more than twice what Omni-Path adapters can muster, although the current Intel literature advertises 160 million messages per second. Nevertheless, message throughput of this magnitude can speed HPC applications significantly, especially those running on hundreds or thousands of nodes.
More generically, any application that does data aggregation and reduction can benefit from Mellanox’s network computing capabilities. So even common workloads like parallel web searches and data analytics are fair game, since these types of data transforms are used extensively for such workloads. The benefit is not completely free, since the part of the code that manages data communications has to be modified to invoke these new network functions, but usually this logic is embedded in low-level libraries (like an MPI library) that are common to numerous applications.
In addition to network computing, an in-network memory capability is also available to store application data in convenient places in the network. Again the idea is to keep the data closer to where it is being processed and spend less time shuffling it around the system. For example, global application data being shared across multiple cluster nodes could be accessed much more quickly if it was stored on the adapter card rather than further away in the main memory of a server, or multiple servers.
All of this dovetails nicely with Mellanox’s propensity to offload the CPU to the greatest extent possible, a feature of the company’s offerings even before it went down the network computing path. Intel, on the other hand, relies on the CPU to run a lot more of the network processing (although not all of it), which is the result of Omni-Path’s QLogic inheritance. A decent case can be made for either approach, and both companies will undoubtedly continue to make them.
One thing is becoming clear. With two major vendors competing in the HPC interconnect space, we’re seeing a lot more interesting features being offered than we have in some time. That makes the choice of network fabric more difficult than simply picking the fastest hardware, but it also refocuses users on how their applications actually work. And that’s probably good for everyone.