International Team Builds 20-GPU Server for Deep Learning

Aug. 3, 2016

By: Michael Feldman

A team of developers from CoCoLink, a South Korean HPC provider, and Orange Silicon Valley, a California-based technology innovation group, has built a server equipped with 20 overclocked NVIDIA K40 GPUs and intended for deep learning work. According to the press release, the prototype is capable of delivering 100 single precision teraflops and is “the world’s highest density deep learning supercomputer.”

CoCoLink Corp is a spinoff of Seoul National University and offers a variety of HPC solutions, including both coprocessor-accelerated systems and more conventional Xeon-based clusters. The Klimax-210 is the company’s flagship product and uses PCIe as the internal fabric to hook together as many as 20 graphics processors in a 4U air-cooled chassis. Besides GPUs, the chassis can also be outfitted with Intel Xeon Phi processors and NVMe SSD devices, as well as InfiniBand adapters for building a cluster of these servers. The company’s roadmap calls for increasingly scaled-up PCIe-based boxes, culminating in a 72-slot chassis in 2020.


Image: Klimax-210. Credit: CoCoLink.


The collaboration with Orange Silicon Valley, which helped build the 20-GPU prototype, also leverages a relationship with the group’s Paris-based parent company, Orange, which provided AI software expertise. For this project, the Orange researchers integrated Caffe, a widely used deep learning framework, and were able to scale their ImageNet training application to 16 of the K40 GPUs. Eventually they hope to scale the software to the full 20 GPUs, and beyond that to a cluster of such servers.

The developers also built a system using the latest Pascal GPUs from NVIDIA. In this case, they slotted 10 consumer-grade GeForce GTX 1080 processors into the Klimax-210 chassis (the P100 parts are not yet generally available). Once again they were able to overclock the GPUs, achieving an aggregate theoretical performance of 106 single precision teraflops. A fully outfitted server with 20 such GPUs would deliver something well north of 200 teraflops.
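The arithmetic behind those figures can be sketched in a few lines of Python. This is only an illustration of the numbers quoted above: it assumes that aggregate theoretical throughput scales linearly with GPU count, and the variable and function names are made up for this example.

```python
# Figures quoted in the article (aggregate theoretical single precision teraflops).
K40_AGGREGATE_TFLOPS = 100.0      # 20 overclocked K40s
K40_COUNT = 20
GTX1080_AGGREGATE_TFLOPS = 106.0  # 10 overclocked GTX 1080s
GTX1080_COUNT = 10
FULL_CHASSIS_SLOTS = 20           # Klimax-210 holds as many as 20 GPUs


def per_gpu_tflops(aggregate_tflops, gpu_count):
    """Average theoretical throughput per GPU, assuming linear aggregation."""
    return aggregate_tflops / gpu_count


k40_each = per_gpu_tflops(K40_AGGREGATE_TFLOPS, K40_COUNT)
gtx1080_each = per_gpu_tflops(GTX1080_AGGREGATE_TFLOPS, GTX1080_COUNT)

# Assumption: a fully populated chassis reaches the same per-GPU overclock.
full_chassis_tflops = gtx1080_each * FULL_CHASSIS_SLOTS

print(f"K40, per GPU:        {k40_each:.1f} teraflops")
print(f"GTX 1080, per GPU:   {gtx1080_each:.1f} teraflops")
print(f"20 x GTX 1080 total: {full_chassis_tflops:.1f} teraflops")
```

Under those assumptions, each overclocked GTX 1080 contributes a little more than twice the theoretical throughput of an overclocked K40, which is where the "well north of 200 teraflops" estimate for a full chassis comes from.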

Although the Orange researchers were able to scale Caffe up to only 8 of those 10 GPUs, they realized a significant performance boost. In the process, they determined that the new Pascal GeForce could run their software about three and a half times as fast as the 2014-era Kepler K40 hardware. An ImageNet training application that the Orange researchers ran on a single K40 took 36 hours; on the 8-GPU setup using the GeForce GTX 1080, it took just 3.5 hours.
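The two training times translate into an overall speedup as sketched below. The sketch assumes both runs used the same ImageNet workload, and it reads the "three and a half times" figure as a one-GPU-to-one-GPU comparison; neither point is spelled out in the press material, so the scaling efficiency it derives is a rough estimate, not a reported result.

```python
# Training times quoted in the article.
K40_SINGLE_GPU_HOURS = 36.0   # one K40
GTX1080_8GPU_HOURS = 3.5      # eight GTX 1080s
GTX1080_GPUS_USED = 8

# Assumption: the quoted 3.5x figure compares one GTX 1080 with one K40.
PER_GPU_SPEEDUP = 3.5

# Overall speedup of the 8-GPU Pascal run over the single-K40 run.
overall_speedup = K40_SINGLE_GPU_HOURS / GTX1080_8GPU_HOURS

# Speedup expected if eight GPUs scaled perfectly linearly.
ideal_speedup = PER_GPU_SPEEDUP * GTX1080_GPUS_USED

# Fraction of that ideal actually achieved (hypothetical metric for illustration).
scaling_efficiency = overall_speedup / ideal_speedup

print(f"Overall speedup:    {overall_speedup:.1f}x")
print(f"Ideal 8-GPU speedup: {ideal_speedup:.1f}x")
print(f"Implied efficiency: {scaling_efficiency:.0%}")
```

If those assumptions hold, the gap between the overall and ideal speedups is consistent with the point that multi-GPU scaling, not raw hardware throughput, is the hard part.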

Presently there are no plans to commercialize either of these systems, although presumably one could reproduce the K40 hardware configuration by purchasing a system from CoCoLink. (The company also sells servers outfitted with K80 and M40 GPUs.) The software integration and GPU scaling are the tricky parts and would be application-dependent.

For their part, the developers intend to publish benchmark data on their work as they continue to refine their software. The goal is to maintain the research as an open source collaboration project for the deep learning community, inviting other interested academic and industry partners to take part.