By: Michael Feldman
Researchers using the Department of Energy’s Cori supercomputer have broken the 10-petaflop barrier on two separate deep learning applications, one of which attained a peak throughput of 15 petaflops.
The feat was performed by a team of scientists and engineers from Lawrence Berkeley National Laboratory, Stanford University and Intel Corporation, who documented their accomplishment in a research paper published on August 17. The paper’s authors state that this is “the first 15-petaflop deep learning system for solving scientific pattern classification problems on contemporary HPC architectures.”
In this case, those petaflops are of the single precision (SP) variety, since 32-bit floating point operations generally provide as much accuracy as is needed to train neural networks. Fortunately, Cori can deliver these flops in abundance. The system, a Cray XC40 supercomputer equipped with 9,688 Intel Xeon Phi 7250 (Knights Landing) processors, has a peak performance of 59 SP petaflops. However, according to the authors, continuous use of the advanced vector extension (AVX) facility that is employed for deep learning matrix operations will drop the clock speed from 1.4 to 1.2 GHz, yielding a sustained performance of 50.6 SP petaflops.
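The 50.6-petaflop figure is consistent with simple clock scaling. The short Python sketch below reproduces it from the numbers quoted above, assuming throughput falls linearly with clock frequency; that linearity and the function name `scale_by_clock` are illustrative assumptions, not something taken from the paper.

```python
# Figures quoted in the article.
PEAK_SP_PFLOPS = 59.0   # system peak, single precision, at the base clock
BASE_CLOCK_GHZ = 1.4    # nominal Xeon Phi 7250 clock
AVX_CLOCK_GHZ = 1.2     # clock under continuous AVX use


def scale_by_clock(peak_pflops, base_ghz, reduced_ghz):
    """Scale peak throughput by the clock ratio (assumes linear scaling)."""
    return peak_pflops * reduced_ghz / base_ghz


sustained_pflops = scale_by_clock(PEAK_SP_PFLOPS, BASE_CLOCK_GHZ, AVX_CLOCK_GHZ)
print(f"{sustained_pflops:.1f} SP petaflops")  # prints "50.6 SP petaflops"
```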
The application that achieved the highest flops yield was one that extracted meteorological patterns using a 15TB dataset generated from climate simulations. Without getting too deep into the nitty-gritty, the software employed a semi-supervised approach, using a combination of a fully supervised convolutional network and an unsupervised convolutional autoencoder. This model yielded a peak throughput of 15.07 petaflops and sustained throughput of 13.27 petaflops, utilizing 9,622 of Cori’s Xeon Phi processors.
The other application used a convolutional neural network for discriminating signals in high-energy physics (HEP) data. In this case, the training data was based on 10 million images generated by physics simulations, so the problem was basically reduced to a binary image classification task. The HEP classification obtained a peak throughput of 11.73 petaflops and a sustained throughput of 11.41 petaflops, using 9,600 Xeon Phi processors.
Besides the impressive flop yields, application scalability was also notable. For the climate neural network, a 7,205-fold speedup was attained when scaling the problem from 1 to 9,622 Xeon Phis. For the HEP model, a 6,173-fold speedup was realized when going from 1 to 9,600 of these same processors.
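One common way to read these speedups is as parallel efficiency, meaning speedup divided by processor count. The article does not report this metric, so the sketch below is only an illustrative calculation from the quoted numbers; the helper name `parallel_efficiency` is hypothetical.

```python
def parallel_efficiency(speedup, processors):
    """Fraction of ideal (linear) speedup actually achieved."""
    return speedup / processors


# (speedup, processor count) pairs as quoted in the article.
runs = {
    "climate": (7205, 9622),
    "HEP": (6173, 9600),
}

efficiency = {name: parallel_efficiency(s, p) for name, (s, p) in runs.items()}

for name, eff in efficiency.items():
    print(f"{name}: {eff:.1%}")
# prints:
# climate: 74.9%
# HEP: 64.3%
```

By this measure, the climate network retained roughly three quarters of ideal linear scaling at full machine size, and the HEP model roughly two thirds.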
Note that for a single Xeon Phi, both these applications used about two SP teraflops out of the six SP teraflops available on the Xeon Phi 7250. While that might not seem like a great yield, it’s about as good as it gets for most real-world applications, whether traditional HPC codes, deep learning, or anything else. Usually, only artificial benchmarks, like Linpack, or applications that spend almost their entire time doing dense matrix math, would get better yields.
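As a rough cross-check of the per-processor figures, the sketch below derives the single-chip peak from the system numbers quoted earlier and compares it with the roughly two teraflops achieved. The value 2.0 is the article’s "about two", not a measured number, and the variable names are illustrative.

```python
SYSTEM_PEAK_SP_PFLOPS = 59.0     # system peak quoted in the article
PROCESSORS = 9688                # Xeon Phi 7250 processors in Cori
ACHIEVED_TFLOPS_PER_CHIP = 2.0   # "about two SP teraflops" (approximate)

# Convert petaflops to teraflops, then divide across the processors.
per_chip_peak_tflops = SYSTEM_PEAK_SP_PFLOPS * 1000 / PROCESSORS
yield_fraction = ACHIEVED_TFLOPS_PER_CHIP / per_chip_peak_tflops

print(f"per-chip peak: {per_chip_peak_tflops:.2f} SP teraflops")
print(f"single-chip yield: {yield_fraction:.0%}")
```

This works out to a little over six teraflops of peak per chip and a yield of about one third, in line with the figures above.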
Of course, accuracy of the results, rather than flops or peak throughput, is the ultimate goal. By this measure, the applications seemed to do rather well. For the HEP classification problem, the software delivered a hit rate of up to 72 percent in detecting signals, compared to just a 42 percent hit rate for the associated benchmark analysis. For the climate science application, there were no benchmarks to compare the results against, but the authors wrote that the software “does a good job of localizing and identifying tropical cyclones.”
The authors identify a few areas they believe are worth additional study for scaled-out deep learning, including dealing with runtime variability due to node failures, optimizing batch sizes for different frameworks, and evaluating newer deep learning kernels using Winograd and FFT-based algorithms. They also think there is a lot of opportunity to be had when the next generation of supercomputers comes online, many of which will support even lower-precision computation. “We believe that such systems have the potential to further accelerate training time for our applications,” they write.
For a fuller account of the work, download the entire 12-page research paper.