IBM researchers claim they have come up with a much more efficient model for processing neural networks, using just 8 bits for training and only 4 bits for inferencing. The research is being presented this week at the International Electron Devices Meeting (IEDM) and the Conference on Neural Information Processing Systems (NeurIPS).
In a nutshell, IBM will demonstrate custom-built hardware sporting reduced precision processing units, as well as new algorithmic techniques able to leverage that hardware for training and inferencing deep neural networks (DNNs). The primary goal is to increase the energy efficiency of the hardware such that it can be applied to a wider scope of AI solutions. From the press release/blog:
The coming generation of AI applications will need faster response times, bigger AI workloads, and multimodal data from numerous streams. To unleash the full potential of AI, we are redesigning hardware with AI in mind: from accelerators to purpose-built hardware for AI workloads, like our new chips, and eventually quantum computing for AI. Scaling AI with new hardware solutions is part of a wider effort at IBM Research to move from narrow AI, often used to solve specific, well-defined tasks, to broad AI, which reaches across disciplines to help humans solve our most pressing problems.
Specifically, IBM Research is proposing hardware that offers 8-bit floating point (FP8) precision for training neural networks. That is half the 16-bit precision (FP16) that has been the de facto standard for DNN work since 2015. (The proposed hardware will rely on FP16 for accumulating the dot products, rather than the FP32 that is used now.) With the help of new algorithmic techniques, which we’ll get to in a moment, the IBM researchers say that they can maintain accuracy across a variety of deep learning models. In fact, they have documented the training of deep neural networks based on image, speech and text datasets using FP8 precision, achieving model accuracy on par with that of FP32-based training.
The reduced precision model is based on three software innovations: a new FP8 format that allows matrix multiplication and convolution computations used for DNN training to work without loss of accuracy; a “chunk-based computation” technique that enables neural networks to be processed using only FP8 multiplication and FP16 addition; and the use of floating-point stochastic rounding in the weight update process, allowing these updates to be computed with 16 bits of precision rather than 32 bits.
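To see why the second and third techniques matter, here is a minimal NumPy sketch of the general ideas. It is not IBM's implementation. The function names, the chunk size of 64 and the test values are illustrative assumptions, and the stochastic rounding routine ignores FP16 subnormals and overflow. The first half shows how a long FP16 accumulation stalls once the running sum dwarfs each addend, and how summing in short chunks avoids that. The second half shows how stochastic rounding preserves, on average, a weight update that round-to-nearest would discard.

```python
import numpy as np

def naive_fp16_sum(values):
    """Accumulate sequentially in a single FP16 register."""
    acc = np.float16(0.0)
    for v in values:
        acc = np.float16(acc + np.float16(v))
    return float(acc)

def chunked_fp16_sum(values, chunk_size=64):
    """Sum short chunks in FP16, then sum the chunk partials in FP16.

    chunk_size=64 is an arbitrary illustrative choice.
    """
    partials = [
        naive_fp16_sum(values[start:start + chunk_size])
        for start in range(0, len(values), chunk_size)
    ]
    return naive_fp16_sum(partials)

def stochastic_round_fp16(x, rng):
    """Round float32 values to FP16, rounding up with probability equal to
    the fractional distance to the upper FP16 neighbour.

    Simplification: assumes values in the normal FP16 range.
    """
    x = np.asarray(x, dtype=np.float32)
    _, exp = np.frexp(x)                       # x = m * 2**exp, 0.5 <= |m| < 1
    ulp = np.ldexp(np.float32(1.0), exp - 11)  # FP16 spacing: 11-bit significand
    lower = np.floor(x / ulp) * ulp            # nearest FP16 value at or below x
    prob_up = (x - lower) / ulp
    round_up = rng.random(x.shape) < prob_up
    return (lower + round_up * ulp).astype(np.float16)

# Accumulation: 4096 ones. FP16 cannot represent 2049, so the naive sum stalls.
ones = np.ones(4096)
print(naive_fp16_sum(ones))    # 2048.0
print(chunked_fp16_sum(ones))  # 4096.0

# Weight update: 1.0 + 1e-4 is below half the FP16 spacing at 1.0.
rng = np.random.default_rng(0)
updated = np.full(100_000, 1.0001, dtype=np.float32)
nearest = updated.astype(np.float16)           # every element rounds back to 1.0
stochastic = stochastic_round_fp16(updated, rng)
print(nearest.astype(np.float64).mean(), stochastic.astype(np.float64).mean())
```

With round-to-nearest, the small update vanishes every time. With stochastic rounding, any single result is either 1.0 or the next FP16 value up, but the average over many updates tracks the true value.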
The hardware they are showcasing this week is a 14nm processor based on a “novel dataflow-based core.” It comprises reduced precision dataflow engines, 16-bit chunk accumulation engines, and on-core memory and memory access engines. The researchers claim that this design has the potential to deliver a two- to four-fold improvement in training performance compared to today’s platforms. Some of this improvement is the result of a 2x reduction in the bit width used to train the models, but the rest is due to the software techniques used to exploit the reduced precision.
Perhaps more significantly, IBM Research says that energy efficiency can be improved by more than 2x to 4x, for two reasons: their FP8/FP16 model needs less memory bandwidth and storage than the standard FP16/FP32 model, and their hardware is custom-built for processing these neural networks. This, say the researchers, will enable DNN models to be trained on some edge devices, rather than just in datacenter servers.
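The storage side of that argument is simple arithmetic. The short sketch below uses a hypothetical 25-million-parameter model (the figure is an assumption chosen for round numbers, not one from IBM) and shows how weight storage, and with it the data moved per pass over the weights, scales linearly with bit width.

```python
def weight_storage_mb(num_params, bits_per_weight):
    """Storage for the weights alone, in megabytes (10**6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6

NUM_PARAMS = 25_000_000  # hypothetical model size, for illustration only

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_storage_mb(NUM_PARAMS, bits):6.1f} MB")
```

This counts weights only. Activations, gradients and optimizer state add to the total, so the real savings depend on which of those are also held at reduced precision.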
The researchers have also published a paper on using 4-bit inferencing across a number of deep learning applications, again without loss of accuracy. (Most inferencing these days is based on computations using 8 or more bits.) The significance here is that once again the reduction of bit width will improve throughput and energy efficiency. The need for reduced precision also makes it more natural to build a unified architecture for training and inferencing, based on bit precision optimized during the training phase. According to the researchers, such hardware can deliver a super-linear improvement in inferencing performance thanks to the reduction of area on the processor devoted to computation plus the ability to retain the model and activation data in memory.
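For readers unfamiliar with what 4-bit inferencing involves, the following is a minimal sketch of generic uniform symmetric quantization. The article does not describe IBM's quantization scheme, so this should not be read as their method. The function names and the choice of a symmetric range of -7 to 7 are assumptions made for illustration.

```python
import numpy as np

def quantize_int4(x):
    """Map float values to signed 4-bit codes in [-7, 7] plus one scale factor."""
    x = np.asarray(x, dtype=np.float32)
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    # Codes are stored in int8 here; real hardware would pack two per byte.
    codes = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return codes, scale

def dequantize_int4(codes, scale):
    """Recover approximate float values from the 4-bit codes."""
    return codes.astype(np.float32) * scale

weights = np.linspace(-1.0, 1.0, 9)
codes, scale = quantize_int4(weights)
restored = dequantize_int4(codes, scale)
print(codes)
print(np.max(np.abs(restored - weights)))  # at most about scale / 2
```

Each value is reduced to one of 15 levels, so the worst-case error per weight is about half the step size. Keeping model accuracy at that resolution is the hard part, and it is the result the researchers say they have achieved.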
A related area of research has to do with applying this reduced precision model to analog chips, which are inherently less precise than their digital cousins but a great deal more energy efficient. IBM researchers have developed an 8-bit analog accelerator using phase-change memory (PCM) that can act as both the computational substrate and storage medium for processing neural networks. Based on work revealed earlier this year, IBM Research has implemented a novel addition to the technology, called projected PCM (Proj-PCM), which mitigates some of the annoying imprecision of PCM hardware. The research team believes the design can deliver high levels of performance for AI training and inferencing in power-constrained environments like IoT and edge devices.
Although all of this is still in the research phase, IBM is clearly interested in building its own AI chips and accelerators and getting them into the hands of its customers. How it plans to commercialize the technology remains to be seen, though. Regardless, if reduced precision training and inferencing catch on, IBM will have plenty of competition – not just from industry stalwarts like Intel and NVIDIA, which will adapt their own processor platforms accordingly, but also from AI chip startups, which seem to sprout up daily. In such a rapidly changing environment, success will favor the most nimble.