Habana Labs emerged from stealth mode this week with the announcement of its custom-built AI inference processor, which the company claims can outrun the fastest GPUs.
According to the company’s internal testing, its new Goya HL-1000 chip delivered a world-record 15,000 images per second inferencing a trained ResNet-50 neural network (batch size = 10), with an average latency of just 1.3 ms. The closest numbers for a GPU are provided by NVIDIA’s Tesla V100, which reaches 2,685 images per second with a latency of 3.0 ms (batch size = 8). Even with a batch size of 128, the V100 could only manage 6,275 images per second, while latency increased to 20 ms. To maintain a reasonable level of interactive response, latency for most applications should ideally be no higher than 7 ms.
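A quick back-of-envelope check shows how throughput and latency relate in these figures. Note that the "batches in flight" concurrency factor below is our own inference from the quoted numbers, not a vendor-stated figure:

```python
def naive_throughput(batch_size, latency_s):
    """Images/s if only one batch were in the pipeline at a time."""
    return batch_size / latency_s

# Goya: 10 images per batch at 1.3 ms average latency
goya_naive = naive_throughput(10, 1.3e-3)   # ~7,692 images/s

# The claimed 15,000 images/s exceeds that, implying roughly two
# batches are being processed concurrently (pipelined) on the chip.
goya_concurrency = 15_000 / goya_naive      # ~1.95

# V100: 8 images per batch at 3.0 ms latency
v100_naive = naive_throughput(8, 3.0e-3)    # ~2,667 images/s
# The claimed 2,685 images/s is close to the naive figure, suggesting
# little batch-level overlap at this batch size.
```

In other words, Goya's throughput advantage at small batch sizes appears to come from keeping multiple batches in flight without inflating per-batch latency.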
The Pascal-generation Tesla P4 GPU, which is specifically built for inferencing, achieves 1,291 images per second and a latency of 6.2 ms inferencing a ResNet-50 model (batch size = 8). Its successor, the just-announced Tesla T4 GPU, doesn’t have hard numbers yet for ResNet-50 inferencing, but on paper at least, it’s nearly six times as powerful in raw INT8 performance compared to the P4 (130 teraops vs. 22 teraops).
Of course, one of the advantages of the Goya processor is that it’s designed entirely for deep learning inference; there are no legacy bits of graphics logic. According to the company, the chip is a SIMD processor that supports the data types the Habana designers deemed most useful for inference: FP32, INT32, INT16, INT8, UINT32, UINT16, and UINT8. The die comprises eight Tensor Processing Cores (TPC), each of which has its own local memory, as well as access to on-chip shared memory. External memory is accessed via a DDR4 interface.
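The reason integer types like INT8 matter for inference is quantization: trained FP32 weights are mapped to narrow integers, trading a little precision for much higher throughput. The source doesn’t describe Goya’s quantization scheme, so the sketch below shows a generic symmetric INT8 scheme for illustration only:

```python
def quantize_int8(values):
    """Symmetric quantization: map FP32 values into [-127, 127]
    using a single scale factor derived from the largest magnitude."""
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate FP32 values from INT8 codes."""
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_int8(weights)        # q = [50, -127, 1, 100]
approx = dequantize_int8(q, scale)       # close to the originals
```

Each INT8 value occupies a quarter of the space of an FP32 value, which is why a chip’s raw INT8 teraops figure (as quoted for the P4 and T4 above) is the headline number for inference hardware.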
The Goya processor can be programmed with Habana’s SynapseAI toolkit, which provides a C API for describing the neural network to be executed and a Python API for loading models built with existing deep learning frameworks. All the major frameworks are supported, including TensorFlow, MXNet, Caffe2, Microsoft Cognitive Toolkit, PyTorch, and the Open Neural Network Exchange (ONNX) format. Once a trained model is loaded, it’s converted into an internal representation and optimized for the Goya hardware.
The company says the hardware is not limited to specific workloads or application domains, but it’s not clear how much software support the toolkit currently provides. At this point, Habana lists vision, neural machine translation, sentiment analysis, and recommender systems as examples of models that have been executed on Goya.
The processor will initially be available as a 200-watt PCIe card that comes with 4, 8, or 16 GB of DDR4 memory. The 200-watt thermal limit is a bit high for a scale-out server environment, but for the ResNet scenario described above, the company is claiming a power draw of just 100 watts.
Habana is currently sampling Goya to select customers. It plans to do the same for Gaudi, its first AI training processor, in the second quarter of 2019. For this chip, the company is promising that training performance will “scale linearly to thousands of processors.” Stay tuned.