NVIDIA Unveils New Inferencing GPU for Datacenter

At GTC Japan, NVIDIA announced the Tesla T4 GPU, the company’s first datacenter product that incorporates the capabilities of the Turing architecture. To go along with the new hardware, the GPU-maker is also releasing enhanced TensorRT software.



As we predicted last month, when NVIDIA launched the first Turing GPUs in its Quadro line, the company is now using the architecture to juice up its Tesla portfolio, in this case, in the form of the Tesla T4 GPU. Since one of the major Turing innovations is a much higher capacity for deep learning inference via the hardware’s enhanced INT8 and INT4 capabilities, it makes sense that the first Turing-based Tesla GPU would be aimed at this type of workload.

As has been evident for some time, there is a large and growing market for deep learning inference, driven principally by the big web service providers -- Google, Amazon, Microsoft, Facebook, et al. Unlike the training aspect of deep learning, inferencing is a volume business since it is used to repeatedly query already-trained neural networks. NVIDIA believes this inferencing market will grow to $20 billion in the next five years.

The new T4 GPU supersedes the P4 GPU, a Pascal-generation product introduced in 2016, aimed at this same market. (NVIDIA skipped a Volta-based inferencing GPU.) The T4 comes with 2,560 CUDA cores, along with 320 Turing Tensor Cores, which are the ones purpose-built for deep learning computations. Its peak single-precision floating-point (FP32) performance is a respectable 8.1 teraflops, while its mixed-precision (FP16/FP32) performance is an impressive 65 teraflops.

Where it really shines, though, is in 8-bit integer (INT8) and 4-bit integer (INT4) computations, which are the ones often used for inferencing. In this capacity, the T4 can deliver 130 teraops for INT8 and 260 teraops for INT4, keeping in mind that only a subset of neural networks can be effectively inferenced with 4-bit precision. NVIDIA is claiming the T4 can process inference queries 20 to 40 times faster than a CPU, specifically, an Intel Xeon Gold 6140 processor.
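To make the appeal of INT8 concrete, here is a minimal sketch of symmetric 8-bit quantization applied to a small dot product, the basic operation behind neural network inference. It is an illustration only: the function names and sample values are hypothetical, and this is not how TensorRT or the Tensor Cores implement reduced-precision inference.

```python
# Illustrative sketch: symmetric per-tensor INT8 quantization of a dot product.
# All names and sample values are hypothetical, chosen for this example.

def quantize_int8(values):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def int8_dot(a_q, a_scale, b_q, b_scale):
    """Accumulate in integers, then rescale the result back to floating point."""
    acc = sum(x * y for x, y in zip(a_q, b_q))  # pure integer arithmetic
    return acc * a_scale * b_scale


weights = [0.42, -1.27, 0.05, 0.88]
activations = [1.5, 0.25, -0.75, 2.0]

w_q, w_scale = quantize_int8(weights)
a_q, a_scale = quantize_int8(activations)

exact = sum(w * a for w, a in zip(weights, activations))
approx = int8_dot(w_q, w_scale, a_q, a_scale)

print("quantized weights:", w_q)
print("exact FP result:  ", exact)
print("INT8 approximation:", approx)
```

The integer result tracks the floating-point one closely for this toy input, while each operand occupies a quarter of the storage of an FP32 value. Dropping to 4 bits leaves only a handful of representable levels per value, which is one way to see why only some networks tolerate INT4.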

The T4 PCIe card is equipped with 16 GB of GDDR6 memory, yielding over 320 GB/sec of bandwidth. Note that GDDR6 is something of a compromise here, inasmuch as the more performant HBM2 stacked memory that comes standard on the Tesla GPUs for training would have added cost. Unfortunately, GDDR6 memory tends to draw more power than HBM2 on a capacity basis. That said, NVIDIA has managed to get the entire T4 package into 75 watts, which makes it suitable for the kinds of scale-out servers that this product is geared for.
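For a rough sense of what the 75-watt envelope means, the peak figures quoted in this article can be turned into performance-per-watt numbers. The sketch below is back-of-the-envelope arithmetic on those quoted peaks; it assumes the card draws its full 75 watts at peak throughput, and sustained real-world efficiency will differ.

```python
# Back-of-the-envelope arithmetic using only the peak figures quoted in the article.
# Assumption: the full 75 W board power is drawn at peak throughput.

T4_PEAK_TERA_OPS = {
    "FP32": 8.1,              # teraflops
    "FP16/FP32 mixed": 65.0,  # teraflops
    "INT8": 130.0,            # teraops
    "INT4": 260.0,            # teraops
}
T4_POWER_WATTS = 75.0

# Peak tera-operations per second delivered per watt, by precision
efficiency = {
    precision: peak / T4_POWER_WATTS
    for precision, peak in T4_PEAK_TERA_OPS.items()
}

# Peak throughput of each precision relative to FP32
relative_to_fp32 = {
    precision: peak / T4_PEAK_TERA_OPS["FP32"]
    for precision, peak in T4_PEAK_TERA_OPS.items()
}

for precision in T4_PEAK_TERA_OPS:
    print(
        f"{precision:>16}: {efficiency[precision]:.2f} tera-ops/W, "
        f"{relative_to_fp32[precision]:.1f}x FP32 peak"
    )
```

On these numbers, INT8 works out to roughly 1.7 teraops per watt, about 16 times the card's FP32 peak within the same power budget.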

On the software side, NVIDIA has released version 5 of its TensorRT software, a set of libraries and tools for deep learning inference. Significantly, this latest version supports the new Turing Tensor Cores and adds optimizations for multi-precision workloads. In conjunction with this software, NVIDIA is also providing a TensorRT inference server, which it describes as “a containerized inference microservice that maximizes GPU utilization and seamlessly integrates into DevOps deployments with Docker and Kubernetes.” It can be had for free on the NVIDIA GPU Cloud’s container registry.

One of the first customers for the new T4 GPU is Google, which will make it available on a trial basis via an early access program. “AI is becoming increasingly pervasive, and inference is a critical capability customers need to successfully deploy their AI models,” said Chris Kleban, product manager at Google Cloud, “so we’re excited to support NVIDIA’s Turing Tesla T4 GPUs on Google Cloud Platform soon.”

Microsoft is also on record as an early adopter, despite the fact that the company has built out a vast infrastructure of FPGA-powered servers on Azure to perform at least some of its web service inferencing. “We are working hard at Microsoft to deliver the most innovative AI-powered services to our customers,” said Jordi Ribas, corporate vice president for Bing and AI Products at Microsoft. “Using NVIDIA GPUs in real-time inference workloads has improved Bing’s advanced search offerings, enabling us to reduce object detection latency for images. We look forward to working with NVIDIA’s next-generation inference hardware and software to expand the way people benefit from AI products and services.”

Although there are no T4 GPU-equipped servers available as of yet, NVIDIA has lined up an array of OEMs that are preparing to make them, including HPE, Dell EMC, IBM, Fujitsu and Supermicro. Cisco, a new entrant into the deep learning business, has also voiced its support, saying it plans to make T4-powered UCS servers available to its customers.

NVIDIA also used the GTC Japan event to announce a couple of significant Tesla GPU wins. The first is the deployment of a DGX-2 supercomputer cluster by Fujifilm, which will employ the system’s V100 GPUs to improve medical imaging diagnostics and support the company’s efforts in developing materials for displays and fine chemicals. The second is a planned NVIDIA GPU-based cloud installation by NTT Group (Nippon Telegraph and Telephone), the world’s fourth-largest telecom company. In this case, NTT is building a V100-powered AI resource, which will support natural language processing, traffic analysis and control, healthcare analytics, and network intelligence functions.

The GPU-maker also used the GTC gathering to unveil a bunch of other developments related to robotics, autonomous machines, and medical instruments. For details on those, check out the NVIDIA newsroom and its related blog site.
