TPUs Go Public on Google Cloud

Feb. 12, 2018

By: Michael Feldman

Google has announced a beta program to make its in-house Tensor Processing Units (TPUs) available to cloud customers, marking the first time a custom-built machine learning chip will be accessible to a wide array of users.


TPU servers. Source: Google


The TPU board Google is offering is built from four custom ASICs and offers up to 180 machine learning teraflops, along with 64 GB of high bandwidth memory. It is specifically designed to run the computationally intensive machine learning algorithms that support well-known web applications such as language translation, text searching, and ad serving. According to the company, the TPU is able to accelerate these workloads significantly and use a fraction of the power required by conventional GPUs and CPUs.

For the most part, these TPUs have been used internally at Google to support the kinds of web services mentioned above and to perform cutting-edge AI research, such as the AlphaGo effort. In May 2017, when the company revealed the deployment of its second-generation TPUs, it said it would make 1,000 of these boards available to machine learning researchers for free. At the time, Google also said that it would soon be offering these processors in its public cloud. That day has arrived.

As of Monday, customers can rent a TPU board for $6.50 per hour. At this point, they are available in limited quantities, so if demand overwhelms what has been allocated for the beta program, customers will have to queue up. The blog post announcing the program referenced a couple of early access customers, namely Lyft and Two Sigma. Apparently, Lyft used the TPUs to refine navigation intelligence for the company’s self-driving car program, while Two Sigma employed the chips for unspecified “deep learning research.” Both companies noted impressive speedups that were enabled by the Google TPUs.

To make the hardware more customer-friendly, Google has open-sourced a handful of machine learning models for the TPU, including ResNet-50 for image classification, Transformer for language processing, and RetinaNet for object detection. For applications outside of those domains, TPUs can be programmed at a somewhat lower level using the TensorFlow APIs Google provides.

Google’s strategy here is not entirely clear. The company has traditionally offered GPUs to its public cloud customers looking to accelerate machine learning applications (as well as traditional HPC workloads) and is currently offering NVIDIA Tesla P100 and K80 GPUs for this purpose. Google also announced that the latest V100 processors are “on the way.”

In the Google Cloud, the P100 and K80 currently rent for $1.46 and $0.45 per chip per hour, respectively (US pricing). Both, though, offer considerably less peak performance than the TPU from a machine learning perspective. For these codes, the P100 provides about 10 teraflops at single-precision floating point and 20 teraflops at half precision, and only 16 GB of high bandwidth memory. The K80 comes in at 8.73 teraflops at single precision and no high bandwidth memory. So when you look at the $6.50 per hour price for the TPU board, that actually looks like a pretty good deal, at least according to the raw numbers.
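To make the "raw numbers" comparison concrete, the short Python sketch below divides each accelerator's quoted peak teraflops by its quoted hourly price. The figures come straight from the paragraph above; the names `ACCELERATORS`, `teraflops_per_dollar`, and `RATIOS` are illustrative only. Note that the entries mix precisions (the TPU's "machine learning teraflops" versus half and single precision on the GPUs) and that peak throughput says nothing about delivered performance, so this is a rough yardstick rather than a benchmark.

```python
# Peak throughput and hourly prices as quoted in the article (US pricing, Feb. 2018).
# Precisions differ between entries, so the ratios are only a rough comparison.
ACCELERATORS = {
    "TPU board": {"teraflops": 180.0, "usd_per_hour": 6.50},
    "P100 (half precision)": {"teraflops": 20.0, "usd_per_hour": 1.46},
    "P100 (single precision)": {"teraflops": 10.0, "usd_per_hour": 1.46},
    "K80 (single precision)": {"teraflops": 8.73, "usd_per_hour": 0.45},
}


def teraflops_per_dollar(teraflops, usd_per_hour):
    """Peak teraflops obtained per dollar of hourly rental cost."""
    return teraflops / usd_per_hour


RATIOS = {name: teraflops_per_dollar(**spec) for name, spec in ACCELERATORS.items()}

# Print the accelerators from best to worst peak teraflops per dollar-hour.
for name, ratio in sorted(RATIOS.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<26} {ratio:5.1f} peak teraflops per $/hour")
```

By this measure the TPU board comes out at roughly 27.7 peak teraflops per dollar-hour, against about 19.4 for the K80 and about 13.7 for the P100 at half precision.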

But when the V100s get deployed, Google will have to price them rather differently, since each of these accelerators delivers about 120 machine learning teraflops – within spitting distance of the TPUs – but still with just 16 GB of high bandwidth memory. It’s a good bet that the company will rent those GPU instances out for a good deal more than the P100, but somewhat less than the TPU. But maybe not much less.
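One way to put a number on "somewhat less than the TPU" is to ask what hourly price would give a V100 the same peak teraflops per dollar as the TPU board. The sketch below does that arithmetic with the figures quoted in this article. The function name `break_even_price` is hypothetical, and the result is an illustration under the assumption that price tracks peak throughput, not a prediction of Google's actual pricing.

```python
# Figures quoted in the article; the V100 had no published Google Cloud price yet.
TPU_TERAFLOPS = 180.0
TPU_USD_PER_HOUR = 6.50
V100_TERAFLOPS = 120.0


def break_even_price(teraflops, ref_teraflops, ref_usd_per_hour):
    """Hourly price at which a chip matches the reference's peak teraflops per dollar."""
    return ref_usd_per_hour * teraflops / ref_teraflops


v100_break_even = break_even_price(V100_TERAFLOPS, TPU_TERAFLOPS, TPU_USD_PER_HOUR)
print(f"V100 break-even price: ${v100_break_even:.2f} per hour")
```

Under that assumption, a V100 priced above roughly $4.33 per hour would be a worse deal than the TPU board on peak throughput alone, while anything below it would undercut the TPU.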

If Google’s long game is to lessen its dependence on third-party silicon for machine learning customers, it will continue to price its TPUs more favorably. Its motivation to do so is two-fold: (1) to create a differentiated cloud offering for machine learning customers that none of the other cloud providers can offer and (2) to squeeze out the middleman for its AI silicon.

For this to work, though, Google needs to significantly expand software support for the TPU. In particular, it needs to offer more shrink-wrapped machine learning models for the platform, as well as provide support for additional deep learning frameworks besides TensorFlow. Ultimately, though, Google’s success will hinge on making the TPU a superior platform for machine learning compared to GPUs and other custom processors built for these applications. It can do so by outrunning its competition with higher performance or better energy efficiency, but ideally both. It’s conceivable that the current TPU platform claims both of those titles today, but that lead is not likely to last unless Google has an aggressive roadmap to keep its silicon out in front.

That might be the most difficult challenge for Google or indeed any company in the custom-built machine learning business. NVIDIA has more expertise and depth in designing these types of processors than any other chipmaker on the planet. The fact that NVIDIA has, up until now at least, chosen to put its machine learning circuitry on a more conventional GPU design doesn’t mean it will do so in the future. It’s plausible that a chip using the Tensor Core technology of the Volta architecture, and jettisoning all the double precision and graphics circuitry, could be built that would outperform a TPU. In fact, if the market demands such a chip, NVIDIA is almost sure to oblige.

For the time being, Google is probably content to run its TPU cloud offering as an experiment to gauge customer demand and gather use cases. Later this year, the company is planning to scale up the effort and offer complete TPU pods to cloud customers. A pod contains 64 TPUs connected with a custom high-speed network, delivering a stunning 11.5 petaflops. That provides a level of deep learning horsepower that can currently only be found in a handful of supercomputers sitting in government facilities like the Swiss National Supercomputing Centre and Oak Ridge National Lab.
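The pod figure follows directly from the per-board number quoted earlier, as this small Python check shows. It assumes that each of the 64 TPUs in a pod is a 180-teraflop, 64 GB board of the kind described above; the aggregate memory total is derived on that assumption and is not a figure Google has stated here. The variable names are illustrative.

```python
# Per-board figures quoted earlier in the article.
TERAFLOPS_PER_BOARD = 180
MEMORY_GB_PER_BOARD = 64
BOARDS_PER_POD = 64  # assumes the pod's "64 TPUs" are 64 such boards

pod_teraflops = TERAFLOPS_PER_BOARD * BOARDS_PER_POD
pod_petaflops = pod_teraflops / 1000  # 1 petaflop = 1,000 teraflops
pod_memory_gb = MEMORY_GB_PER_BOARD * BOARDS_PER_POD  # derived, not quoted

print(f"Pod peak throughput: {pod_petaflops:.2f} petaflops")
print(f"Pod aggregate memory: {pod_memory_gb} GB")
```

The product, 11.52 petaflops, matches the 11.5 petaflops Google quotes for a pod.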

Given Google’s enormous depth in machine learning and its boundless computing infrastructure, the TPU has the potential to be a serious competitor to GPUs. But it surely won’t be the last. The age of purpose-built machine learning chips is just beginning.