Supercomputing Gets a Shot in the ARM

Aug. 23, 2016

By: Michael Feldman

The prospects for another serious rival to the x86 processor in the high performance computing space are looking much better this week after ARM Holdings presented the company’s plan to offer an HPC version of its 64-bit architecture. Known as ARMv8-A SVE, the design incorporates a technology known as the Scalable Vector Extension (SVE), which will provide a unique type of flexibility with regard to vector processing -- the basis of many scientific and engineering workloads.

The architecture will enable any ARM licensee to build something akin to a Xeon Phi processor, that is, a manycore CPU with a boatload of vector processing and floating point capacity. Such a development changes the game for ARM in the HPC realm and allows it to serve as a standalone processor in a supercomputer, without the benefit of a GPU or other accelerator to provide most of the FLOPS.

The new ARMv8-A SVE design was described by ARM architect Nigel Stephens at the Hot Chips symposium, which is taking place in Cupertino, California this week. In his talk, he went through the vector extension capability in some detail, including instruction formats and how a vectorizing compiler would work in conjunction with the hardware. In a nutshell, SVE will support vector lengths between 128 and 2048 bits, with each implementation free to choose any length in that range in 128-bit increments.
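To make that range concrete, here is a minimal Python sketch that enumerates the lengths the article quotes and counts how many 64-bit or 32-bit elements fit in each. The function names are hypothetical and chosen only for illustration; nothing here is part of any ARM tool or API.

```python
def sve_vector_lengths(min_bits=128, max_bits=2048, step=128):
    """Return every vector length (in bits) in the range quoted above."""
    return list(range(min_bits, max_bits + 1, step))


def lanes_per_vector(vector_bits, element_bits):
    """Number of elements of the given width that fit in one vector register."""
    return vector_bits // element_bits


if __name__ == "__main__":
    # Print a small table: vector length, doubles per vector, floats per vector.
    for bits in sve_vector_lengths():
        doubles = lanes_per_vector(bits, 64)
        floats = lanes_per_vector(bits, 32)
        print(f"{bits:4d} bits: {doubles:2d} doubles, {floats:2d} floats")
```

That works out to 16 possible lengths, ranging from two double-precision values per vector at 128 bits to 32 at 2048 bits.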

Up until now, ARM vendors looking for indigenous vector and floating point capability had to rely on the SIMD capability, known as NEON. On current 64-bit ARMv8-A processors, NEON offers 128-bit processing of fixed-point and floating point data, which in today’s world is fine for general-purpose computing, but inadequate with regard to HPC. SVE literally picks up where NEON leaves off and pushes the vector limit all the way out to 2048 bits.

But vector length is only part of the story. The idea is that an SVE compiler will auto-vectorize the code, slicing or aggregating the application’s data into the available vector registers and then executing the appropriate vector operations on that data in tight loops. In cases where the vector size allows multiple data items to be packed into a register and processed en masse, execution time can be reduced significantly.

At least that’s the theory. In real life, vectorizing code can be quite tricky since not everything ends up aligning neatly on these 128-bit boundaries or is able to use simple loops to parallelize execution. In such cases, one must resort to laborious hand-coding. Nevertheless, such a flexible auto-vectorization capability is likely to be useful in many cases. And the fact that very large vector lengths of up to 2048 bits are possible offers some interesting possibilities for codes able to take advantage of it.
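A rough way to see both the payoff and the alignment problem is to count loop iterations. The sketch below is a back-of-the-envelope model, not a performance prediction: it assumes one vector operation per iteration and 64-bit elements by default, and the function name is hypothetical.

```python
def vector_iterations(n_elements, vector_bits, element_bits=64):
    """Split a loop over n_elements into full-vector iterations and a remainder.

    Returns (full_iterations, leftover_elements). The leftover elements are
    the ones that do not fill a whole vector and need separate handling.
    """
    lanes = vector_bits // element_bits
    full_iterations, leftover_elements = divmod(n_elements, lanes)
    return full_iterations, leftover_elements


if __name__ == "__main__":
    for n in (1000, 1003):
        for bits in (128, 512, 2048):
            full, rest = vector_iterations(n, bits)
            print(f"n={n}, {bits}-bit vectors: {full} full iterations, {rest} left over")
```

With 512-bit vectors, 1,000 doubles divide evenly into 125 iterations, while 1,003 doubles leave three elements over. Those stragglers are the kind of case that complicates vectorization in practice.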

Keep in mind that the vector bit length for CPUs today is nowhere near 2048. For example, it’s currently 512 bits for Intel’s latest Xeon Phi, 128 bits for IBM’s Power8, and 256 bits for Fujitsu’s SPARC64 IXfx. ARM’s 2048-bit size was presumably provided as an upper limit to future-proof the technology to some extent, recognizing that the general trend in HPC has been toward longer vectors.

As Stephens pointed out in his presentation, and reiterated in a blog he posted this week, there are a number of factors that must be weighed when choosing a processor’s vector size. Those include performance/power tradeoffs of maintaining longer versus shorter vectors; the data processing characteristics of the application set that you’re running; and the demand for longer vector sizes in the future, in both HPC and markets adjacent to it -- machine learning, for example.

SVE will also support what ARM is calling “vector-length agnostic (VLA) programming,” which allows you to compile your code a single time and subsequently run it on any SVE-supported platform regardless of the vector length offered in the hardware. A VLA compiler will generate code that adapts at runtime to whatever vector length the hardware provides. If the compiler supports such a feature, it will enable a single executable to run on different ARMv8-A SVE platforms without recompilation. That may seem like overkill, but in some cases the user may not be able to recompile the application because the source code -- or even the compiler -- was lost or is otherwise inaccessible.
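The flavor of vector-length agnostic code can be sketched in plain Python. This is only a simulation of the idea, not SVE instructions or compiler output: the routine below (a hypothetical `daxpy_vla`) never hard-codes a vector width, takes the width as a runtime parameter, and limits the final iteration to the elements that remain. How real SVE hardware handles that tail is not covered in the article, so the approach shown here is an assumption.

```python
def daxpy_vla(a, x, y, vector_bits, element_bits=64):
    """Compute a*x + y element-wise, stepping through the data one 'vector' at a time.

    vector_bits stands in for the hardware vector length, which the same
    code discovers at runtime instead of fixing at compile time.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")

    lanes = vector_bits // element_bits  # elements processed per iteration
    n = len(x)
    result = list(y)

    i = 0
    while i < n:
        # Assumption: only the lanes that map to real data are active,
        # so the last iteration may use fewer lanes than the hardware offers.
        active = min(lanes, n - i)
        for lane in range(active):
            result[i + lane] = a * x[i + lane] + y[i + lane]
        i += lanes
    return result


if __name__ == "__main__":
    x = [float(i) for i in range(37)]
    y = [1.0] * 37
    # The same routine gives the same answer at every simulated vector length.
    outputs = [daxpy_vla(2.0, x, y, bits) for bits in (128, 512, 2048)]
    print(all(out == outputs[0] for out in outputs))
```

The point of the sketch is that nothing in the loop depends on a particular width, which is what would let one binary serve hardware with different vector lengths.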

All of this focus on vector processing might seem a bit overdone, especially considering the more urgent challenges in HPC regarding memory performance and capacity, power consumption, and application scalability. When’s the last time vector length was even mentioned in an HPC product brochure? Nevertheless, the absence of a vector unit, and the associated capability for high performance floating point processing, was probably the biggest missing piece of the ARM architecture as far as its broad suitability for HPC codes is concerned.

Although the new ARMv8-A SVE architecture won’t be completely spec’ed out until later this year or perhaps early next year, the design has already notched a significant win. Fujitsu has selected it for its Post-K exascale supercomputer, which is scheduled to be operational in 2020 at RIKEN, Japan’s premier research institution. At ISC 2016 in June, Fujitsu unveiled its plans to build a special HPC processor based on a future vector-powered ARM architecture, foreshadowing this week’s revelations. The move to ARM represented a break in Fujitsu’s tradition of using its proprietary SPARC64 processors to power its top-of-the-line supercomputers.

Whether other vendors follow suit remains to be seen. It’s reasonable to speculate that Cavium and Applied Micro, the two ARM chip vendors that reliably show up at the two major supercomputing conferences, will probably license the technology at some point. ARM server processor newbies Qualcomm, Broadcom, and even AMD could theoretically use the design as an entry point into supercomputing, but Qualcomm and Broadcom are currently focused on more mainstream areas of the server market, while AMD is directing all its energy into its x86 and GPU lines.

And of course none of this ensures ARM a big slice of the HPC pie. But compared to a few years ago, when ARM Holdings began its foray into 64-bit computing, the architecture now at least looks to be a viable contender alongside OpenPower. While the ecosystem of tools, compilers, and libraries for ARMv8-A SVE, not to mention just plain ARMv8-A, has a long way to go, ARM’s licensing model is probably the most accessible in the industry. In the long run, that could turn out to be the difference.