Nov. 21, 2016
By: Michael Feldman
At SC16, during a birds-of-a-feather (BoF) session held Wednesday afternoon, a room full of supercomputing enthusiasts listened attentively to the latest developments at the Exascale Computing Project (ECP). Department of Energy (DOE) representatives were on hand to deliver updates on the software and hardware efforts that the project is undertaking.
The ECP is the US DOE program to develop supercomputers for the post-petascale era. It owes its existence to the Obama administration’s National Strategic Computing Initiative (NSCI), the focus of which is to extend the nation’s high-end computing capabilities. The ECP’s specific role is to manage the R&D effort to accelerate the needed hardware and software technologies so that when the DOE starts ordering up its initial exascale machinery in a few years, the vendors will be ready to deliver.
With regard to those initial systems, there’s no particular news. According to ECP Director Paul Messina, the deployment timeline for the US remains the same. The first exascale systems are expected to be installed in 2022, with the idea that they will go operational in 2023. Specs are the same as well: an exaflop or better performance on real applications, an upper power limit of 20 to 30 MW, and an average time between application faults of 6 days or more. In addition, the program remains committed to developing two distinct hardware architectures – a number that really only makes sense when you consider that the current US HPC landscape is split between Intel and the IBM-NVIDIA-Mellanox troika.
The systems are being developed under the co-design paradigm, where software and hardware are designed in a holistic manner. ECP made its first big investment in that direction just recently, announcing funding for four exascale co-design centers. The $48 million award will be spread out over the four centers over a period of four years. The centers are being established at some of the most prominent DOE national labs, namely, Argonne, Lawrence Berkeley, Los Alamos, and Lawrence Livermore.
Another pot of ECP money was awarded recently to develop the exascale software stack. In this case, $34 million will be distributed across 35 software development efforts at national labs and at universities. The software stack includes items like developer tools, programming models, runtime libraries, data management frameworks, I/O, analytics, visualization, and low-level system software. A couple of examples are the Exascale MPI work being done at Argonne National Laboratory and the Exascale Code Generation Toolkit being developed at Lawrence Livermore and Pacific Northwest national labs, plus Ohio State and Colorado State. The complete list is here.
The challenges to the software stack are considerable and are related to the expected size and architectural complexity of the systems. But according to Rajeev Thakur, ECP Software Technology Director, “the main challenge is the co-design and integration of various components of the software stack with each other, with a broad range of applications, with emerging hardware technologies, and with the software provided by system vendors.”
Follow-on funding on the software stack is already anticipated. Over the next few months the software technology team is going to figure out what’s missing from the stack and issue a new round of RFIs and RFPs to close any gaps that are apparent.
The hardware R&D, which is being done under the PathForward program, is also moving along. According to Jim Ang, ECP Hardware Technology Director, they are currently evaluating a number of proposals from vendors and will announce the selected winners in early 2017. This initial work will be to come up with conceptual system designs and demonstrate technologies that can deliver additional performance beyond that anticipated in current vendor roadmaps.
The biggest technical challenges on the hardware side are energy efficiency, resiliency, and data bottlenecks, especially at the memory interface. Another complication for the hardware designers is the co-design aspect, given that the architecture is being influenced by the application and system software development work. The fact that all of these efforts – hardware, application software, and system software – are in motion simultaneously creates a three-way synchronization challenge.
At the completion of the PathForward design work toward the end of 2018, funding for non-recurring engineering (NRE) efforts will be awarded to vendors involved in delivering the first systems. Although the ECP itself is not involved in the procurement of the initial exascale systems, it will provide funding to close any gaps in the vendors’ product roadmaps that stand in the way of the first exascale deliveries.
It’s not completely clear if the vendors actually need a funding boost from the government to fast-track their roadmaps. The surging demand for things like machine learning, hyperscale infrastructure, and mobile computing is accelerating some of these technologies organically. On the other hand, their full development and integration into an exascale platform is likely to require some government backing, given that vendors will only sell a handful of such systems over the first few years. As ECP Exascale System Director Terri Quinn put it at the SC16 BoF: “we’re going to sweeten the pot.”
Most of the ECP funding has yet to be spent. The general consensus is that it will take something like $1 billion for the entire project, which is what the Japanese intend to spend on their Post-K exascale system scheduled for deployment in 2021 or 2022. Given the timeframe involved for the US effort and the manner in which federal funding is allocated, not all of the ECP money is in place yet.
There is some concern about how this funding will proceed over the next four years during the Trump administration. Someone from the audience did raise that concern at the SC16 BoF, but Messina was cautiously optimistic and said he was going to take it one day at a time. He pointed out that in general there has been bipartisan support for these types of programs.
From our perspective here at TOP500, we’re guessing that Trump is going to think this whole exascale thing is just “terrific.”