Managing HPC Failures Takes Forethought

June 19, 2016

By: Andrew Jones, NAG Group

Things do go wrong. I was recently on a train journey from Liverpool to London. Normally this is a two-hour direct service, but a flooded line turned it into a five-hour excursion, by which time the meeting had finished without me. At the time, with frequent information announcements, passable WiFi (paid), free drinks proactively distributed (water only), food available (paid), and most importantly, electricity sockets, the five hours passed with less distress than, in hindsight, I might have expected. The lesson is that when things go wrong, what matters is how they are dealt with.

HPC systems, and their life support environment of datacentres, people and associated infrastructure, are no different. They form part of a service to their user communities. Even in an explicitly stated “non-production” service, users still develop an expectation of availability, performance, and other features. Such systems are complex with many interacting components, often primarily designed for other markets, but which have been evolved for duty in HPC. These systems usually add scale to the mix as well, in the sense of sheer number of components. In addition, many users are seeking to use the system near its limits of performance. It is no surprise, then, that these machines fail. Indeed, within this context, HPC systems might be expected to fail more often than they actually do.

The relative stability of such complex systems at scale is testament to the designers of HPC systems. This includes those who design the components and system-level products, and those who architect these into deployed capability for users. Of course, like anything else, this resilience has a cost impact. Building resilience and reliability into components, systems and HPC services costs extra. This takes the form of investments in R&D, manufacturing costs, increased power consumption, over-provisioning, and increased I/O for checkpoints, to name a few.

In my HPC consulting work, I do encounter (although not often) explicit consideration of different resilience and reliability options – different quality of components vs. over-provisioning vs. service recovery mechanisms, and so on. However, I have not yet seen a study measuring the actual utilization of those resilience features against the costs involved. I have no reason to doubt the “normal” balance is right. HPC systems do fail, but not so often as to significantly impair overall user productivity or the reputations of said systems.

There are two aspects to managing failure in HPC services. The first, as discussed so far, is designing and accepting an appropriate level of resilience and reliability. The second is communication with users and other stakeholders, especially the funders. This includes communicating what level of reliability and availability the service is designed for -- setting expectations is crucial -- and where necessary, providing training and guidance for the users as to how they can additionally protect their data and workflows.

It also includes communicating with users and other stakeholders when system failures do occur. As on my train ride, effective communication goes a long way to smoothing the experience for those affected. This means conveying what the problem is and an estimate of how long it will be before normal service is resumed, even if qualified as a “best guess.” This is especially true if it is backed up by covering immediate needs. In my train example, those were WiFi, water and electricity; for affected HPC users, it might be help finding and using alternative facilities.

Planning around failures in HPC doesn’t draw the same community and media attention as the latest processor wars or the biggest system announcements, but I think it is one of those aspects that quietly sustains the positive reputation of HPC with users. I look forward to discussions at ISC 2016 with HPC technology providers, architects and service managers around the resiliency and reliability features and their cost and utilization.