How to Measure HPC


Measuring high performance computing can be very powerful for the businesses that rely on it and the end users that directly employ it. Based on NAG’s experience helping organizations with HPC measurement, we have put together this overview of the subject for TOP500 News.

Measuring usage and costs of your current HPC capability can inform service delivery and policies to extract the optimum science or business impact from your existing set-up. Collecting data on the usage patterns, costs, and value delivered by current HPC capability can also help ensure future decisions are optimal. This allows subsequent investments to be made with confidence, at the right scale, type, and timing for maximum value, and with well-understood risks. However, measuring the right things, and making sure they drive the best business or science impact, is a surprisingly complex undertaking.

Measuring technical performance seeks answers to basic questions such as: How fast is this supercomputer? How fast is my code? How does my code scale? Which system or processor is faster? The answers are delivered through benchmarking. The process involves running well-understood test cases under specified conditions or rules and recording the appropriate performance data and associated metadata (for example, time to completion, system configuration, code build details, library versions used, system core count, node population, and network topology, to name just a few).
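
As a rough illustration of recording performance data together with its metadata, the following Python sketch times a test case several times and stores the result alongside a few system details. The function `run_benchmark`, the stand-in workload `toy_kernel`, and the chosen metadata fields are assumptions made for this example; a real benchmark would run an actual application test case and record far more (build details, library versions, node population, network topology, and so on).

```python
import json
import platform
import time


def run_benchmark(test_case, label, repeats=3, extra_metadata=None):
    """Time a callable several times and return the timings plus metadata."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        test_case()
        timings.append(time.perf_counter() - start)
    return {
        "label": label,
        "repeats": repeats,
        "timings_s": timings,
        "best_s": min(timings),
        "metadata": {
            "python_version": platform.python_version(),
            "machine": platform.machine(),
            "system": platform.system(),
            # Caller-supplied details, e.g. core count or build flags.
            **(extra_metadata or {}),
        },
    }


def toy_kernel():
    # Hypothetical stand-in workload, not a real HPC benchmark.
    return sum(i * i for i in range(100_000))


record = run_benchmark(toy_kernel, "toy_kernel", extra_metadata={"core_count": 1})
print(json.dumps(record, indent=2))
```

Keeping the metadata in the same record as the timings makes it easier to tell later which conditions a given number was measured under.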

Benchmarks are only specific scenarios, but they are usually analyzed to extrapolate or infer more general behavior. This might include predicting the performance of a potential hardware upgrade or a new algorithm, or, alternatively, identifying a performance bottleneck. It might also be appropriate to measure the effort needed to achieve the desired performance, such as code porting and tuning.
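
One common way to turn raw benchmark timings into a more general picture of how a code scales is to compute speedup and parallel efficiency relative to the smallest run. The sketch below assumes a strong-scaling study (same problem size, increasing core count); the function name `scaling_table` and the timings are hypothetical and only illustrate the arithmetic.

```python
def scaling_table(timings):
    """Return (cores, speedup, efficiency) rows from {core_count: seconds}.

    Speedup and efficiency are measured relative to the smallest core count.
    """
    base_cores = min(timings)
    base_time = timings[base_cores]
    rows = []
    for cores in sorted(timings):
        speedup = base_time / timings[cores]
        efficiency = speedup / (cores / base_cores)
        rows.append((cores, speedup, efficiency))
    return rows


# Hypothetical wall-clock times in seconds for the same test case.
example_timings = {1: 100.0, 2: 52.0, 4: 28.0, 8: 16.0}

rows = scaling_table(example_timings)
for cores, speedup, efficiency in rows:
    print(f"{cores:>3} cores: speedup {speedup:5.2f}, efficiency {efficiency:6.1%}")
```

Falling efficiency at higher core counts is one hint that a bottleneck (serial work, communication, I/O) is starting to dominate.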

Non-technical performance measurement seeks answers to questions such as: How well is the HPC facility/service run? How well is it used? How fair is it? This includes usage metrics, of which the most common is utilization. However, utilization on its own is a very poor headline metric for two reasons. First, it only reflects usage of the processors, not of the other important and costly components such as memory, the interconnect, resiliency infrastructure, and so on. Second, it embodies a fundamentally flawed assumption that “busier” is the same as “delivering more value.”
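
To make the first point concrete, here is a minimal sketch that computes utilization for two different resources from the same job records. The job fields (`cores`, `mem_gb`, `hours`), the system capacities, and all figures are assumptions invented for the example; real accounting data from a scheduler would look different.

```python
def resource_utilization(jobs, resource, capacity, period_hours):
    """Fraction of the available resource-hours that jobs actually consumed."""
    used = sum(job[resource] * job["hours"] for job in jobs)
    return used / (capacity * period_hours)


# Hypothetical one-day job log on a system with 1000 cores and 4000 GB of memory.
jobs = [
    {"cores": 800, "mem_gb": 1000, "hours": 24},
    {"cores": 100, "mem_gb": 200, "hours": 12},
]

core_util = resource_utilization(jobs, "cores", capacity=1000, period_hours=24)
mem_util = resource_utilization(jobs, "mem_gb", capacity=4000, period_hours=24)

print(f"core utilization:   {core_util:.1%}")
print(f"memory utilization: {mem_util:.1%}")
```

With these made-up numbers the processors look busy while most of the memory sits idle, and neither figure says anything about the value of the work that was run.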

Measuring utilization of various system aspects does have a useful role to play in understanding system behavior.

Although they are different metrics, energy and power are often discussed as if they were the same thing. As a first approximation, power (peak demand) drives the cost of the infrastructure, whereas energy consumption drives the cost of ongoing operation. Each of these can be optimized, noting that optimized might not always mean reduced. For example, there may be quantifiable benefits to expending more energy to get a result faster. Optimization can be explored at hardware or software levels, or a combination of both.
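
The difference between power and energy can be shown with a small calculation: energy is average power multiplied by run time, while the infrastructure has to be sized for the highest peak power. The two configurations, the electricity price, and every figure below are hypothetical, chosen only to show a case where the faster run uses more energy.

```python
def energy_kwh(average_power_kw, runtime_hours):
    """Energy consumed is average power multiplied by run time."""
    return average_power_kw * runtime_hours


# Hypothetical ways of running the same job (all figures are made up).
configs = {
    "baseline": {"peak_power_kw": 120.0, "average_power_kw": 100.0, "runtime_hours": 10.0},
    "boosted": {"peak_power_kw": 180.0, "average_power_kw": 150.0, "runtime_hours": 8.0},
}
PRICE_PER_KWH = 0.10  # assumed electricity price, arbitrary currency

results = {}
for name, cfg in configs.items():
    energy = energy_kwh(cfg["average_power_kw"], cfg["runtime_hours"])
    results[name] = {
        "energy_kwh": energy,
        "energy_cost": energy * PRICE_PER_KWH,
        "runtime_hours": cfg["runtime_hours"],
    }
    print(f"{name}: {cfg['runtime_hours']:.0f} h, {energy:.0f} kWh")

# Power and cooling infrastructure must cope with the largest peak demand.
required_capacity_kw = max(cfg["peak_power_kw"] for cfg in configs.values())
print(f"infrastructure sized for {required_capacity_kw:.0f} kW")
```

Whether the boosted option is the better one depends on what the two hours saved are worth compared with the extra energy and the larger infrastructure.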

Fairness and funding concern how use of a facility is shared among multiple users, through things like scheduling or queue setups, priority handling, and access review or allocations. This also covers how the facility is funded, which can be subtly different from how its use is charged. These aspects affect usage patterns and can therefore help or hinder efforts to maximize the business impact of the HPC capability.

Cost measurement is an essential precursor to ensuring the costs are appropriate to the value generated, or to securing investment to unlock further business or science value. The most visible cost of HPC is the hardware purchase or, in the case of a cloud computing contract or a leased system, the monthly supplier invoice. However, the Total Cost of Ownership (TCO) has dozens of components even with a simplified treatment. These might include hardware, storage, maintenance, staff, electricity, facility costs, security, resilience, financing effects, application porting, software licenses, user training, and more. Consideration must be given to identifying which costs can be controlled, and by whom, which costs are borne by which budgets, and how this might change over time.
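
A heavily simplified TCO calculation might separate one-off costs from recurring annual costs and add them up over the service lifetime. The sketch below does only that; the cost categories are a small subset of those listed above, and the amounts, the four-year lifetime, and the function name `total_cost_of_ownership` are all assumptions made for illustration. It ignores financing effects, inflation, and changes in costs over time.

```python
def total_cost_of_ownership(one_off_costs, annual_costs, years):
    """Sum one-off costs plus recurring annual costs over the lifetime."""
    return sum(one_off_costs.values()) + years * sum(annual_costs.values())


# Hypothetical figures in an arbitrary currency.
one_off_costs = {
    "hardware": 1_000_000,
    "application_porting": 50_000,
}
annual_costs = {
    "maintenance": 80_000,
    "staff": 200_000,
    "electricity": 120_000,
    "software_licenses": 40_000,
}
LIFETIME_YEARS = 4

tco = total_cost_of_ownership(one_off_costs, annual_costs, LIFETIME_YEARS)
hardware_share = one_off_costs["hardware"] / tco

print(f"TCO over {LIFETIME_YEARS} years: {tco:,}")
print(f"hardware share of TCO: {hardware_share:.1%}")
```

Even in this toy example the hardware purchase is well under half of the total, which is why looking only at the most visible cost can mislead.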

Perhaps the most important thing to measure for HPC is the value generated. This can be essential in securing future investments, or for driving behavior within an organization. This is particularly relevant when considering expanding HPC use to facilitate new business initiatives or additional research projects.

The two main measures of value used in HPC are return on investment (ROI) and business or science impact. Fairly accounting for the return side of the ROI calculation is often an arduous task and is readily open to debate and challenge. ROI is useful, especially when weighing HPC against other investment choices. However, ROI alone is insufficient to properly capture the value of HPC. Business or science impact is normally used to address the broader value of HPC. It covers the key business decisions or capabilities that rely on HPC, and the advances in science leadership that are only possible with a certain HPC capability.
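
For reference, one common textbook definition of ROI is net return divided by the amount invested. The sketch below applies that definition to made-up figures; the article itself does not prescribe a formula, and in practice the hard part is agreeing on what counts as the return.

```python
def return_on_investment(total_return, investment):
    """ROI as net return divided by the amount invested."""
    if investment <= 0:
        raise ValueError("investment must be positive")
    return (total_return - investment) / investment


# Hypothetical figures in an arbitrary currency.
investment = 2_000_000
attributed_return = 5_000_000  # assumed value attributed to the HPC capability

roi = return_on_investment(attributed_return, investment)
print(f"ROI: {roi:.0%}")
```

A single ratio like this is easy to compare with other investments, but it says nothing about decisions or science that would have been impossible without the HPC capability.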

Comments

Steve Conway 2 weeks, 1 day ago

Very useful perspective, Andy. We're on the same mission. We recently completed three studies that attempt to measure the value/impact of HPC, for DOE, NSF and another gov't agency, respectively. One study captures financial info on HPC investments and returns; the second polls HPC centers and industrial users on value of partnerships; the third captures data on revenue and jobs.

Andrew 2 weeks ago

Yes, there is a lot of mission overlap. Your ability to report on sector-wide statistics and trends of the impact of HPC is a good complement to our focus on helping an individual organization understand their particular value, impacts, investments, etc. with respect to HPC.

Matt McKee 2 weeks, 1 day ago

Andy, great article and well thought-out message. I agree, utilization does not equal business value. There tends to be a misalignment of incentives within an organization with respect to a capitalized asset. Unlike a machine on a manufacturing floor, higher utilization does not always mean it's delivering the right (or more) value for the user base.

Andrew 2 weeks ago

Thanks Matt. Utilization is handy for diagnosing how the HPC asset is being used/operated, but is very poor for establishing the value it is generating. Lower utilization may still mean better value (due to flexibility, turnaround, focus on the important workloads, etc.). It is common that too high a utilization can actually constrain business value.

Raghav 2 weeks, 1 day ago

Nice article. Perhaps more education is needed on the sort of benchmarks that organisations can use today to evaluate the performance of their HPC systems. I feel that nowadays, with the wide variety of workloads and infrastructure available, evaluating the power and performance of an HPC system using just HPL is insufficient.

Andrew 2 weeks ago

Thanks. Indeed, HPL is only one of a wide range of possible standard benchmarks, and there are application codes that can be used too. I discuss this a little in my article for the print edition of Top500 News that will be available at SC next week.

In terms of education, our Mon morning tutorial on "HPC Acquisition and Commissioning" at SC17 has a section on benchmarking, including how to select benchmarks.

Plus, of course, NAG HPC consulting helps organizations with their particular performance needs (whether benchmarking or non-technical, e.g., TCO).
