My HPC experience spans four decades, and I never cease to be amazed at all the considerations involved when buying a new HPC system. Of course, I have spoken with directors of HPC centers at universities, research institutes, and commercial enterprises about their centers and procurements over the years. However, based on recent conversations with Andrew Jones, VP of HPC Consulting at NAG, I became interested in documenting insights into how organizations navigate the opportunities, complexity, and risks when investing in a new HPC system.
Making the Best HPC Investments
HPC investments, including purchases of hardware and software, as well as time spent developing and supporting HPC applications, involve numerous options and trade-offs, meaning that making optimal decisions is far from easy.
It should not be surprising that organizations employ internal and external help during the procurement process. Internal expertise, vendor advice, and independent external experience are all important, and the best HPC managers know to leverage all three. Well-informed decisions lead to more effective deployments, with properly managed risks, which means higher impact from those investments.
Conversely, relying on only one of these (e.g., internal staff only) greatly increases the risks surrounding the investment. For example, engaging with vendors only for the transaction of a specific purchase fails to take advantage of their expertise. Likewise, going without external advice and experience risks wrestling with issues that others have already solved or could have warned about.
Making a buying decision involves risks in many dimensions. Typically, you will be considering several options, probably from different vendors. Vendors do an excellent job of promoting their own platforms, and sometimes offer views on the shortcomings of their competition - no surprise there. Thus, it is the job of your team, comprising internal and external expertise, to assess your needs, compare your options, identify risks, and determine a balanced and justifiable way forward.
Keys to “de-risking” include a deep and honest review of:
Competitive assessment: your own capabilities and shortcomings, and available resources to elevate your capabilities to the right level. Explicitly and effectively leveraging external expertise is a key quality enabler for every organization using HPC, no matter what size, or how mature.
Requirements and Benchmarks: your requirements and how vendors can best meet them. Quantify performance on real or expected workloads as accurately as possible.
Lifetime costs: the full costs including people, power, infrastructure, software, measurement, project delivery, and maintenance.
Timing: the availability of technologies which affect your capabilities and your competition. Being too early, or too late, can substantially affect competitiveness. Phasing of delivery might be a powerful option to upgrade a system to use new technologies as they become available. Knowledge of a vendor’s longer-term roadmap can be important to manage risks.
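The lifetime-cost item in the list above lends itself to a small calculation. The sketch below is only an illustration: the `SystemOption` fields, the `lifetime_cost` function, and every figure in it are hypothetical, and it covers only some of the cost categories named above (people, power, infrastructure, software, and maintenance). A real assessment would add the rest and use your own numbers.

```python
from dataclasses import dataclass


@dataclass
class SystemOption:
    """One candidate system. All fields are illustrative assumptions."""
    name: str
    purchase_price: float          # one-off hardware and bundled software
    infrastructure_one_off: float  # one-off facility work (power, cooling, floor)
    power_kw: float                # average draw, including cooling overhead
    staff_per_year: float          # people needed to run and support the system
    software_per_year: float       # recurring licences and support
    maintenance_per_year: float    # recurring hardware maintenance


def lifetime_cost(opt: SystemOption, years: int, price_per_kwh: float) -> float:
    """Sum one-off, energy, and recurring costs over the service life."""
    hours = years * 365 * 24
    energy = opt.power_kw * hours * price_per_kwh
    recurring = years * (
        opt.staff_per_year + opt.software_per_year + opt.maintenance_per_year
    )
    return opt.purchase_price + opt.infrastructure_one_off + energy + recurring


# Hypothetical option, not taken from any real procurement.
option_x = SystemOption(
    name="X",
    purchase_price=2_000_000,
    infrastructure_one_off=250_000,
    power_kw=200,
    staff_per_year=300_000,
    software_per_year=50_000,
    maintenance_per_year=100_000,
)

total = lifetime_cost(option_x, years=5, price_per_kwh=0.10)
print(f"{option_x.name}: lifetime cost {total:,.0f}, "
      f"of which purchase price is {option_x.purchase_price / total:.0%}")
```

Even with made-up figures, a calculation like this shows how far the purchase price can sit below the full lifetime cost, which is why comparing options on purchase price alone can mislead.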
Competitive assessments – Know thyself
Understanding the competitiveness of your plans can be essential to getting the best impact. It is easy, and common, for organizations that aim for a leadership position to lack a realistic view of what that requires – or worse, to be so sure of themselves that they don’t even check!
When I asked NAG’s Andrew Jones about this, he offered this: “We conduct impartial reviews for customers around the world through our HPC consulting, with a wide range of results. We encounter organizations who believe they are falling behind, but are actually leading in some or many aspects of their HPC. We also encounter organizations who are self-assured they are at the leading edge, but are actually trailing in many aspects of their HPC. Sadly, the latter group are the least likely to ask for the external help they need!”
A formal review of the competitiveness of an organization’s HPC strategy and operational delivery can provide a framework for key topical questions such as “cloud vs. in-house systems.” These questions should be addressed ahead of the procurement (even if the outcome is to structure a procurement to allow either option). Once an organization moves to procurement, it is too late for such key directional conversations without incurring risks, potential delays, and last-minute challenges from senior management.
Many organizations perform regular competitive reviews. A review might be done every five years to help keep an organization at its best by providing independent advice on its HPC operations. Or it could be done in the context of specific investment planning – e.g., to help build a business case for the next procurement.
In either case, such a review will investigate all aspects of an HPC operation, including actual usage, user needs, budgets, queuing structures for machines, user satisfaction, utilization, architecture, planning, quality of engagements, connections with academic centers, vendor engagements, risk assessments, in-house vs. cloud decisions, metrics, and more.
Some organizations run their own reviews, and others engage external specialists to help them. Either way, it seems that the best-run HPC operations have some form of regular review, or internal audit, to guide them. Internal reviews, however, are prone to self-promotion. To avoid this bias, an unbiased external review is essential to offer the best competitive understanding of where the organization is truly leading, and where improvements can be made.
Requirements and Benchmarks
Nothing is more important than relating your HPC investment decision to your organization’s needs and goals. One critical way of accomplishing this is with “benchmarks.” These should be representative of the actual workloads you expect to run on the procured machine.
It is important to remember that benchmarks are only ever an approximation for real workloads. But, when properly used, they can give valuable data on likely performance for the workloads that are important to you, and on how much difficulty is involved in getting that performance.
Andrew told me “When doing benchmarking studies, we avoid labelling procurement options as binary good or bad. As important as the performance figures themselves, is qualifying the effort required to get that performance, and an understanding of the architectural reasons behind the performance. In particular, we work to find information that connects a buying decision to actual needs and risks of meeting those needs. Whilst we do a lot of benchmarking ourselves, often the benchmarking is done by customer staff, or by the vendors, and our value is in helping the HPC leaders to clarify the meaning of the benchmark data, joining the dots between various sides’ viewpoints and the business decisions that need to be made based on the technical data.”
The highest performance (whether peak or benchmark) or the lowest price/performance should not, by itself, guarantee a win. While both are very important, there are other factors, such as the ease of porting applications, any constraints (e.g., full performance only when the problem fits within the high-bandwidth memory), variability, and more.
It is important to figure out these constraints. No company runs only a single code. Even in a situation where a single application dominates, it will usually consist of numerous algorithms that need analysis. Therefore, in a procurement it is important to understand the performance possibilities as well as the potential performance losses.
It is often the case that of a set of benchmarks 1, 2, 3, 4, 5, and 6, it turns out that “1, 2, 3 are fastest on machine A” and “4, 5, 6 are fastest on machine B.” However, it is important to assess the performance losses of “1, 2, 3 on machine B” and of “4, 5, 6 on machine A”, since we will choose a single machine. In other words, it is not only the best results that buyers need to worry about, but the potential worst performances.
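The point about worst-case losses can be sketched in a few lines. The example below uses invented runtimes for six benchmarks on two machines and reports, for each machine, how much slower it is than the best machine on every benchmark. The names `runtimes`, `slowdowns`, and `worst_case` are hypothetical, and the numbers are not measurements of any real system.

```python
# Hypothetical runtimes in seconds (lower is better); not measured data.
runtimes = {
    "A": {"b1": 100, "b2": 80, "b3": 120, "b4": 300, "b5": 260, "b6": 400},
    "B": {"b1": 130, "b2": 100, "b3": 150, "b4": 200, "b5": 130, "b6": 250},
}


def slowdowns(runtimes):
    """Per machine and benchmark: runtime divided by the best runtime
    any machine achieved on that benchmark (1.0 means 'fastest')."""
    benchmarks = list(next(iter(runtimes.values())))
    best = {b: min(times[b] for times in runtimes.values()) for b in benchmarks}
    return {
        machine: {b: times[b] / best[b] for b in benchmarks}
        for machine, times in runtimes.items()
    }


def worst_case(runtimes):
    """Largest slowdown each machine suffers on any single benchmark."""
    return {m: max(s.values()) for m, s in slowdowns(runtimes).items()}


for machine, factors in slowdowns(runtimes).items():
    wins = [b for b, f in factors.items() if f == 1.0]
    print(f"machine {machine}: fastest on {wins}, "
          f"worst slowdown {max(factors.values()):.2f}x")
```

In this made-up data each machine wins three benchmarks, yet machine A is twice as slow as B on one of the benchmarks it loses, while B is never more than about 1.3 times slower than A. Counting wins alone would hide that difference. In practice you would also weight each benchmark by its share of the real workload, which this sketch deliberately leaves out.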
The decision might be between a system that does “good enough” on the majority of the workload and a system that excels on some workloads but crawls on others. Proper analysis needs to judge whether a given choice has a reasonable chance of performing well, or whether expecting acceptable performance from it is unlikely or risky. Experts can help you understand the level of performance to expect and the effort needed to reach it.
The Procurement Pinches: Capacity and Experience
It seems to me that most customers can run their own procurements. So, I asked Andrew why anyone would hire NAG to help. He told me, “Yes, many organizations have the ability to do this in house. We help with capacity and experience. We normally do this in partnership with organizations. We augment their team, acting as a temporary enhancement of their capacity and experience. Most customers are only buying a new machine every couple of years, whereas we are involved in HPC planning and procurement projects on a continuous basis. Plus, we are also seeing what others are doing across a diverse range of projects, which helps us accumulate trends on solutions, risks and possibilities. Fundamentally, our customers get a de-risking by bringing us in. They are gaining years of diverse experience, not just the few days or weeks we spend with the customer.”
Of course, highly reputed as they are, NAG isn’t the only source of outside help. My main message in this article is that finding, and leveraging, experiences from others should be part of any procurement process. Exercise the experts in your circles to help you!
Experienced buyers know to seek help
I find it interesting that it is the more experienced HPC leaders that consistently seek additional opinions and help. When I asked Andrew if they helped mostly inexperienced customers, he told me: “No, quite the opposite. Think of the big HPC users - oil majors, large aerospace companies, formula one teams, manufacturing type organizations, and large HPC centers. We can add value to HPC investment projects as small as a few $100k or as large as those spending over $100m. We are often engaged by organizations who have substantial internal HPC experience – indeed because of their maturity they see the value in de-risking and innovating using our independent specialist expertise.”
This led me to ask Andrew: “If I was going to spend only $100,000 on a machine, how would I gain value by hiring you?” Andrew’s response was honestly put: “Actually, at the scale of $100K HPC investment spend, you probably should not hire us. We would be happy to talk with you, perhaps offering some generic advice remotely. But, at that scale, our integrity means we would tell you that the costs of a formal engagement would be disproportionate to the project size and the benefits obtained. Even then, there are those who ask us to go ahead and help de-risk their purchase despite the expense. This might occur when an organization is undergoing some type of transition, or is particularly new to HPC. If you were spending many $100k and into the millions, then we would be confident we could add clear value. At the other end of the scale, if you were spending anything over $10M, we’d say augmenting your team with external specialists – whether NAG or others – should be expected.”
Everything being right is okay too
I wondered how organizations might feel spending money on a review if the result shows that the company was doing everything right anyway. Andrew told me “Well, I haven’t yet found such a perfect organization, and if someone claimed they were top of the game at everything, I’d probably look at that self-evaluation with a generous dose of skepticism! More seriously, we have, on rare occasions, encountered companies who on initial inspection are ‘good enough’ or ‘leading’ in almost all aspects, so we have advised them on better ways to spend their money than on a deeper review by us — that integrity thing again.
“However, in the many cases where we find that most things are done well, or that no critical changes are needed – this is still valuable information to that organization. Having someone verify you are doing well is meaningfully different to believing you are doing well. It is why we grow up with exams and certifications in our personal lives. It helps frame the baseline for future investment and provide a qualified success story to stakeholders.”
In the end, it is all about being competitive. Regular HPC operational reviews, honest checks on direction including the balance of in-house and in-the-cloud, budget split between hardware, software and people, technology planning, understanding total cost of ownership, timing for competitiveness, and holistic assessments of benchmarks are all critical elements when one seeks to use HPC capabilities as a competitive advantage. In this article, I shared some insights on the practices of successful HPC leaders leading up to procurements. In a future article, I will talk about techniques to increase post-procurement success as well.
Andrew Jones, quoted above, is VP of HPC Consulting at NAG, a not-for-profit Center-of-Excellence in the technical and business aspects of HPC and numerical software engineering. He will be co-tutoring a series of tutorials at SC17 on HPC investment planning and procurement, which cover the issues described here in more detail. Andrew also has an active social media presence on these topics, under his ‘@hpcnotes’ handle.