What did the cynical and curious @hpcnotes spot in the TOP500 list?
Stuffing the list
The TOP500 list is an intensely valuable tool for the HPC community, tracking aggregate trends over 25 years. However, a few observers have noted that recent publications of the TOP500 list have many duplicate entries, often at anonymous sites.
Let’s park the debate on what is a legitimate HPC system or not for now and assume that any system that has completed a High Performance Linpack (HPL) run is a fair entry on the list.
But, there are 124 entries on the list that are identical copies of other entries. In other words, a single HPL run has been done on one system, and the vendor has said “since we have sold N copies of that machine, we can submit N entries on the list.”
What happens to the list statistics if we delete all the duplicate systems?
The list of 500 reduces to 376 entries.
The biggest change is Lenovo, dropping from 117 entries to just 61 – yes, there are 56 duplicate entries from Lenovo! HPE drops from 79 to 62 entries and retakes the top spot for greatest share of the list with 16.5 percent. Lenovo drops to second place with 16.2 percent share.
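For readers who want to check the arithmetic, here is a minimal sketch in Python. It uses only the counts quoted above; the variable and function names are my own, and it does not read the actual TOP500 data.

```python
# Counts quoted in the article for the June 2018 list.
TOTAL_ENTRIES = 500
DUPLICATE_ENTRIES = 124

# Entries per vendor before and after removing duplicate submissions.
entries_before = {"Lenovo": 117, "HPE": 79}
entries_after = {"Lenovo": 61, "HPE": 62}


def share(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


# Size of the list once every duplicate entry is removed.
deduplicated_total = TOTAL_ENTRIES - DUPLICATE_ENTRIES

# Vendor share of the full list and of the deduplicated list.
shares_before = {v: share(n, TOTAL_ENTRIES) for v, n in entries_before.items()}
shares_after = {v: share(n, deduplicated_total) for v, n in entries_after.items()}

# Number of duplicate entries removed per vendor.
removed = {v: entries_before[v] - entries_after[v] for v in entries_before}

print("Deduplicated list size:", deduplicated_total)
for vendor in entries_before:
    print(vendor, shares_before[vendor], "->", shares_after[vendor],
          "percent, duplicates removed:", removed[vendor])
```

Note that the share after deduplication is taken against the 376 remaining entries, not the original 500, which is why HPE's share rises even though it also loses entries.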
Does this matter? Well, it probably does to Lenovo, who chose to submit many copies, and HPE, who probably sold many copies but chose not to submit the duplicates. And ultimately it matters to their market share PR.
For the rest of us, it comes down to what the list is about. If it is supposed to list the 500 fastest supercomputers in the world, clearly it doesn’t do that as many supercomputer owners choose not to acknowledge their systems. Is it the list of known supercomputers? No, because several known supercomputers are not listed, for example, Blue Waters. So, it can only be a list of acknowledged HPL runs, which would suggest that the “N copies” approach is wrong.
However, it isn’t as simple as that. If the list is for tracking vendor/technology market share, then list stuffing is fine – even needed. If the list is for tracking adoption of technologies, vendor wins, the fortunes of HPC sites, and progress of nations, then I’d argue that stuffing in this way breaks the usefulness of the list.
The comparison of progress of nations is also affected. Do we measure how many systems are deployed? Or who has the biggest system? I think the list stuffing is less critical here, as we can readily extract both trends.
I’m not sure there is a right or wrong answer to this but, as always, the headline statistics of market share only tell one side of the story, and users of the list must dig deeper for appropriate insight.
Another value of the list over the years has been the ability to track who is buying supercomputers and what they are using them for.
In the June 2018 list, 97 percent of the systems do not specify an application area. This suggests this categorization is essentially meaningless. Either drop it from future lists or require future systems to identify an application area.
But, beyond that, the June 2018 list has 283 (!) anonymous entries. That is over half of the list where we do not know the company or site that has deployed the system nor, in most cases, what it is being used for.
The big question is: does that render meaningless the ability to track the who and what of HPC, or would we be in a worse position if we excluded anonymous systems?
There are maybe 238 systems that can be lumped into cloud / IT service / web hosting companies, and it is the sheer quantity of these that provides the potentially unhelpful distortion.
The remaining 45 arguably represent real HPC deployments and have enough categorization (e.g., “Energy Company”) to be useful. Useful guesses can even be made for some of these anonymous systems. For example, one might guess that “Energy Company (A)” located in Italy is actually Eni. The 45 systems are a mix of energy, manufacturing, government, or finance sectors. Most interestingly, a couple of university systems are listed anonymously too!
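The split of the anonymous entries can be checked the same way. This is a rough sketch using only the counts quoted above; the names are mine, and the 238 figure is an estimate rather than an exact count.

```python
# Counts quoted in the article for the June 2018 list.
TOTAL_ENTRIES = 500
ANONYMOUS_ENTRIES = 283
CLOUD_LIKE_ENTRIES = 238  # "maybe 238" in the text, so treat as approximate


def pct(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Anonymous entries that still carry a usable category such as "Energy Company".
categorised_entries = ANONYMOUS_ENTRIES - CLOUD_LIKE_ENTRIES

anonymous_share = pct(ANONYMOUS_ENTRIES, TOTAL_ENTRIES)
cloud_share_of_anonymous = pct(CLOUD_LIKE_ENTRIES, ANONYMOUS_ENTRIES)
categorised_share_of_list = pct(categorised_entries, TOTAL_ENTRIES)

print("Anonymous share of the list (percent):", anonymous_share)
print("Cloud-like share of anonymous entries (percent):", cloud_share_of_anonymous)
print("Categorised anonymous entries:", categorised_entries)
print("Categorised anonymous share of the list (percent):", categorised_share_of_list)
```

Separating the two groups this way makes the point concrete: most of the anonymity comes from the cloud and hosting bulk, while the categorised remainder is a small slice of the full list.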
Supercomputers in Industry
Of the 284 systems listed as “Industry,” only 16 systems actually have named owners that aren’t cloud providers (the other 268 are mostly the anonymous companies and “stuffing” discussed above). In fact, due to multiple systems per site, we actually only have a very small number of listed companies. These are Eni, Total, PGS, Saudi Aramco, BASF, EDF, Volvo and the vendors Intel, NVIDIA, and PEZY.
I assure you there are many more supercomputers out there in industry than this. I intend to explore the TOP500 trends, along with lessons from my impartial HPC consulting work, of HPC systems in industry for a future article. Keep an eye out for it after ISC18.
I have previously said that the HPC community is lucky to have the TOP500 list. It is a data set gathered in a consistent manner twice per year for 25 years, with each entry recording many attributes – vendor, technology details, performance, location, year of entry, etc. This is a hugely rich resource for our community, and the TOP500 authors have done an amazing job over 25 years to keep the list useful as the world of HPC has changed enormously. I have high confidence in the authors successfully addressing the challenges listed here and keeping the list as an invaluable resource for years to come.
Andrew Jones can be contacted via Twitter (@hpcnotes), via LinkedIn (https://www.linkedin.com/in/andrewjones/), or via the TOP500 editor.