This section contains frequently asked questions about the TOP500 project and list. It is still at a very early stage, and more questions and answers will be added shortly. If you have any suggestions, please let us know.
What is the TOP500?
The TOP500 lists the 500 fastest computer systems in use today. The collection was started in 1993 and has been updated twice a year since June of that year. The report lists the sites that have the 500 most powerful computer systems installed. The best Linpack benchmark performance achieved is used as the performance measure for ranking the computers.
The Linpack Benchmark
What is the Linpack Benchmark?
The Linpack Benchmark is a measure of a computer’s floating-point rate of execution. It is determined by running a computer program that solves a dense system of linear equations. Over the years the characteristics of the benchmark have changed a bit. In fact, there are three benchmarks included in the Linpack Benchmark report.
The Linpack Benchmark grew out of the Linpack software project. It was originally intended to give users of the package a feeling for how long it would take to solve certain matrix problems. The benchmark started as an appendix to the Linpack Users' Guide and has grown since the guide was published in 1979.
What is the Linpack Benchmark report?
The Linpack Benchmark report is entitled “Performance of Various Computers Using Standard Linear Equations Software”. The report lists the performance in Mflop/s of a number of computer systems. A copy of the report is available at http://www.netlib.org/benchmark/performance.ps.
What is the reference for the Linpack Benchmark Report?
The Linpack Benchmark report should be referenced in the following way:
“Performance of Various Computers Using Standard Linear Equations Software”, Jack Dongarra, University of Tennessee, Knoxville TN, 37996, Computer Science Technical Report Number CS - 89 – 85, today’s date, url:http://www.netlib.org/benchmark/performance.ps.
Is there a paper which describes the benchmark in some detail and gives a historical perspective?
The paper “The LINPACK Benchmark: Past, Present, and Future” by Jack Dongarra, Piotr Luszczek, and Antoine Petitet looks at the details of the benchmark and gives performance data in graphical form for a number of machines on basic operations. A copy of the paper is available at http://www.netlib.org/utk/people/JackDongarra/PAPERS/hpl.pdf.
What is a Mflop/s?
Mflop/s is a rate of execution: millions of floating point operations per second. Whenever this term is used it refers to 64 bit floating point operations, and the operations are either additions or multiplications. Gflop/s refers to billions of floating point operations per second and Tflop/s refers to trillions of floating point operations per second.
What is the theoretical peak performance?
The theoretical peak is based not on an actual performance from a benchmark run, but on a paper computation to determine the theoretical peak rate of execution of floating point operations for the machine. This is the number manufacturers often cite; it represents an upper bound on performance. That is, the manufacturer guarantees that programs will not exceed this rate, sort of a "speed of light" for a given computer. The theoretical peak performance is determined by counting the number of floating-point additions and multiplications (in full precision) that can be completed during a period of time, usually the cycle time of the machine. For example, an Intel Itanium 2 at 1.5 GHz can complete 4 floating point operations per cycle, for a theoretical peak performance of 6 Gflop/s.
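The paper computation described above can be sketched in a few lines of Python. The function name and the `processors` parameter are illustrative assumptions, not part of any benchmark software; the numbers are the Itanium 2 figures quoted in the text.

```python
def theoretical_peak_gflops(clock_ghz, flops_per_cycle, processors=1):
    """Paper computation: clock rate x operations per cycle x processors.

    No benchmark is run; this is only the upper bound on the rate.
    """
    return clock_ghz * flops_per_cycle * processors


# Itanium 2 example from the text: 1.5 GHz, 4 floating point operations per cycle.
peak = theoretical_peak_gflops(1.5, 4)
print(f"Theoretical peak: {peak} Gflop/s")
```

A measured Linpack result for the same machine would be compared against this number, and would be expected to fall below it.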
What are the three benchmarks in the Linpack Benchmark report?
The three benchmarks in the Linpack Benchmark report are the Linpack Fortran n = 100 benchmark (see Table 1 of the report), the Linpack n = 1000 benchmark (see Table 1 of the report), and Linpack’s Highly Parallel Computing benchmark (see Table 3 of the report).
What is the Linpack Fortran n = 100 benchmark?
The first benchmark is for a matrix of order 100 using the Linpack software in Fortran. The results can be found in Table 1 of the benchmark report. To run this benchmark, download the file from http://www.netlib.org/benchmark/Linpackd; this is a Fortran program. To run the program you will need to supply a timing function called SECOND, which should report the CPU time that has elapsed. The ground rules for running this benchmark are that you can make no changes to the Fortran code, not even to the comments. Only compiler optimization can be used to enhance performance.
What exactly does the Linpack Fortran n=100 benchmark time?
The Linpack benchmark measures the performance of two routines from the Linpack collection of software. These routines are DGEFA and DGESL (these are double-precision versions; SGEFA and SGESL are their single-precision counterparts). DGEFA performs the LU decomposition with partial pivoting, and DGESL uses that decomposition to solve the given system of linear equations.
Most of the time is spent in DGEFA. Once the matrix has been decomposed, DGESL is used to find the solution; this process requires O(n^2) floating-point operations, as opposed to the O(n^3) floating-point operations of DGEFA. The results for this benchmark can be found in Table 1, second column, under “LINPACK Benchmark n = 100” of the Linpack Benchmark Report.
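To make the split between the two phases concrete, here is a minimal pure-Python sketch of LU decomposition with partial pivoting followed by a solve. It is not the Linpack Fortran code: the names `lu_factor` and `lu_solve` are hypothetical, and the storage layout (full row interchanges, multipliers kept in the lower triangle) is an assumption chosen for brevity.

```python
def lu_factor(a):
    """Factor the square matrix a (list of row lists) in place.

    Plays the role of DGEFA: three nested loops, O(n^3) work.
    Returns the list of pivot row indices.
    """
    n = len(a)
    piv = []
    for k in range(n):
        # Partial pivoting: choose the row with the largest entry in column k.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[p] = a[p], a[k]
        piv.append(p)
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]  # multiplier, stored in the lower triangle
            m = a[i][k]
            for j in range(k + 1, n):
                a[i][j] -= m * a[k][j]
    return piv


def lu_solve(a, piv, b):
    """Solve A x = b from the factors; the analogue of DGESL, O(n^2) work."""
    n = len(a)
    x = list(b)
    # Apply the recorded row interchanges to the right-hand side.
    for k in range(n):
        x[k], x[piv[k]] = x[piv[k]], x[k]
    # Forward substitution with the unit lower triangle.
    for k in range(n):
        for i in range(k + 1, n):
            x[i] -= a[i][k] * x[k]
    # Back substitution with the upper triangle.
    for k in range(n - 1, -1, -1):
        x[k] /= a[k][k]
        for i in range(k):
            x[i] -= a[i][k] * x[k]
    return x


A = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
b = [4.0, 10.0, 24.0]  # chosen so that the exact solution is all ones
pivots = lu_factor(A)
solution = lu_solve(A, pivots, b)
print(solution)
```

The factorization has three nested loops and the solve only two, which is why most of the benchmark time is spent in the factorization.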
What is the Linpack n = 1000 benchmark (TPP, Best Effort)?
The second benchmark is for a matrix of size 1000 and can be found in Table 1 of the benchmark report. To run this benchmark, download the file from http://www.netlib.org/benchmark/1000d; this is a Fortran driver. The ground rules for running this benchmark are a bit more relaxed in that you can specify any linear equation solver you wish, implemented in any language. A requirement is that your method must compute a solution and the solution must be accurate to the prescribed accuracy. TPP stands for Toward Peak Performance; this is the title of the column in the benchmark report that lists the results.
Why are my performance results below the theoretical peak?
The performance of a computer is a complicated issue, a function of many interrelated quantities. These quantities include the application, the algorithm, the size of the problem, the high-level language, the implementation, the human level of effort used to optimize the program, the compiler's ability to optimize, the age of the compiler, the operating system, the architecture of the computer, and the hardware characteristics. The results presented for this benchmark suite should not be extolled as measures of total system performance (unless enough analysis has been performed to indicate a reliable correlation of the benchmarks to the workload of interest) but, rather, as reference points for further evaluations.
Why are the performance results for my computer different from the same machine’s results in the Linpack Report?
There are many reasons why your results may vary from results recorded in the Linpack Benchmark Report. Issues such as load on the system, accuracy of the clock, compiler options, version of the compiler, size of cache, bandwidth from memory, amount of memory, etc. can affect the performance even when the processors are the same.
What is the Linpack’s “Highly Parallel Computing” benchmark?
The third benchmark is called the Highly Parallel Computing Benchmark and can be found in Table 3 of the Benchmark Report. (This is the benchmark used for the Top500 report.) This benchmark attempts to measure the best performance of a machine in solving a system of equations. The problem size and software can be chosen to produce the best performance.
What are the ground rules for the first benchmark?
The “ground rules” for running the first benchmark in the report, the n = 100 case, are that the program is run as is, with no changes to the source code; not even changes to the comments are allowed. The compiler may perform optimization at compile time through compiler switches. The user must supply a timing function called SECOND, which returns the running CPU time for the process. The matrix generated by the benchmark program must be used to run this case.
What are the ground rules for the second benchmark?
The “ground rules” for running the second benchmark in the report, the n = 1000 case, allow for a complete user replacement of the LU factorization and solver steps. The calling sequence should be the same as that of the original routines. The problem size should be of order 1000. The accuracy of the solution must satisfy the following bound:
||Ax − b|| / (||A|| ||x|| n ε) ≤ O(1),
where ε is the machine precision (on IEEE machines this is 2^-53) and n is the size of the problem. The matrix used must be the same matrix used in the driver program available from netlib.
What are the ground rules for the third benchmark?
The “ground rules” for running the third benchmark in the report, the Highly Parallel case, allow for a complete user replacement of the LU factorization and solver steps. The accuracy of the solution must satisfy the following bound:
||Ax − b|| / (||A|| ||x|| n ε) ≤ O(1),
where ε is the machine precision (on IEEE machines this is 2^-53) and n is the size of the problem. The matrix used must be the same matrix used in the driver program available from netlib. There is no restriction on the problem size.
To what accuracy must the solution conform?
The solution to all three benchmarks must satisfy the following mathematical formula:
||Ax − b|| / (||A|| ||x|| n ε) ≤ O(1),
where ε is the machine precision (on IEEE machines this is 2^-53) and n is the size of the problem. This implies the computation must be done in 64 bit floating point arithmetic.
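As an illustration of how such an accuracy check might be carried out, the sketch below computes a normalized residual of the form ||Ax − b|| / (||A|| ||x|| n ε). The exact form of the quotient and the choice of infinity norms are assumptions made for this example; the function name is hypothetical, and the value 2^-53 for ε is the IEEE figure quoted in the text.

```python
def normalized_residual(a, x, b, eps=2.0 ** -53):
    """Return ||Ax - b|| / (||A|| ||x|| n eps) using infinity norms.

    a is a list of row lists, x and b are lists of floats.
    The norm choice is an assumption made for this sketch.
    """
    n = len(a)
    resid = max(
        abs(sum(a[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n)
    )
    norm_a = max(sum(abs(v) for v in row) for row in a)  # max absolute row sum
    norm_x = max(abs(v) for v in x)
    return resid / (norm_a * norm_x * n * eps)


A = [[2.0, 1.0], [1.0, 3.0]]
b = [3.0, 4.0]

good = normalized_residual(A, [1.0, 1.0], b)    # exact solution
bad = normalized_residual(A, [1.0, 1.001], b)   # solution with a visible error
print(good, bad)
```

A correct 64 bit solution gives a small value, while a solution that is wrong in the third decimal place gives a value many orders of magnitude larger.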
What numerical precision is required to run the benchmark and gain an entry in the Linpack Benchmark report?
In order to have an entry included in the Linpack Benchmark report the results must be computed using full precision. By full precision we generally mean 64 bit floating point arithmetic or higher. Note that this is not an issue of single or double precision as some systems have 64-bit floating point arithmetic as single precision. It is a function of the arithmetic used.
Can I get a more personalized list of machine and performance results?
What do I do to run the Linpack Benchmark Program?
For the 100x100 Fortran version, you need to supply a timing function called SECOND. SECOND is a timer function that is called from Fortran and is expected to return the running CPU time in seconds. In the program two calls to SECOND are made and the difference is taken to obtain the time.
How does the Linpack Benchmark performance relate to my application?
The performance of the Linpack benchmark is typical for applications where the basic operation is based on vector primitives such as adding a scalar multiple of a vector to another vector. Many applications exhibit the same performance as the Linpack Benchmark. However, results should not be taken too seriously. In order to measure the performance of any computer it’s critical to probe for the performance of your applications. The Linpack Benchmark can only give one point of reference. In addition, in multiprogramming environments it is often difficult to reliably measure the execution time of a single program. We trust that anyone actually evaluating machines and operating systems will gather more reliable and more representative data.
Are there errors in the Linpack Benchmark report?
While we make every attempt to verify the results obtained from users and vendors, errors are bound to exist and should be brought to our attention. We encourage users to obtain the programs and run the routines on their machines, reporting any discrepancies with the numbers listed here.
What is Linpack?
The Linpack package is a collection of Fortran subroutines for solving various systems of linear equations (http://www.netlib.org/Linpack/). The software in Linpack is based on a decompositional approach to numerical linear algebra. The general idea is the following: given a problem involving a matrix, one factors or decomposes the matrix into a product of simple, well-structured matrices which can be easily manipulated to solve the original problem. The package has the capability of handling many different matrix types and different data types, and provides a range of options. Linpack itself is built on another package called the BLAS. Linpack was designed in the late 1970s and has been superseded by a package called LAPACK.
How can I get the complete Linpack software collection?
The Linpack software library is available from netlib. See http://www.netlib.org/Linpack/
What are the BLAS?
The BLAS (Basic Linear Algebra Subprograms) are high quality "building block" routines for performing basic vector and matrix operations. Level 1 BLAS do vector-vector operations, Level 2 BLAS do matrix-vector operations, and Level 3 BLAS do matrix-matrix operations. Because the BLAS are efficient, portable, and widely available, they're commonly used in the development of high quality linear algebra software, LINPACK and LAPACK for example. For additional information see: http://www.netlib.org/blas/
Where can I get an optimized version of the BLAS?
The ATLAS (Automatically Tuned Linear Algebra Software) project is an ongoing research effort focusing on applying empirical techniques in order to provide portable performance for the BLAS routines. At present, it provides C and Fortran77 interfaces to a portably efficient BLAS implementation, as well as a few routines from LAPACK. For additional information see: http://www.netlib.org/atlas/
Is Linpack the most efficient way to solve systems of equations?
Linpack is not the most efficient software for solving matrix problems. This is mainly due to the way the algorithm and resulting software access memory. The memory access patterns of the algorithm disregard the multi-layered memory hierarchies of RISC architectures and vector computers, so too much time is spent moving data instead of doing useful floating-point operations. LAPACK addresses this problem by reorganizing the algorithms to use block matrix operations, such as matrix multiplication, in the innermost loops. For each computer architecture, block operations can be optimized to account for memory hierarchies, providing a transportable way to achieve high efficiency on diverse modern machines. We use the term “transportable” instead of “portable” because, for fastest possible performance, LAPACK requires that highly optimized block matrix operations be already implemented on each machine. These operations are performed by the Level 3 BLAS in most cases.
What is LAPACK?
LAPACK is a software collection for solving various matrix problems in linear algebra, in particular systems of linear equations, least squares problems, eigenvalue problems, and singular value decompositions. The software is based on the use of block partitioned matrix techniques that aid in achieving high performance on RISC based systems, vector computers, and shared memory parallel processors.
How can I get the whole LAPACK software collection?
LAPACK can be obtained from netlib. See http://www.netlib.org/lapack/
What is the history behind the Linpack Benchmark?
The Linpack Benchmark is, in some sense, an accident. It was originally designed to assist users of the Linpack package by providing information on execution times required to solve a system of linear equations. The first “Linpack Benchmark” report appeared as an appendix in the Linpack Users' Guide in 1979. The appendix comprised data for one commonly used path in Linpack for a matrix problem of size 100, on a collection of widely used computers (23 in all), so users could estimate the time required to solve their matrix problem.
Over the years other data was added, more as a hobby than anything else, and today the collection includes hundreds of different computer systems.
How can I add my computer's result to the table?
You can contact Jack Dongarra and send him the output from the benchmark program. When sending results please include the specific information on the computer on which the test was run, the compiler, the optimization that was used, and the site it was run on. You can contact Dongarra by sending email to email@example.com.
What is the SECOND function?
In order to run the benchmark program you will have to supply a function to gather the execution time on your computer. The execution time is requested by a call to the Fortran function SECOND. The routine is expected to return the accumulated execution time of your program. Two calls to SECOND are made and the difference is taken to compute the execution time.
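The two-call pattern can be sketched in Python as follows. This is only an analogue of the Fortran arrangement: `second` and `time_call` are hypothetical names, and `time.process_time()` is used as the stand-in for an accumulated CPU time counter.

```python
import time


def second():
    """Stand-in for SECOND: accumulated CPU time of this process, in seconds."""
    return time.process_time()


def time_call(work):
    """Call second() before and after the work and return the difference."""
    t0 = second()
    work()
    t1 = second()
    return t1 - t0


elapsed = time_call(lambda: sum(i * i for i in range(200_000)))
print(f"CPU time: {elapsed:.6f} s")
```

If the work is too short for the timer's resolution, the difference can come out as zero, in which case a timer with finer resolution is needed.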
How can I measure the execution time more accurately and reliably?
The Performance API (PAPI) project specifies a standard application programming interface (API) for accessing hardware performance counters available on most modern microprocessors. These counters exist as a small set of registers that count Events, occurrences of specific signals related to the processor's function. Monitoring these events facilitates correlation between the structure of source/object code and the efficiency of the mapping of that code to the underlying architecture.
Should I run the single and double precision of the benchmarks?
The results reported in the benchmark report reflect performance for 64 bit floating point arithmetic. On some machines this may be DOUBLE PRECISION, such as computers that have IEEE floating point arithmetic, and on other computers this may be single precision (declared REAL in Fortran), such as Cray’s vector computers.
How can I interpret the results from the benchmark?
When and how often are the results updated in the benchmark report?
The benchmark report is updated continuously as new results arrive. They are posted to the web as they are updated.
What matrix is used to run the benchmark?
The matrices are generated using a pseudo-random number generator. The matrices are designed to force partial pivoting to be performed in Gaussian Elimination.
What is HPL?
HPL is a software package that solves a (random) dense linear system in double precision (64 bit) arithmetic on distributed-memory computers. It can thus be regarded as a portable as well as freely available implementation of the High Performance Computing Linpack Benchmark.
For HPL, what problem size N should I run?
In order to find out the best performance of your system, the largest problem size fitting in memory is what you should aim for. The amount of memory used by HPL is essentially the size of the coefficient matrix. So for example, if you have 4 nodes with 256 MB of memory on each, this corresponds to 1 GB total, i.e., about 134 million double precision (8 byte) elements. The square root of that number is 11585. One definitely needs to leave some memory for the OS as well as for other things, so a problem size of 10000 is likely to fit. As a rule of thumb, 80% of the total amount of memory is a good guess. If the problem size you pick is too large, swapping will occur, and the performance will drop. If multiple processes are spawned on each node (say you have 2 processors per node), what counts is the amount of memory available to each process.
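The arithmetic in this answer can be written out as a short Python sketch. The function name and parameters are illustrative assumptions and are not part of HPL; memory sizes are taken in binary units (1 MB = 1024 x 1024 bytes), which is what reproduces the 11585 figure.

```python
import math


def hpl_problem_size(nodes, bytes_per_node, fraction=0.8, bytes_per_element=8):
    """Largest N such that an N x N matrix of 8 byte elements fits in the
    given fraction of the total memory."""
    total_elements = nodes * bytes_per_node * fraction / bytes_per_element
    return math.isqrt(int(total_elements))


MB = 1024 ** 2

# Upper limit if the matrix could occupy all of memory.
n_all = hpl_problem_size(4, 256 * MB, fraction=1.0)

# Rule of thumb from the text: use about 80% of the total memory.
n_rule = hpl_problem_size(4, 256 * MB)

print(n_all, n_rule)
```

In practice the value would usually be rounded down further, for example to a convenient multiple of the block size.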
For HPL, what block size NB should I use?
HPL uses the block size NB for the data distribution as well as for the computational granularity. From a data distribution point of view, the smaller NB, the better the load balance. You definitely want to stay away from very large values of NB. From a computation point of view, too small a value of NB may limit the computational performance by a large factor because almost no data reuse will occur in the highest level of the memory hierarchy. The number of messages will also increase. Efficient matrix-multiply routines are often internally blocked. Small multiples of this blocking factor are likely to be good block sizes for HPL. The bottom line is that "good" block sizes are almost always in the [32 .. 256] interval. The best values depend on the computation / communication performance ratio of your system. To a much lesser extent, the problem size matters as well. Say, for example, you empirically found that 44 was a good block size with respect to performance. 88 or 132 are likely to give slightly better results for large problem sizes because of a slightly higher flop rate.
For HPL, what process grid ratio P x Q should I use?
This depends on the physical interconnection network you have. Assuming a mesh or a switch, HPL "likes" a 1:k ratio with k in [1..3]. In other words, P and Q should be approximately equal, with Q slightly larger than P. Examples: 2 x 2, 2 x 4, 2 x 5, 3 x 4, 4 x 4, 4 x 6, 5 x 6, 4 x 8 ... If you are running on a simple Ethernet network, there is only one wire through which all the messages are exchanged. On such a network, the performance and scalability of HPL are strongly limited and very flat process grids are likely to be the best choices: 1 x 4, 1 x 8, 2 x 4 ...
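For a mesh or switched network, the candidate grids for a given number of processes can be enumerated with a small helper. This is a sketch under the 1:k guideline above; the function name and the `max_ratio` parameter are assumptions introduced here, not HPL options.

```python
import math


def process_grids(nprocs, max_ratio=3):
    """Return P x Q factorizations of nprocs with P <= Q <= max_ratio * P,
    ordered from the most square grid to the flattest."""
    grids = []
    for p in range(1, math.isqrt(nprocs) + 1):
        if nprocs % p == 0:
            q = nprocs // p
            if q <= max_ratio * p:
                grids.append((p, q))
    return sorted(grids, key=lambda g: g[1] / g[0])


for n in (8, 16, 24):
    print(n, process_grids(n))
```

The list only narrows the search; which of the candidates performs best still has to be determined empirically on the system in question.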
For HPL, what about the one processor case?
HPL has been designed to perform well for large problem sizes on hundreds of nodes and more. The software works on one node and for large problem sizes, one can usually achieve pretty good performance on a single processor as well. For small problem sizes however, the overhead due to message-passing, local indexing and so on can be significant.
For HPL, why so many options in HPL.dat?
There are quite a few reasons. First off, these options are useful to determine what matters and what does not on your system. Second, HPL is often used in the context of early evaluation of new systems. In such a case, everything is usually not quite working right, and it is convenient to be able to vary these parameters without recompiling. Finally, every system has its own peculiarities and one is likely to be willing to empirically determine the best set of parameters. In any case, one can always follow the advice provided in the tuning section of the HPL document and not worry about the complexity of the input file.
Can HPL be outperformed?
Certainly. There is always room for performance improvements. Specific knowledge about a particular system is always a source of performance gains. Even from a generic point of view, better algorithms or more efficient formulation of the classic ones are potential winners.
Can I use Strassen’s Method when doing the matrix multiplies in the HPL benchmark or for the Top500 run?
The normal matrix multiplication algorithm requires n^3 + O(n^2) multiplications and about the same number of additions. Strassen's algorithm reduces the total number of operations to O(n^2.81) by recursively multiplying 2n × 2n matrices using seven n × n matrix multiplications. Because the reported execution rate is computed from the standard operation count, using Strassen’s Algorithm would distort the true execution rate. As a result we do not allow Strassen’s Algorithm to be used for the TOP500 reporting. As a side note, in the "usual" matrix multiplication, we have an n^2 error term. In Strassen's method, the error exponent p in n^p ranges from 2 to 3.85, and the numerical error can be 10 to 100 times greater than that for standard multiplication.
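The gap between the two multiplication counts can be illustrated with a short Python sketch. It counts scalar multiplications only, assumes n is a power of two, and assumes the recursion is carried all the way down to 1 x 1 blocks; the function names are hypothetical.

```python
import math


def standard_multiplications(n):
    """Scalar multiplications in the usual n x n matrix product."""
    return n ** 3


def strassen_multiplications(n):
    """Scalar multiplications when every level replaces eight half-size
    products with seven (n must be a power of two)."""
    if n == 1:
        return 1
    return 7 * strassen_multiplications(n // 2)


n = 1024
ratio = standard_multiplications(n) / strassen_multiplications(n)
print(f"exponent log2(7) = {math.log2(7):.3f}")
print(f"n = {n}: standard / Strassen multiplication count = {ratio:.2f}")
```

A rate obtained by dividing the standard operation count by the run time of a Strassen-based solver would therefore overstate the number of operations actually performed.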
Where can I get the software to generate performance results for the Top500?
Optimized software is available that many people use to generate the Top500 performance results. This benchmark attempts to measure the best performance of a machine in solving a system of equations. The problem size and software can be chosen to produce the best performance. A copy of that software can be downloaded from:
Why would a machine appear in the Linpack Benchmark report but not in the Top500 list?
There could be two reasons. First, the Linpack Benchmark report contains historic information. Even if a computer is no longer in existence it can appear in the Linpack Benchmark report. This is unlike the Top500, which reports the 500 fastest computers in existence at a given point in time. The second reason is that the Top500 list comes out twice a year, while the Linpack Benchmark report is updated continuously.
Why would a machine appear in the Top500 list and not in the Linpack Benchmark report?
If a machine is in the Top500 list it should appear in the Linpack Benchmark report. If you see an instance where this is not the case, it is probably a mistake; please send email to Jack Dongarra (firstname.lastname@example.org) about the situation.
How can I interpret the results from the Linpack 100x100 benchmark?
When the Linpack Fortran n = 100 benchmark is run it produces the following kind of results:
The norm. resid is a measure of the accuracy of the computation. The value should be O(1). If the value is much greater than O(100), it suggests that the results are not correct.
The resid is the unnormalized quantity.
The term machep measures the precision used to carry out the computation. On an IEEE floating point computer the value should be 2.22044605e-16.
The values of x(1) and x(n) are the first and last components of the solution. The problem is constructed so that the values of the solution should all be ones.
There are two sets of timings, both performed on matrices of size 100. In the first set the 2-dimensional array that contains the matrix has a leading dimension of 201, and in the second set the leading dimension is 200. This is done to see what effect, if any, the placement of the arrays in memory has on the performance.
Times for dgefa and dgesl are reported. dgefa factors the matrix using Gaussian elimination with partial pivoting, and dgesl solves a system based on the factorization. dgefa requires 2/3 n^3 operations and dgesl requires on the order of n^2 operations. The value of total is the sum of the times, and mflops is the execution rate in millions of floating point operations per second. Here a floating point operation is taken to be a floating point addition or multiplication. Unit and ratio are obsolete and should be ignored.
If the time reported is negative or zero then the clock resolution is not accurate enough for the granularity of the work. In this case a different timing routine should be used that has better resolution.
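The relation between the reported time and the Mflop/s rate can be sketched as follows. The operation count 2/3 n^3 + 2 n^2 is an assumption here (it is the count conventionally associated with this benchmark), and the function name is hypothetical.

```python
def linpack_mflops(n, seconds):
    """Execution rate in Mflop/s for solving one system of order n.

    Assumes the nominal count of 2/3 n^3 + 2 n^2 additions and
    multiplications for the factorization plus the solve.
    """
    if seconds <= 0.0:
        # A zero or negative time means the clock resolution is too coarse.
        raise ValueError("timer resolution too coarse for this problem size")
    ops = (2.0 * n ** 3) / 3.0 + 2.0 * n ** 2
    return ops / (seconds * 1.0e6)


# Hypothetical timing: the n = 100 problem solved in one millisecond.
rate = linpack_mflops(100, 0.001)
print(f"{rate:.1f} Mflop/s")
```

Because the operation count is fixed by n, the rate depends only on the measured time, which is why an unreliable clock makes the result meaningless.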
Do you have an archive of previous Linpack Benchmark reports or results?
No archive is maintained of previous results. However, here is some information to provide a historical perspective. The numbers in the following tables have been extracted from old Linpack Benchmark Reports. It took a bit of “file archaeology” to put the list together since I don't have the complete set of reports.
Top Computers Over Time for the Linpack n=100 Benchmark
(Entries for this table began in 1979.)
NEC SX-8/1 (1 proc)
Intel Pentium Nocona (1 proc 3.6 GHz)
HP Integrity Server rx2600 (1 proc 1.5GHz)
Intel Pentium 4 (3.06 GHz)
These numbers come from the Linpack Benchmark Report Table 1.
Top Computers Over Time for the Linpack n=1000 Benchmark
(Entries for this table began in 1986.)
Number of Processors
These numbers come from the Linpack Benchmark Report Table 1.
(Full precision; matrix size 1000; best effort programming, maximum optimization permitted.)
Top Computers Over Time for the Highly-Parallel Linpack Benchmark
(Entries for this table began in 1991.)
IBM Blue Gene/L
2002 - 2004
Earth Simulator Computer, NEC
ASCI White-Pacific, IBM SP Power 3
ASCI White-Pacific, IBM SP Power 3
ASCI Red Intel Pentium II Xeon core
ASCI Blue-Pacific SST, IBM SP 604E
Intel ASCI Option Red (200 MHz Pentium Pro)
Intel Paragon XP/S MP
Intel Paragon XP/S MP
These numbers come from the Linpack Benchmark Report Table 3.
(Full precision; the manufacturer is allowed to solve as large a problem as desired, maximum optimization permitted.)
Measured Gflop/s is the measured peak rate of execution for running the benchmark in billions of floating point operations per second.
Size of Problem is the matrix size at which the measured performance was observed.
Size of ½ Perf is the size of problem needed to achieve ½ the measured peak performance.
Theoretical Peak Gflop/s is the theoretical peak performance for the computer.
What is the HPC Challenge benchmark?
The HPC Challenge benchmark consists at this time of 7 benchmarks: HPL, STREAM, RandomAccess, PTRANS, FFTE, DGEMM and b_eff Latency/Bandwidth. HPL is the Linpack TPP benchmark; the test stresses the floating point performance of a system. STREAM measures sustainable memory bandwidth (in GB/s). RandomAccess measures the rate of random updates of memory. PTRANS measures the rate of transfer for large arrays of data from a multiprocessor’s memory. Latency/Bandwidth measures (as the name suggests) latency and bandwidth of communication patterns of increasing complexity between as many nodes as is time-wise feasible.
Where can I get additional information on the HPC Challenge benchmark?
The Linpack Benchmark suite is built around software for dense matrix problems. In May 2000 we started to put together a benchmark for sparse iterative matrix problems. For additional information see: http://www.netlib.org/benchmark/sparsebench/
Where can I get additional information on benchmarks?