The Bull NovaScale
| Machine type | ccNUMA system. |
|---|---|
| Models | NovaScale 5165, 5325. |
| Operating system | Linux, Windows Server 2003, GCOS 8 |
| Connection structure | Full crossbar |
| Compilers | Intel's Fortran 95, C(++) |
| Vendor's information Web page | http://www.bull.com/novascale/ |
| Year of introduction | 2005. |
System parameters:
| Model | NovaScale 5165 | NovaScale 5325 |
|---|---|---|
| Clock cycle | 1.6 GHz | 1.6 GHz |
| Theor. peak performance | 102.4 Gflop/s | 204.8 Gflop/s |
| No. of processors | 4–16 | 8–32 |
| Main Memory | 8–256 GB | 16–512 GB |
| Comm. bandwidth | | |
| Point-to-point | 6.4 GB/s | 6.4 GB/s |
| Aggregate | 12.8 GB/s | 25.6 GB/s |
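The theoretical peak figures in the table can be reproduced from the maximum processor count and the clock frequency. The following is a minimal sketch, assuming 4 floating-point operations per cycle per Itanium 2 processor (two fused multiply-add units); that factor is not stated in the table, and the function name is purely illustrative.

```python
# Assumption (not stated in the table): each Itanium 2 processor completes
# 4 floating-point operations per clock cycle (two fused multiply-adds).
FLOPS_PER_CYCLE = 4


def peak_gflops(processors: int, clock_ghz: float) -> float:
    """Theoretical peak in Gflop/s for a system with `processors` CPUs."""
    return processors * clock_ghz * FLOPS_PER_CYCLE


if __name__ == "__main__":
    # Maximum configurations and clock frequency taken from the table.
    for model, procs in (("NovaScale 5165", 16), ("NovaScale 5325", 32)):
        print(f"{model}: {peak_gflops(procs, 1.6):.1f} Gflop/s")
```

Under this assumption the sketch yields 102.4 and 204.8 Gflop/s for the fully populated 16- and 32-processor systems, matching the table.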
Remarks:
The NovaScale 5005 series is the second generation of Itanium 2-based systems targeting the HPC field. Besides the two models listed under System parameters, it also includes the 5085 and 5245 models, which we do not discuss separately because they are simply models with a maximum of 8 and 24 processors, respectively. The main difference from the first generation, the 5160 and 5230 systems, is the doubling of the packaging density: where the 5230 had to be housed in two 40 U racks, a 5325 system fits in one rack and the 5165 can be housed in a 19 U rack. In almost all other respects the new series is identical to the first generation.
Like their predecessors, the NovaScales are ccNUMA SMPs. They are built from standard Intel Quad Building Blocks (QBBs), each housing 4 Itanium 2 processors and a part of the memory. The QBBs in turn are connected by Bull's proprietary FAME Scalability Switch (FSS), which provides an aggregate bandwidth of 25.6 GB/s. For reliability reasons a NovaScale 5165 is equipped with 2 FSSes. This ensures that when any link between a QBB and a switch or between switches fails, the system remains operational, albeit at a lower level of communication performance. As each FSS has 8 ports and only 6 of these are occupied within a 5165 system, the remaining ports can be used to couple two of these systems, thus making a 32-processor ccNUMA system. Larger configurations can be made by coupling systems via QsNet II (see section QsNet). Bull provides its own MPI implementation, which turns out to be very efficient (see "Measured Performances" below and [43]).
A nice feature of the NovaScale systems is that they can be partitioned such that different nodes run different operating systems, and that repartitioning can be done dynamically. Although this is not particularly enticing for HPC users, it might be interesting for other markets, especially as Bull still has clients that use its proprietary GCOS operating system.
The documentation from Bull states that the systems are "Montecito ready", which is true in the sense that the same socket can be used for the Itanium 2 and its successor, the dual-core Montecito (see the page on the Itanium processor). In fact, Montecito-based systems are expected shortly (around July or August 2006). There is a proviso with respect to replacing Itanium 2 processors with Montecitos: because of their dual cores, the latter are obviously more bandwidth-hungry than the Itanium 2 processors. For that reason Montecitos with a Front Side Bus (FSB) of 10.6 GB/s will become available. However, when one simply exchanges processors, one cannot exploit these higher-bandwidth chips because the original E8870 chipset in the systems has a maximum bandwidth of 6.4 GB/s. The benefit for bandwidth-limited applications would therefore also be limited in this case.
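The bandwidth proviso can be illustrated with a rough back-of-the-envelope sketch. It assumes that the usable bus bandwidth is the minimum of what the processor and the chipset support, and that the four processors of a QBB share that bandwidth evenly. The even-sharing model and the function names are illustrative assumptions, not figures from Bull or Intel.

```python
def effective_fsb_gb_s(processor_fsb: float, chipset_max: float) -> float:
    """Usable bus bandwidth: the slower of processor bus and chipset."""
    return min(processor_fsb, chipset_max)


def per_core_gb_s(bus_gb_s: float, sockets_on_bus: int, cores_per_socket: int) -> float:
    """Even share of the bus per core (a deliberate simplification)."""
    return bus_gb_s / (sockets_on_bus * cores_per_socket)


if __name__ == "__main__":
    # A 10.6 GB/s Montecito dropped into an E8870-based system (6.4 GB/s max).
    bus = effective_fsb_gb_s(processor_fsb=10.6, chipset_max=6.4)
    # Assumption: the 4 sockets of one QBB share the bus.
    itanium2 = per_core_gb_s(bus, sockets_on_bus=4, cores_per_socket=1)
    montecito = per_core_gb_s(bus, sockets_on_bus=4, cores_per_socket=2)
    print(f"usable bus bandwidth: {bus} GB/s")
    print(f"per core, Itanium 2: {itanium2:.1f} GB/s")
    print(f"per core, Montecito: {montecito:.1f} GB/s")
```

Under these assumptions a simple processor swap halves the bandwidth available per core, which is why bandwidth-limited codes would gain little from it.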
Measured Performances:
In the spring of 2004 rather extensive experiments with the EuroBen
Benchmark were performed on a 16-processor NovaScale 5160 with the 1.3 GHz
variant of the processor. The MPI version of a dense matrix-vector multiply
was found to run at 13.3 Gflop/s on 16 processors, while both for solving a
dense linear system of size N = 1,000 and for a 1-D FFT of size N = 65,536
speeds of 3.3–3.4 Gflop/s were observed (see [43]).
For the recently installed Tera-10 system at CEA, France, a Linpack performance
of 42,900 Gflop/s out of 55,705.6 Gflop/s installed is reported, an efficiency
of 77% on a linear system of unknown rank ([45]).
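The quoted efficiency is simply the ratio of the measured Linpack performance to the installed peak performance. A minimal sketch of that arithmetic, with an illustrative function name:

```python
def linpack_efficiency(rmax_gflops: float, rpeak_gflops: float) -> float:
    """Fraction of the installed peak achieved by the Linpack run."""
    return rmax_gflops / rpeak_gflops


if __name__ == "__main__":
    # Tera-10 figures as quoted in the text.
    eff = linpack_efficiency(42_900.0, 55_705.6)
    print(f"{eff:.1%}")  # prints 77.0%
```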