News

Aussies Build HPC Cluster/Cloud Hybrid

July 5, 2016
By: Michael Feldman

The University of Melbourne has deployed a new type of HPC system that combines both a physical cluster and a virtual one. Called Spartan, the new machine is built around the idea that there are basically two types of HPC users: the so-called power users, who want lots of compute, memory, and bandwidth for long-running applications; and those with more modest requirements, who need to run a plethora of much smaller jobs. Spartan provides resources aimed at both audiences.

The cluster is based on Dell servers, with switches from Mellanox and Cisco, but Spartan divides the hardware into two partitions. According to the specifications outlined by HPC support engineer Lev Lafayette, the breakdown is as follows:

  • The physical partition is made up of 256 cores of Xeon E5 processors, running at 3.4 GHz. Each node is equipped with 294 GB of memory. The system network uses 25G/56G Ethernet, with Mellanox SN2700 and SN2100 leaf switches, and with 100G connections between nodes. This partition is run as a bare metal resource.
  • The cloud partition is larger, with 1024 cores, but based on the lesser Xeon E3 processors, running at 2.3 GHz. Each node contains 64 GB of memory. The system employs 10G Ethernet as its network, and runs as a virtual resource.

The rationale is that there are many more users with smaller jobs than there are users with big ones. By directing these smaller jobs to the virtual resource, throughput is optimized, and more demanding applications don't get stuck waiting in the job queue for the smaller ones to complete. And presumably, since the partitions are part of a single system, resources can be shared between the two when the mix of big and small jobs doesn't match the partition resources available.
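In a SLURM-managed system like Spartan, steering a job to one partition or the other comes down to the partition named in the batch script. A minimal sketch of what that might look like; the partition names `cloud` and `physical`, the resource numbers, and the job script itself are assumptions for illustration, not Spartan's documented values:

```shell
#!/bin/bash
# Hypothetical batch script for a small job aimed at the virtualized partition.
# Partition names and limits are illustrative, not Spartan's actual settings.
#SBATCH --partition=cloud       # small, single-node jobs go to the virtual resource
#SBATCH --ntasks=1
#SBATCH --mem=4G
#SBATCH --time=01:00:00

srun ./my_small_job             # placeholder application

# A long-running, memory-hungry job would instead request the bare metal
# partition, e.g.:
#   #SBATCH --partition=physical
#   #SBATCH --ntasks=64
#   #SBATCH --mem=128G
```

Submitted with `sbatch`, the scheduler then routes each job to the matching hardware, which is what keeps the small-job traffic from queueing behind the power users.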

According to Lafayette, 96 software applications have been installed on Spartan, along with an array of support tools, compilers, and libraries. On top of that, Spartan runs Linux as the OS, SLURM as the workload manager, and Puppet for configuration management.
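The two-partition split described above would be expressed in the workload manager's configuration. A hedged sketch of what the relevant `slurm.conf` lines could look like; the node names, counts, and time limits here are invented for illustration and are not Spartan's published configuration:

```
# Hypothetical slurm.conf fragment: two partitions over distinct node sets.
# All names and values are illustrative only.
NodeName=spartan-bm[01-08]   CPUs=32 RealMemory=294000   # bare metal E5 nodes
NodeName=spartan-vm[001-128] CPUs=8  RealMemory=64000    # virtualized E3 nodes

PartitionName=physical Nodes=spartan-bm[01-08]   MaxTime=30-00:00:00 State=UP
PartitionName=cloud    Nodes=spartan-vm[001-128] MaxTime=7-00:00:00  State=UP Default=YES
```

Making the cloud partition the default would match the article's premise: the many users with modest jobs land on the virtual resource unless they explicitly ask for bare metal.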

Such system software can be used to advantage when the machine is expanded with additional hardware or when nodes have to be repaired or replaced. In these scenarios, running applications can be moved off the affected nodes without the need for a system shutdown. The flexibility of such a model is a huge advantage to a large university community, where downtime can delay the work of hundreds of researchers.
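With SLURM, taking a node out of service without a full shutdown is typically done by draining it: the node accepts no new work while jobs already running on it are allowed to finish. A sketch using standard `scontrol` commands, with a placeholder node name:

```
# Mark a node for maintenance: running jobs continue, no new jobs are scheduled.
scontrol update NodeName=spartan-vm042 State=DRAIN Reason="hardware replacement"

# After the repair or replacement, return the node to service.
scontrol update NodeName=spartan-vm042 State=RESUME
```

On the virtual partition, live migration of the underlying VMs is another route to the same end, which is presumably part of the flexibility the article alludes to.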

Whether this bare metal/cloud cluster proves to be a better solution than the more typical third-party cloud bursting model remains to be seen, but it certainly provides more in-house control than the latter. And given that these types of users tend to prefer that level of control, Spartan indeed may turn out to be the cloud paradigm for the HPC masses.