By: TOP500 Team
Will future exascale supercomputers be able to withstand the steady onslaught of routine faults?
As a child, were you ever afraid that a monster lurking in your bedroom would leap out of the dark and get you? My job at Oak Ridge National Laboratory is to worry about a similar monster, hiding in the steel cabinets of the supercomputers and threatening to crash the largest computing machines on the planet.
The monster is something supercomputer specialists call resilience—or rather the lack of resilience. It has bitten several supercomputers in the past. A high-profile example affected what was the second fastest supercomputer in the world in 2002, a machine called ASCI Q at Los Alamos National Laboratory. When it was first installed at the New Mexico lab, this computer couldn’t run more than an hour or so without crashing.
The ASCI Q was built out of AlphaServers, machines originally designed by Digital Equipment Corp. and later sold by Hewlett-Packard Co. The problem was that an address bus on the microprocessors found in those servers was unprotected, meaning that there was no check to make sure the information carried on these within-chip signal lines did not become corrupted. And that’s exactly what was happening when these chips were struck by cosmic radiation, the constant shower of particles that bombard Earth’s atmosphere from outer space.
Read the full article on IEEE Spectrum.