By: Michael Feldman
A team of researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) has devised a 64-core processor design that tackles the major impediment to utilizing multicore chips: programming them. The novel device, which is known as Swarm, incorporates extra circuitry that makes it much easier for programmers to parallelize their applications. At least that’s the claim of the research team.
In a write-up at MIT News, project lead Daniel Sanchez, an assistant professor in MIT’s Department of Electrical Engineering and Computer Science, talks about the design.
“Multicore systems are really hard to program,” said Sanchez. “You have to explicitly divide the work that you’re doing into tasks, and then you need to enforce some synchronization between tasks accessing shared data. What this architecture does, essentially, is to remove all sorts of explicit synchronization, to make parallel programming much easier. There’s an especially hard set of applications that have resisted parallelization for many, many years, and those are the kinds of applications we’ve focused on in this paper.”
Image: Christine Daniloff/MIT
The paper he is referring to is one that appears in the May/June issue of the Institute of Electrical and Electronics Engineers’ journal Micro, in which Swarm is described in greater detail. As far as we can tell, there is no actual hardware yet, so any claims of ease of programming and performance advantages are based on simulations and perhaps the enthusiasm of the researchers.
The MIT team compared six algorithms developed on Swarm against “the best existing parallel versions” devised by elite programmers. According to the researchers, the Swarm versions were 3 to 18 times faster, yet required one-tenth as much code. In one case, using an algorithm that no one had been able to parallelize, Swarm delivered a 75-fold speedup.
How does it manage to do this? Special circuitry in Swarm is devoted to managing the parallel execution of rather small tasks – basically functions, which can consist of just a handful of instructions. A programmer who wants to parallelize a function must designate it as such and assign it a weighted value that corresponds to an execution priority. The hardware applies that priority when it decides on execution order and on which functions can be run in parallel.
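To make the programming model a little more concrete, here is a minimal software sketch of priority-tagged tasks, written in Python. It is not Swarm's actual interface: the names TaskQueue, enqueue, and visit are hypothetical, the shortest-path example is our own choice of illustration, and the convention that a lower number means higher priority is an assumption. The loop also runs tasks one at a time, whereas the point of Swarm is that the hardware would run many of them at once.

```python
import heapq
import itertools


class TaskQueue:
    """Serial stand-in for a priority-ordered task pool (hypothetical API)."""

    def __init__(self):
        self._heap = []
        self._tie = itertools.count()  # keeps equal priorities in FIFO order

    def enqueue(self, priority, fn, *args):
        # Assumption: a lower number means the task should run earlier.
        heapq.heappush(self._heap, (priority, next(self._tie), fn, args))

    def run(self):
        # Pop tasks strictly in priority order; each task may enqueue more.
        while self._heap:
            priority, _, fn, args = heapq.heappop(self._heap)
            fn(self, priority, *args)


def visit(queue, dist, graph, best, node):
    """A tiny task: settle one node, then spawn one task per neighbor."""
    if node in best:
        return  # a higher-priority task already settled this node
    best[node] = dist
    for neighbor, weight in graph[node]:
        queue.enqueue(dist + weight, visit, graph, best, neighbor)


graph = {"a": [("b", 4), ("c", 1)], "c": [("b", 2)], "b": []}
best = {}
q = TaskQueue()
q.enqueue(0, visit, graph, best, "a")
q.run()
print(best)  # {'a': 0, 'c': 1, 'b': 3}
```

Each task here is only a few lines long and carries nothing but a function, its arguments, and a priority, which is roughly the scale of work the article describes.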
That level of granularity for parallelization could increase performance significantly, probably enough to explain the speedup results the researchers are claiming. But the real intelligence in Swarm has to do with how it manages the parallel tasks, without the help of software task management constructs.
In particular, memory access conflicts are handled on-chip: the hardware tracks every running task and records a timestamp for each memory address being written to. When two tasks access a particular address out of order, the hardware undoes some of the execution to resynchronize the tasks. For example, if a lower-priority task writes a data item before a higher-priority task has read the old value, the lower-priority task’s write is backed out so that the higher-priority task reads the correct value.
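The write-then-undo scenario described above can be mimicked in a few lines of Python. This is a toy model under stated assumptions, not the Swarm design: the class SpeculativeMemory and its methods are hypothetical, a smaller timestamp is assumed to mean higher priority, and the sketch tracks at most one uncommitted write per address.

```python
class SpeculativeMemory:
    """Toy model of timestamped writes with rollback (hypothetical names)."""

    def __init__(self, initial):
        self.data = dict(initial)
        self.pending = {}   # address -> (writer timestamp, value before the write)
        self.aborted = []   # timestamps of tasks whose writes were backed out

    def write(self, timestamp, addr, value):
        if addr in self.pending:
            # Simplification: this sketch keeps a single undo record per address.
            raise NotImplementedError("one uncommitted write per address")
        self.pending[addr] = (timestamp, self.data[addr])
        self.data[addr] = value

    def read(self, timestamp, addr):
        record = self.pending.get(addr)
        # Assumption: a smaller timestamp means higher priority (earlier in order).
        if record is not None and record[0] > timestamp:
            # A lower-priority task wrote first: restore the old value.
            writer_ts, old_value = self.pending.pop(addr)
            self.data[addr] = old_value
            self.aborted.append(writer_ts)
        return self.data[addr]


mem = SpeculativeMemory({"x": 10})
mem.write(timestamp=7, addr="x", value=99)   # lower-priority task runs early
value = mem.read(timestamp=3, addr="x")      # higher-priority task reads later
print(value, mem.aborted)  # 10 [7]
```

In this model the backed-out task is simply recorded in a list; a real system would presumably have to run it again after the higher-priority task has finished with the data.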
The next obvious step would be to build a physical prototype and run whole applications on it. At that point, we would get a better idea of how performance speedups across different types of codes weigh against the additional overhead of hardware task management. In the meantime, you can read the Swarm paper in the current issue of IEEE Micro, which unfortunately will set you back 33 USD, or 13 USD if you belong to IEEE.