Compared to a typical CPU, a brain is remarkably energy-efficient, in part because it combines memory, communications, and processing in a single execution unit: the neuron. A brain also has billions of those units, which lets it handle an enormous number of tasks in parallel.

Attempts to run neural networks on traditional CPUs run up against these fundamental mismatches. A CPU can execute only a handful of instructions at a time, and shuttling data back and forth to memory is slow. As a result, neural networks have tended to be both computationally and energy intensive. A few years back, IBM announced a processor design that was a bit closer to a collection of neurons and could execute trained neural networks far more efficiently. But it didn't help much with training the networks in the first place.

Now, IBM is back with a hardware design that's specialized for training neural networks. And it does this in part by performing the training calculations directly inside a specialized type of memory.
Changing phases

Phase-change memory is based on materials that can form two different structures, or phases, depending on how quickly they cool from a liquid. Because the two phases differ in electrical conductance, the difference can be used to store bits. It's also possible to control the temperature so that a bit enters a state of intermediate conductance. Beyond storing bits, this property can be used to perform calculations, since a series of sub-threshold phase changes can gradually add up to a bit flip.
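
A toy model makes the idea concrete. The sketch below (with made-up numbers, not IBM's actual device physics) treats a cell's conductance as a value that each sub-threshold pulse nudges upward; no single pulse flips the stored bit, but enough of them together do.

```python
# Hypothetical phase-change cell: values and step sizes are illustrative.
class PhaseChangeCell:
    def __init__(self, g_min=0.0, g_max=1.0, step=0.05):
        self.g = g_min                    # current conductance (normalized)
        self.g_max = g_max
        self.step = step                  # change per sub-threshold pulse

    def pulse(self):
        """One sub-threshold pulse: a small, cumulative conductance shift."""
        self.g = min(self.g + self.step, self.g_max)

    def read_bit(self, threshold=0.5):
        """The stored bit flips only once enough pulses have accumulated."""
        return 1 if self.g > threshold else 0

cell = PhaseChangeCell()
for _ in range(11):                       # no single pulse flips the bit...
    cell.pulse()
print(cell.g, cell.read_bit())            # ...but 11 of them do: ~0.55 -> 1
```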

The advantages of doing calculations this way are twofold: there are no trips back and forth to memory, since the operations take place in the memory itself, and many, many operations can be done in parallel. Those properties have natural parallels in the behavior of a population of neurons, which makes phase-change memory a (potentially) good fit for neural networks.
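
In an analog memory array, that parallelism comes almost for free. Drive each row with a voltage and, by Ohm's law, every cell passes a current proportional to its conductance times that voltage; the currents on each column wire then sum on their own. Here's an idealized numpy sketch of that behavior (the array sizes and values are arbitrary):

```python
import numpy as np

# Idealized crossbar: each cell's conductance G acts as a weight.
# Row voltages v produce per-cell currents i = G * v (Ohm's law), and
# the currents into each column wire add up (Kirchhoff's current law),
# so an entire matrix-vector product happens at once, in the memory itself.
rng = np.random.default_rng(0)
G = rng.uniform(0.0, 1.0, size=(4, 3))     # 4 input rows x 3 output columns
v = np.array([0.2, 0.9, 0.1, 0.5])         # voltages applied to the rows

column_currents = v @ G                    # all multiply-accumulates in parallel
print(column_currents)
```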

In fact, there's an additional parallel. Neuronal activity isn't a binary, all-or-nothing affair; it spans a range of intermediate behaviors between on and off. Phase-change memory's ability to adopt states between 1 and 0 thus lets it model that graded behavior directly.

To use this for training, a grid of phase-change memory bits can be mapped to each layer of a neural network. A communication network, made of more traditional wiring, lets the neurons communicate among themselves. The strength of each connection is set by the state of the corresponding memory cell: where it sits on the spectrum between fully on and fully off. That state, in turn, is set by all the bits that feed into it. The communication hardware translates the variable-strength signal from a phase-change bit into pulses of different durations, which are compatible with the digital communication network.
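
The duration encoding can be sketched simply: an analog level becomes a digital pulse whose length is proportional to the level, which ordinary digital wiring can carry without loss. (The scheme below is illustrative; the actual encoding may differ.)

```python
# Illustrative pulse-duration encoding of an analog level in [0, 1].
T_MAX = 100  # maximum pulse length in clock ticks (arbitrary choice)

def encode(level: float) -> int:
    """Analog level -> pulse duration, in ticks, on the digital network."""
    return round(max(0.0, min(1.0, level)) * T_MAX)

def decode(ticks: int) -> float:
    """Pulse duration -> recovered analog level."""
    return ticks / T_MAX

duration = encode(0.37)
print(duration, decode(duration))          # 37 ticks -> 0.37
```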

In principle, this all ties together nicely. In practice, however, there were two problems. One is simply that the hardware doesn't offer the full range of states between 1 and 0 needed to make neural networks effective. The other is that there's bit-to-bit variability in how the pieces of the system respond. So the IBM team came up with a two-level system for training.
Level up

At the top level, the neural network is implemented using phase-change bits, as described above. But this hardware isn't updated on every training example. Instead, a few hundred training trials are run on separate hardware, and the results of that training are then folded into the phase-change bits. This updating process can be made to happen only when the training would actually change the value of a bit, which improves efficiency.
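
In schematic terms, the schedule looks something like the sketch below (the flush interval, step size, and update stream are all placeholders): updates accumulate in fast volatile hardware, and the phase-change bit is reprogrammed only when the accumulated change amounts to at least one programmable step.

```python
import random

FLUSH_INTERVAL = 400    # examples between transfers (placeholder)
PCM_STEP = 0.05         # smallest programmable conductance change (placeholder)

pcm_weight = 0.0        # durable value held in the phase-change cell
accumulator = 0.0       # fast, volatile running total of training updates

random.seed(0)
for i in range(2000):                         # stand-in for a training stream
    accumulator += random.gauss(0.001, 0.01)  # stand-in backprop update
    if (i + 1) % FLUSH_INTERVAL == 0:
        # Reprogram the phase-change cell only if the accumulated change
        # is at least one step; skipping no-op writes saves energy.
        n_steps = int(accumulator / PCM_STEP)
        if n_steps != 0:
            pcm_weight += n_steps * PCM_STEP
            accumulator -= n_steps * PCM_STEP

print(pcm_weight)
```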

The phase-change network can be viewed as the first digit of a two-digit number: its value has the larger impact on the behavior of individual neurons. The second level, implemented in more traditional silicon, acts like the second digit, with a smaller influence on the overall behavior. Combining the two digits increases the range of values that individual neurons in the network can take.
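
The arithmetic of the analogy is straightforward: if the phase-change cell holds the high digit and the volatile cell the low digit, the effective weight is a scaled sum, and together the two parts span far more distinct values than either alone. (The level counts below are illustrative, not the paper's.)

```python
# Illustrative two-digit weight: a coarse, durable "most significant" part
# (phase-change cell) plus a fine, volatile "least significant" part.
PCM_LEVELS = 16         # distinct states in the phase-change cell (made up)
VOLATILE_LEVELS = 16    # distinct states in the capacitor cell (made up)

def effective_weight(pcm_level: int, volatile_level: int) -> int:
    # The phase-change digit is scaled up so it dominates the value,
    # while the volatile digit fills in the gaps between its steps.
    return pcm_level * VOLATILE_LEVELS + volatile_level

# Either part alone offers 16 values; together they offer 16 * 16 = 256.
print(PCM_LEVELS * VOLATILE_LEVELS, effective_weight(3, 7))  # 256, 55
```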

This lower level is also where each individual training cycle happens. The researchers mimic the behavior of a phase-change bit by hooking the gate of a transistor up to a capacitor. The charge stored in the capacitor sets the voltage on the gate, and with it the conductance of the transistor; updating the charge changes that conductance, much like reprogramming a phase-change bit. The capacitor's charge is volatile, though, and leaks away within a matter of microseconds. That puts a limit on how many training cycles can be done before the capacitor's value has to be used to update the phase-change bit.
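
The time pressure can be pictured as simple exponential leakage. The time constant and tolerance below are placeholders, not measured values, but the structure is the point: once the charge has sagged appreciably, the value must be flushed to the phase-change bit.

```python
import math

TAU_US = 20.0       # leakage time constant, microseconds (placeholder)
STEP_US = 1.0       # time taken by one training cycle (placeholder)
TOLERANCE = 0.8     # charge fraction below which the stored value is stale

charge = 1.0        # normalized charge right after an update
cycles = 0
while charge >= TOLERANCE:
    charge *= math.exp(-STEP_US / TAU_US)   # exponential leak per cycle
    cycles += 1

# Rough budget of training cycles before a flush is forced
print(cycles, charge)
```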

In short, the two-level system allows training to take place in a volatile setup built from traditional hardware, and the training results are then used to update the phase-change bits, which act as permanent storage. The values associated with the individual "neurons" can be read out at any time and used to replicate the behavior on any compatible neural-network implementation, including IBM's dedicated chips.

Because this setup can be reconfigured on the fly, it's possible to work around the variability of individual transistors. For example, the hardware used to move information forward between layers on one training cycle can handle the back-propagation of errors on the next, and vice versa. Over time, this should average out the differences caused by device variability.
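
A toy calculation suggests why alternating roles helps. Give two physical arrays the same intended weights but independent fixed errors; averaged across cycles, the effective weights typically sit closer to the intended ones than either array's do alone. (The error magnitudes here are invented for illustration.)

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=(3, 3))                       # intended weights
err_a = 1 + rng.normal(0.0, 0.05, size=(3, 3))    # array A's fixed mismatch
err_b = 1 + rng.normal(0.0, 0.05, size=(3, 3))    # array B's fixed mismatch

# Alternating which array handles a given role on each cycle means the
# long-run effective weight is the average of the two arrays' versions.
effective = (w * err_a + w * err_b) / 2

print(np.abs(w * err_a - w).mean())               # deviation using A alone
print(np.abs(effective - w).mean())               # deviation when alternating
```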
Fast and efficient

It all sounds good, but does it work? The authors set it loose on a standard training-and-testing image set of handwritten digits (called MNIST). The results were typically within a single percentage point of the accuracy achieved by a standard neural-network package, TensorFlow.

But the key thing is the efficiency. The researchers estimate that, if their hardware were built with current process technology (the prototype used a rather ancient 90nm process), it would outperform a state-of-the-art GPU by a factor of at least 100 and be more than 100 times as power-efficient.

Of course, those are estimates. And, to a certain extent, so is much of this work. While the team did implement its phase-change-based layers in hardware, some details of the capacitor-based layers were handled in simulation. There's no reason to think they couldn't be implemented in hardware, but there will undoubtedly be details and potential issues that won't become apparent until they actually are.

Still, the potential efficiency gains are substantial, and IBM has already shown its willingness to build dedicated neural-network hardware. So there are likely to be some real-world numbers to back up this report before too long.