Much of the recent progress in AI has come from building ever-larger neural networks. A new chip powerful enough to handle “brain-scale” models could turbo-charge this approach.
Chip startup Cerebras leaped into the limelight in 2019 when it came out of stealth to reveal a 1.2-trillion-transistor chip.
The chip, called the Wafer Scale Engine, was the size of a dinner plate and was then the world’s largest computer chip. Earlier this year Cerebras unveiled the Wafer Scale Engine 2 (WSE-2), which more than doubled the number of transistors to 2.6 trillion.
Now the company has outlined a series of innovations that mean its latest chip can train a neural network with up to 120 trillion parameters. For reference, OpenAI’s revolutionary GPT-3 language model contains 175 billion parameters. The largest neural network to date, which was trained by Google, had 1.6 trillion.
“Larger networks, such as GPT-3, have already transformed the natural language processing landscape, making possible what was previously unimaginable,” said Cerebras CEO and co-founder Andrew Feldman in a press release.
“The industry is moving past 1 trillion parameter models, and we are extending that boundary by two orders of magnitude, enabling brain-scale neural networks with 120 trillion parameters.”
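The “two orders of magnitude” claim can be checked against the parameter counts quoted above. The snippet below is a back-of-the-envelope sketch only; the constant and function names are hypothetical and chosen for illustration.

```python
import math

# Parameter counts as quoted in the article.
GPT3_PARAMS = 175 * 10**9
LARGEST_TRAINED_PARAMS = 16 * 10**11    # 1.6 trillion
CEREBRAS_CLAIMED_PARAMS = 120 * 10**12  # 120 trillion

def scale_up(target, baseline):
    """Return (ratio, orders of magnitude) of target relative to baseline."""
    ratio = target / baseline
    return ratio, math.log10(ratio)

vs_largest = scale_up(CEREBRAS_CLAIMED_PARAMS, LARGEST_TRAINED_PARAMS)
vs_gpt3 = scale_up(CEREBRAS_CLAIMED_PARAMS, GPT3_PARAMS)

print(f"vs. 1.6T model: {vs_largest[0]:.0f}x ({vs_largest[1]:.2f} orders of magnitude)")
print(f"vs. GPT-3:      {vs_gpt3[0]:.0f}x ({vs_gpt3[1]:.2f} orders of magnitude)")
```

By this count the jump is about 75 times the 1.6-trillion-parameter model, a little under two full orders of magnitude, and about 686 times GPT-3.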
The genius of Cerebras’ approach is that rather than taking a silicon wafer and splitting it up to make hundreds of smaller chips, it makes a single massive one. While your average GPU will have a few hundred cores, the WSE-2 has 850,000. Because they’re all on the same hunk of silicon, they can work together far more seamlessly.
This makes the chip ideal for tasks that require huge numbers of operations to happen in parallel, including both deep learning and various supercomputing applications. And earlier this week at the Hot Chips conference, the company unveiled new technology that pushes the WSE-2’s capabilities even further.
A major challenge for large neural networks is shuttling around all the data involved in their calculations. Most chips have a limited amount of memory on-chip, and every time data has to be shuffled in and out it creates a bottleneck, which limits the practical size of networks.
The WSE-2 already has an enormous 40 gigabytes of on-chip memory, which means it can hold even the largest of today’s networks. But the company has also built an external unit called MemoryX that provides up to 2.4 petabytes of high-performance memory, which is so tightly integrated that it behaves as if it were on-chip.
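For a rough sense of scale, the 2.4-petabyte figure can be divided by the 120-trillion-parameter ceiling quoted earlier to get a memory budget per parameter. This sketch assumes decimal units (1 petabyte = 10^15 bytes), which the article does not specify, and the names are hypothetical.

```python
# Assumption: decimal petabytes (10**15 bytes); the article does not say.
MEMORYX_BYTES = 24 * 10**14       # 2.4 petabytes
MAX_PARAMS = 120 * 10**12         # 120 trillion parameters

def bytes_per_parameter(memory_bytes, n_params):
    """Average number of bytes of memory available for each parameter."""
    return memory_bytes / n_params

budget = bytes_per_parameter(MEMORYX_BYTES, MAX_PARAMS)
print(f"{budget:.0f} bytes per parameter")
```

Under those assumptions the budget comes to 20 bytes per parameter, which would leave room for more than a single 16-bit weight per parameter. How Cerebras actually allocates MemoryX is not described here.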
Cerebras has also revamped its approach to the data it shuffles around. Previously the guts of the neural network would be stored on the chip, and only the training data would be fed in. Now, though, the weights of the connections between the network’s neurons are kept in the MemoryX unit and streamed in during training.
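The general idea of streaming weights in from an external store can be illustrated with a toy example. This is a minimal sketch of the concept, not Cerebras’ implementation: the `ExternalWeightStore` class and every function name are hypothetical, the network is a tiny fully connected one, and only a forward pass is shown rather than full training.

```python
class ExternalWeightStore:
    """Hypothetical stand-in for an off-chip weight store."""

    def __init__(self, layers):
        self._layers = layers  # one weight matrix (list of rows) per layer

    def stream(self):
        # Hand out one layer's weights at a time, in order.
        for weights in self._layers:
            yield weights


def matvec(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(values):
    return [max(0.0, v) for v in values]


def forward_streaming(store, x):
    """Run a forward pass while holding only one layer's weights at a time.

    Returns the output and the largest number of weights resident at once.
    """
    peak_resident = 0
    for weights in store.stream():
        resident = sum(len(row) for row in weights)
        peak_resident = max(peak_resident, resident)
        x = relu(matvec(weights, x))
        # The layer's weights are dropped here before the next layer arrives.
    return x, peak_resident


# Two small layers: 2 inputs -> 2 hidden units -> 1 output.
store = ExternalWeightStore([
    [[1.0, -1.0], [0.5, 0.5]],
    [[2.0, 1.0]],
])
output, peak = forward_streaming(store, [1.0, 2.0])
print(output, peak)
```

In this toy run the network has six weights in total, but no more than four are resident at any moment. The point of the design is that the compute side needs room for the largest single layer, not for the whole model.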
By combining these two innovations, the company says, it can train networks two orders of magnitude larger than anything that exists today. Other advances announced at the same time include the ability to run extremely sparse (and therefore efficient) neural networks, and a new communication system dubbed SwarmX that makes it possible to link up to 192 chips to create a combined total of 163 million cores.
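The combined core count follows directly from the per-chip figure quoted earlier. A quick check, with hypothetical constant names:

```python
CORES_PER_WSE2 = 850_000   # per-chip core count quoted in the article
MAX_LINKED_CHIPS = 192     # SwarmX limit quoted in the article

total_cores = CORES_PER_WSE2 * MAX_LINKED_CHIPS
print(f"{total_cores:,} cores")  # 163,200,000
```

That is 163.2 million cores, consistent with the rounded figure of 163 million.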
How much all this cutting-edge technology will cost and who is in a position to take advantage of it is unclear. “This is highly specialized stuff,” Mike Demler, a senior analyst with the Linley Group, told Wired. “It only makes sense for training the very largest models.”
While the size of AI models has been increasing rapidly, it’s likely to be years before anyone can push the WSE-2 to its limits. And despite the insinuations in Cerebras’ press material, the fact that the parameter count roughly matches the number of synapses in the brain doesn’t mean the new chip will be able to run models anywhere close to the brain’s complexity or performance.
There’s a major debate in AI circles today over whether we can achieve artificial general intelligence by simply building larger neural networks, or whether this will require new theoretical breakthroughs. So far, increasing parameter counts has led to pretty consistent jumps in performance. A two-order-of-magnitude improvement over today’s largest models would undoubtedly be significant.