One of the key challenges of deep reinforcement learning models, the kind of AI systems that have mastered Go, StarCraft 2, and other games, is their inability to generalize beyond their narrow training domain. This limitation makes it very hard to apply these systems to real-world settings, where situations are far more complicated and unpredictable than the environments in which AI models are trained. The key advantage of reinforcement learning is its ability to develop behavior by taking actions and receiving feedback, similar to the way humans and animals learn by interacting with their environment. Some scientists describe reinforcement learning as “the first computational theory of intelligence.”
The combination of reinforcement learning and deep neural networks, known as deep reinforcement learning, has been at the heart of many advances in AI, including DeepMind’s famous AlphaGo and AlphaStar models. In both cases, the AI systems were able to outmatch human world champions at their respective games.
But reinforcement learning systems are also notorious for their lack of flexibility. For example, a reinforcement learning model that can play StarCraft 2 at an expert level won’t be able to play a game with similar mechanics at any level of competency.
To address this shortcoming, DeepMind’s researchers created XLand, an engine that can generate 3D environments composed of static topology and movable objects. The game engine simulates physics and allows players to use the objects in various ways.
XLand is a rich environment in which agents can be trained on a virtually unlimited number of tasks. One of its main advantages is the ability to use programmatic rules to automatically generate a vast array of environments and challenges for training AI agents.
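To make the idea of programmatic task generation concrete, here is a minimal sketch in Python. The object names, relations, and the `Task` structure are hypothetical stand-ins, not XLand’s actual grammar; the point is only to show how a few compositional rules can yield a combinatorially large space of worlds and goals.

```python
import random
from dataclasses import dataclass

# Hypothetical vocabulary -- XLand's real grammar of objects and goal
# predicates is richer; these stand-ins just illustrate composition.
OBJECTS = ["pyramid", "cube", "sphere", "slab"]
COLORS = ["black", "purple", "yellow"]
RELATIONS = ["near", "on", "hold", "see"]

@dataclass
class Task:
    world_seed: int  # seeds the procedural 3D topology
    goal: list       # a disjunction of conjunctions of atomic predicates

def sample_predicate(rng: random.Random) -> tuple:
    """Sample one atomic predicate, e.g. ('near', 'purple sphere', 'black cube')."""
    a = f"{rng.choice(COLORS)} {rng.choice(OBJECTS)}"
    b = f"{rng.choice(COLORS)} {rng.choice(OBJECTS)}"
    return (rng.choice(RELATIONS), a, b)

def sample_task(rng: random.Random) -> Task:
    """Compose a task: a random world plus a small logical goal expression."""
    n_options = rng.randint(1, 3)  # alternative ways to satisfy the goal
    goal = [[sample_predicate(rng) for _ in range(rng.randint(1, 2))]
            for _ in range(n_options)]
    return Task(world_seed=rng.randrange(2**32), goal=goal)

rng = random.Random(0)
for task in (sample_task(rng) for _ in range(3)):
    print(task)
```

Because worlds and goals are sampled independently, the number of distinct tasks grows multiplicatively, which is what makes “virtually unlimited” more than a figure of speech.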
DeepMind uses deep reinforcement learning and a few clever tricks to create AI agents that can thrive in the XLand environment. The reinforcement learning model of each agent receives a first-person view of the world, the agent’s physical state, and its current goal. Each agent fine-tunes the parameters of its policy neural network to maximize its rewards on the current task. The neural network architecture includes an attention mechanism that helps the agent balance optimization across the subgoals required to accomplish the main goal.
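As a rough illustration of that setup, the following PyTorch sketch fuses a first-person view, a proprioceptive state vector, and a set of goal-predicate embeddings through a single attention layer, then emits policy logits and a value estimate. Every dimension, module, and interface here is an assumption made for illustration; DeepMind’s actual architecture is larger and more elaborate.

```python
import torch
import torch.nn as nn

class GoalAttentionPolicy(nn.Module):
    """Minimal sketch: fuse a first-person observation, a proprioceptive
    state, and an embedded goal via attention, then output an action
    distribution and a value. Sizes are illustrative, not DeepMind's."""

    def __init__(self, n_actions: int, d_model: int = 128):
        super().__init__()
        # Tiny conv encoder for the first-person RGB view.
        self.vision = nn.Sequential(
            nn.Conv2d(3, 16, 8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(d_model),
        )
        self.state_enc = nn.Linear(8, d_model)   # assumed 8-d physical state
        self.goal_enc = nn.Linear(16, d_model)   # one embedding per goal predicate
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.policy_head = nn.Linear(d_model, n_actions)
        self.value_head = nn.Linear(d_model, 1)

    def forward(self, rgb, state, goal_preds):
        # rgb: (B, 3, 72, 96); state: (B, 8); goal_preds: (B, n_preds, 16)
        obs = self.vision(rgb) + self.state_enc(state)  # fused observation
        query = obs.unsqueeze(1)                        # attend from the observation...
        keys = self.goal_enc(goal_preds)                # ...over the goal's predicates
        attended, _ = self.attn(query, keys, keys)
        h = attended.squeeze(1)
        return self.policy_head(h), self.value_head(h)

net = GoalAttentionPolicy(n_actions=10)
logits, value = net(torch.randn(2, 3, 72, 96), torch.randn(2, 8), torch.randn(2, 4, 16))
print(logits.shape, value.shape)  # torch.Size([2, 10]) torch.Size([2, 1])
```

The attention step is the interesting part: the observation acts as the query and the goal’s predicates as keys and values, so the agent can weigh which subgoal matters most in its current situation.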
The performance of the reinforcement learning agents was evaluated on their general ability to accomplish a wide range of tasks they had not been trained on. The test tasks included well-known challenges such as “capture the flag” and “hide and seek.”
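“Not trained on” is the operative phrase: evaluation here is zero-shot, meaning the agent’s weights are frozen and it gets no task-specific fine-tuning. The toy loop below, with an invented stand-in agent and environment, shows the shape of such an evaluation; none of these classes correspond to real DeepMind code.

```python
import random

class RandomAgent:
    """Stand-in for a trained, frozen agent (hypothetical interface)."""
    def act(self, obs):
        return random.randrange(4)

class ToyEnv:
    """Stand-in held-out task: ten steps, reward for guessing right."""
    def __init__(self, task_seed):
        self.rng = random.Random(task_seed)
        self.t = 0
    def reset(self):
        self.t = 0
        return 0
    def step(self, action):
        self.t += 1
        reward = 1.0 if action == self.rng.randrange(4) else 0.0
        return 0, reward, self.t >= 10

def evaluate_zero_shot(agent, task_seeds, episodes_per_task=3):
    """Run the frozen agent on held-out tasks; no parameter updates occur."""
    scores = {}
    for seed in task_seeds:
        returns = []
        for _ in range(episodes_per_task):
            env = ToyEnv(seed)
            obs, done, total = env.reset(), False, 0.0
            while not done:
                obs, reward, done = env.step(agent.act(obs))
                total += reward
            returns.append(total)
        scores[seed] = sum(returns) / len(returns)
    return scores

print(evaluate_zero_shot(RandomAgent(), task_seeds=[1, 2, 3]))
```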
According to DeepMind, each agent played around 700,000 unique games in 4,000 unique worlds within XLand and went through 200 billion training steps across 3.4 million unique tasks.
“At this time, our agents have been able to participate in every procedurally generated evaluation task except for a handful that were impossible even for a human,” the AI researchers wrote. “And the results we’re seeing clearly exhibit general, zero-shot behaviour across the task space.”
According to DeepMind, the reinforcement learning agents exhibit the emergence of “heuristic behavior” such as tool use, teamwork, and multi-step planning. If confirmed, this could be an important milestone. Deep learning systems are often criticized for learning statistical correlations instead of causal relations. If neural networks could develop high-level notions such as using objects to create ramps or cause occlusions, it could have a great impact on fields such as robotics and self-driving cars, where deep learning currently struggles.
Some of DeepMind’s top scientists recently published a paper in which they hypothesize that a single reward, combined with reinforcement learning, is enough to eventually reach artificial general intelligence. The scientists believe an intelligent agent with the right incentives can develop all kinds of capabilities, such as perception and natural language understanding.
Although DeepMind’s new approach still requires training reinforcement learning agents on multiple engineered rewards, it is in line with the company’s general vision of achieving AGI through reinforcement learning.
“What DeepMind shows with this paper is that a single RL agent can develop the intelligence to reach many goals, rather than just one,” Chris Nicholson, CEO of Pathmind, told TechTalks. “And the skills it learns in accomplishing one thing can generalize to other goals. That is very similar to how human intelligence is applied. For example, we learn to grab and manipulate objects, and that is the foundation of accomplishing goals that range from pounding a hammer to making your bed.”
Nicholson also believes that other aspects of the paper’s findings hint at progress toward general intelligence. “Parents will recognize that open-ended exploration is precisely how their toddlers learn to move through the world. They take something out of a cupboard, and put it back in. They invent their own small goals, which may seem meaningless to adults, and they master them,” he said. “DeepMind is programmatically setting goals for its agents within this world, and those agents are learning how to master them one by one.”
In a nutshell, the paper shows that if you can create a sufficiently complex environment, design the right reinforcement learning architecture, and expose your models to enough experience, you’ll be able to generalize to many kinds of tasks within the same environment. And this is essentially how natural evolution has produced human and animal intelligence.
In fact, DeepMind has already done something similar with AlphaZero, a reinforcement learning model that mastered multiple two-player, turn-based games. The XLand experiment extends the same notion to a far more complex, open-ended setting.