A Generalist Agent

PoC/Research owned by DeepMind
Gato is a new 1.18-billion-parameter AI model developed by DeepMind. It is a general-purpose model that can handle:

  1. multi-modality, i.e., it operates on different kinds of input such as text, images, and previous actions.
  2. multi-task learning, i.e., it can identify the context of a specific task and act appropriately within it.

The Gato model is inspired by large language models, e.g., GPT-3 and Flamingo, whose inputs consist of embedding vectors, each corresponding to a word. Gato uses the same idea of embeddings to map different inputs, such as text, images, and continuous-valued vectors, into a common space.
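To make the "common space" concrete, here is a minimal sketch of how continuous-valued inputs (e.g., joint torques) can be turned into discrete token ids that live alongside a text vocabulary, using mu-law companding followed by uniform binning, as described in the Gato paper. The specific constants (`mu=100`, `m=256`, a 32,000-word text vocabulary, 1,024 bins) and the function names are assumptions for illustration:

```python
import numpy as np

def mu_law_encode(x, mu=100.0, m=256.0):
    # Compand a continuous value so that small magnitudes get finer
    # resolution (constants are illustrative assumptions).
    return np.sign(x) * np.log(np.abs(x) * mu + 1.0) / np.log(m * mu + 1.0)

def tokenize_continuous(x, n_bins=1024, vocab_offset=32000):
    # Discretize companded values into uniform bins over [-1, 1], then
    # shift the bin indices past the (assumed 32k) text vocabulary so
    # text tokens and continuous-value tokens share one id space.
    companded = np.clip(mu_law_encode(np.asarray(x, dtype=float)), -1.0, 1.0)
    bins = np.floor((companded + 1.0) / 2.0 * n_bins).astype(int)
    bins = np.clip(bins, 0, n_bins - 1)
    return bins + vocab_offset

# A value of 0.0 lands in the middle bin, offset into the shared id space.
tokens = tokenize_continuous([0.0])
```

Because the companding is monotonic, nearby continuous values map to nearby token ids, which keeps the discretization friendly to sequence modeling.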

Is it possible to create a single AI model able to perform diverse tasks using multi-modal input data?
The underlying idea is quite simple and can be divided into two steps:

  1. Embed the input sources into a common vector space. For example, images are divided into normalized patches (as in the Vision Transformer) and each patch is fed through a ResNet block to produce an embedding. The embedding size is 2048.
  2. Use a decoder-only transformer to predict the next token in the sequence.
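The key property of step 2 is the causal (decoder-only) structure: each position in the embedded sequence may attend only to itself and earlier positions, so the model can be trained to predict the next token. Below is a single-head, NumPy-only sketch of causal self-attention; the random projection matrices and the tiny dimensions are illustrative, not the model's actual parameters:

```python
import numpy as np

def causal_self_attention(x):
    # x: (seq_len, d) sequence of embeddings.
    # Single-head attention with a causal mask, so position i only
    # attends to positions 0..i (the decoder-only constraint).
    seq, d = x.shape
    rng = np.random.default_rng(0)          # fixed weights for the sketch
    Wq = rng.normal(scale=d ** -0.5, size=(d, d))
    Wk = rng.normal(scale=d ** -0.5, size=(d, d))
    Wv = rng.normal(scale=d ** -0.5, size=(d, d))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    # Mask out all future positions before the softmax.
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores[future] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

A quick way to check the causality property: perturbing a later token must leave the outputs at all earlier positions unchanged.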

Gato is trained in a purely supervised way (although the authors note that training with reinforcement learning is also possible in principle). The data used to train Gato include:

  • Simulated control tasks, where the training data is generated by specialist state-of-the-art reinforcement learning agents. The control tasks include, for example, 3D vision-based navigation and Atari games.
  • The MassiveText dataset, which contains web pages, books, code, etc.
  • Vision-language datasets, e.g., images paired with captions, containing over 2.1 billion samples in total.
  • Proprioception data from robotics.
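Since all of this data is flattened into one token sequence, "purely supervised" training reduces to a standard next-token cross-entropy loss, with a mask selecting which positions contribute (in the paper, only text and action tokens are predicted, not observations). A minimal sketch of that masked loss; the function name and array layout are assumptions for illustration:

```python
import numpy as np

def masked_next_token_loss(logits, tokens, loss_mask):
    # logits:    (L, V) model scores; logits[i] predicts token i+1.
    # tokens:    (L,)   ground-truth token ids.
    # loss_mask: (L,)   1 where the target is a text/action token
    #                   (observation tokens are not predicted).
    preds = logits[:-1]            # predictions for positions 1..L-1
    targets = tokens[1:]           # the tokens those positions should be
    mask = loss_mask[1:].astype(float)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = preds - preds.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    # Average the negative log-likelihood over the masked positions only.
    return (nll * mask).sum() / max(mask.sum(), 1.0)
```

With uniform (all-zero) logits over a vocabulary of size V, the loss is log(V), which is a handy sanity check when wiring up training.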

Here is a short summary of the performance:

  • Gato exceeds the average human score on 23 Atari games and achieves more than twice human performance on 11 of them.
  • On the task of stacking objects of previously unseen shapes, Gato achieves results on par with specialized algorithms.
  • No quantitative metrics are provided for image captioning, but the presented examples suggest good performance on this task as well.


Prediction, Language, Robotics, Vision
Image Data, Sensor Data, Textual Data