A Generalist Agent

PoC/Research owned by DeepMind
Gato is a 1.18 billion-parameter AI model developed by DeepMind. It is a general-purpose model that handles:

  1. multi-modality, i.e., it operates on different inputs such as text, images, and previous actions.
  2. multi-task, i.e., it identifies the context of a specific task and executes successfully within that context.

The Gato model is inspired by large language models, e.g., GPT-3 and Flamingo, whose inputs consist of embedding vectors, where each vector corresponds to a token such as a word or word piece. Gato uses the same idea of embeddings to map different inputs such as text, images, and continuous-valued vectors into a common space.

Is it possible to create a single AI model able to perform diverse tasks using multi-modal input data?
The idea underlying Gato is quite simple and may be divided into two steps:

  1. Embed the input sources into a common vector space. For example, images are divided into normalized patches (similar to a vision transformer) and fed through a ResNet to create an embedding for each patch. The embedding size is 2048.
  2. Use a decoder-only transformer to make predictions.
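
The first step can be illustrated with a small sketch that maps text tokens and image patches into one sequence of equally sized vectors. This is a toy under stated assumptions, not Gato's implementation: a random linear projection stands in for the ResNet, the embedding size is 8 rather than 2048, pixel values are assumed to lie in [0, 255], and all names (`EMBED_DIM`, `split_into_patches`, `embed_sequence`, and so on) are hypothetical.

```python
import random

EMBED_DIM = 8  # illustrative only; the summary above cites 2048 for Gato


def split_into_patches(image, patch):
    """Split an H x W image (list of rows) into flattened, non-overlapping
    patch vectors in raster order. Assumes H and W are multiples of `patch`."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def normalize(patch_vec):
    """Rescale pixel values from [0, 255] to [-1, 1] (an assumed normalization)."""
    return [v / 127.5 - 1.0 for v in patch_vec]


def random_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def project(matrix, vec):
    """Matrix-vector product: maps an input vector to len(matrix) outputs."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def embed_sequence(text_ids, image, patch, vocab_size, seed=0):
    """Return one list of EMBED_DIM-sized vectors: text tokens, then patches."""
    rng = random.Random(seed)
    token_table = random_matrix(vocab_size, EMBED_DIM, rng)    # lookup table for text
    patch_proj = random_matrix(EMBED_DIM, patch * patch, rng)  # stand-in for the ResNet
    text_emb = [token_table[i] for i in text_ids]
    image_emb = [project(patch_proj, normalize(p))
                 for p in split_into_patches(image, patch)]
    return text_emb + image_emb
```

Calling `embed_sequence([1, 2, 3], image, patch=2, vocab_size=10)` on a 4 x 4 image yields seven vectors (three text tokens plus four patches), all of the same size, so a single decoder-only transformer can consume them as one sequence.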

Gato is currently trained in a purely supervised way (although the authors mention that reinforcement learning is a viable option). The data used to train Gato are:

  • Simulated control tasks, where the training data is generated by specialist state-of-the-art reinforcement learning algorithms. For example, the control tasks include 3D vision navigation and Atari games.
  • The MassiveText dataset, which contains web pages, books, code, etc.
  • Vision-language datasets, e.g., images with captions. These data contain over 2.1 billion samples.
  • Proprioception data from robotics.
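
For a decoder-only model, supervised training is commonly implemented as next-token prediction with a cross-entropy loss. The sketch below shows that generic objective. It assumes the model has already produced one vector of logits per sequence position, and whether it matches Gato's exact loss is not stated in this summary. The function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits_per_step, target_ids):
    """Mean cross-entropy between predicted distributions and target tokens.

    logits_per_step[t] holds the model's scores over the vocabulary at
    position t; target_ids[t] is the token that actually came next.
    """
    losses = []
    for logits, target in zip(logits_per_step, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)
```

With uniform logits over a vocabulary of four tokens, the loss equals `math.log(4)`, the value expected from a model that has learned nothing yet. Training lowers it by shifting probability mass toward the recorded targets.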

Here is a short summary of the performance:

  • Gato exceeds the average human level on 23 Atari games and achieves more than twice the human score on 11 games.
  • On the task of stacking objects of previously unseen shapes, Gato achieves results on par with specialized algorithms.
  • No performance metrics are provided for image captioning, but the presented results suggest good performance on this task as well.

Attributes

Engineering
Prediction, Language, Robotics, Vision
DNN
Image Data, Sensor Data, Textual Data