A Generalist Agent
Gato is a generalist agent in two senses:
- multi-modal, i.e., it operates on different inputs such as text, images, and previous actions;
- multi-task, i.e., it identifies the context of a specific task and executes successfully within that context.
The Gato model is inspired by large language models such as GPT-3 and Flamingo, whose inputs consist of embedding vectors, one per token. The same idea of embeddings is used to map different inputs (text, images, and continuous-valued vectors) into a common space.
- Embed all input sources into a common vector space. For example, images are divided into non-overlapping 16x16 patches, normalized, and fed through a ResNet block to produce one embedding per patch (similar to the Vision Transformer). The embedding size is 2048.
- Feed the resulting token sequence to a decoder-only transformer that autoregressively predicts the next token (a minimal sketch follows this list).
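To make the shared-embedding idea concrete, here is a minimal PyTorch sketch. This is not the authors' code: the class names (`PatchEmbed`, `TinyGato`), the tiny residual block, the vocabulary size, and the scaled-down dimensions are all illustrative assumptions; only the 16x16 patch size and the idea of one shared token sequence into a causally masked transformer come from the paper.

```python
import torch
import torch.nn as nn

D = 256  # scaled-down embedding size for the sketch; the largest Gato model uses 2048


class PatchEmbed(nn.Module):
    """Embed 16x16 image patches with a single residual conv block
    (an illustrative stand-in for the ResNet block used in the paper)."""

    def __init__(self, d=D):
        super().__init__()
        self.conv_in = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        self.body = nn.Sequential(nn.GELU(), nn.Conv2d(64, 64, kernel_size=3, padding=1))
        self.proj = nn.Linear(64 * 16 * 16, d)

    def forward(self, patches):                        # (B, N, 3, 16, 16)
        b, n = patches.shape[:2]
        x = self.conv_in(patches.flatten(0, 1))        # (B*N, 64, 16, 16)
        x = x + self.body(x)                           # residual connection
        return self.proj(x.flatten(1)).view(b, n, -1)  # one vector per patch: (B, N, D)


class TinyGato(nn.Module):
    """Text tokens and image patches share one sequence into a decoder-only transformer."""

    def __init__(self, vocab=32_000, d=D, layers=4, heads=8):
        super().__init__()
        self.tok = nn.Embedding(vocab, d)  # text (and discretized-value) tokens
        self.img = PatchEmbed(d)
        layer = nn.TransformerEncoderLayer(d, heads, 4 * d, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, layers)  # decoder-only via causal mask
        self.head = nn.Linear(d, vocab)

    def forward(self, text_ids, patches):
        # Embed both modalities into the same D-dimensional space, then concatenate.
        seq = torch.cat([self.img(patches), self.tok(text_ids)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        return self.head(self.blocks(seq, mask=mask))  # next-token logits
```

In the real model, observations and actions from control episodes are serialized into the same sequence; only text and images are shown here to keep the sketch short.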
At present, Gato is trained in a purely supervised way (although the authors note that reinforcement learning is a viable alternative). The training data come from the following sources; a sketch of the supervised objective follows the list.
- Simulated control tasks, where the training data is generated by specialist, state-of-the-art reinforcement-learning agents. The control tasks include, for example, 3D visual navigation and Atari games.
- The MassiveText dataset, which contains web pages, books, code, etc.
- Vision-language datasets, e.g., images paired with captions; these datasets contain over 2.1 billion samples in total.
- Proprioceptive data from real-world robotics tasks.
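As a hedged illustration of the supervised objective (reusing the `TinyGato` sketch above), the snippet below discretizes continuous values with a simplified mu-law scheme, as the paper does for actions and proprioception, and applies a standard next-token cross-entropy loss, masked so that only target tokens contribute. The function names and masking details are assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F


def mu_law_tokens(x, mu=100.0, bins=1024):
    """Map continuous values in [-1, 1] to integer tokens via mu-law companding
    (a simplified variant of the paper's 1024-bin encoding)."""
    x = torch.clamp(x, -1.0, 1.0)
    y = torch.sign(x) * torch.log1p(mu * x.abs()) / torch.log1p(torch.tensor(mu))
    return ((y + 1) / 2 * (bins - 1)).long()  # token ids in [0, bins)


def supervised_step(model, text_ids, patches, loss_mask):
    """Next-token cross-entropy over the positions marked in loss_mask."""
    logits = model(text_ids, patches)   # (B, N+L, vocab)
    n_img = patches.size(1)
    # Position n_img-1+i in the sequence predicts target token i.
    pred = logits[:, n_img - 1 : -1]    # (B, L, vocab)
    loss = F.cross_entropy(
        pred.reshape(-1, pred.size(-1)), text_ids.reshape(-1), reduction="none"
    )
    mask = loss_mask.reshape(-1).float()  # 1 for target tokens (text/actions), 0 otherwise
    return (loss * mask).sum() / mask.sum()


# Toy usage: random "caption" tokens conditioned on five image patches.
model = TinyGato()
text = torch.randint(0, 32_000, (2, 12))
imgs = torch.rand(2, 5, 3, 16, 16)
supervised_step(model, text, imgs, torch.ones_like(text)).backward()
```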
- Gato exceeds the average human score on 23 Atari games and achieves more than twice human performance on 11 of them.
- On the task of stacking objects of previously unseen shapes, Gato achieves results on par with specialized algorithms.
- No quantitative metrics are reported for image captioning, but the presented examples suggest good performance on this task as well.