Robot Transformer / VLA / Chain of Thought
2023 Aug Week 1 Datumo Newsletter
Editor: Jodie Jung                                        
Robot Transformer, RT-2
From VLM to VLA: 'RT-2'. Image: Google DeepMind.

Do you remember DUM-E, Tony Stark's assistant robot? Last Friday, Google DeepMind announced a new AI model, 'RT-2' (Robotics Transformer 2). RT-2 is a transformer-based VLA (Vision-Language-Action) model that improves upon RT-1, which was unveiled in December last year.

Robots operating in the real world will encounter a variety of situations and will need to perform adequately even when faced with objects and environments they have never seen before. For example, they should be able to distinguish between the French and German flags and identify and catch an object falling from a table.

However, it has been difficult to give robotic AI recognition and reasoning capabilities that draw on general knowledge. There were two main problems. First, the traditional method of manually collecting millions of robot behavior samples has inherent limits. Second, even after training on this hard-earned data, there was a lack of models that could apply it to general, scalable real-time inference.

Performing well on general tasks without being trained on task-specific labeled data is a common characteristic of large language models (LLMs) that have been pre-trained on vast, Internet-scale web data.

In this letter, we will look at how the researchers train the model on both web and robotics data and apply that knowledge to robot control.
Co-fine-tune SOTA vision-language models
The pre-trained VLM undergoes co-fine-tuning.


Researchers propose a method of co-fine-tuning the state-of-the-art (SOTA) vision-language model with robot behavior data and vision-language data.

The researchers adapt the Pathways Language and Image model (PaLI-X) and the Pathways Language Model Embodied (PaLM-E) to act as the backbones of RT-2. RT-1 robot demonstration data, collected with 13 robots over 17 months in an office kitchen environment, was also used.

The three question-and-answer examples provided at the top of the image illustrate the following:

Internet-Scale VQA + Robot Action Data


Q1. English: What is happening in the image?
      (A grey donkey walks down the street)

Q2. French: What could you do with these items?
      (You could bake a cake)

Q3. What should the robot do to <task>?
      (Δ Translation [ 0.1, 0.2, 0 ] , Δ Rotation [ 10°, 25°, -7° ] )

By jointly learning from Internet-scale Visual Question Answering (VQA) data, which teaches the model to understand situations in images, and from robot trajectory data, the model gained the ability to respond to user commands using basic reasoning.
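As a rough illustration, the co-fine-tuning mixture can be pictured as one training stream in which web VQA pairs and robot trajectories share a single (image, prompt, target-text) format. The Python sketch below is a minimal mock-up under that assumption; the field names, examples, and sampling ratio are illustrative, not DeepMind's actual pipeline.

    import random

    # Hypothetical examples: a web VQA pair and a robot trajectory step,
    # both rendered as (image, prompt, target) triples so one VLM
    # training loop can consume them interchangeably.
    vqa_example = {"image": "street.jpg",
                   "prompt": "What is happening in the image?",
                   "target": "A grey donkey walks down the street"}
    robot_example = {"image": "kitchen_cam.jpg",
                     "prompt": "What should the robot do to pick up the bag?",
                     "target": "1 128 91 241 5 101 127 217"}  # action tokens as text

    def sample_cofinetune_batch(vqa_data, robot_data, batch_size, robot_ratio=0.5):
        """Draw a mixed batch; robot_ratio is an assumed mixture weight."""
        batch = []
        for _ in range(batch_size):
            source = robot_data if random.random() < robot_ratio else vqa_data
            batch.append(random.choice(source))
        return batch

    batch = sample_cofinetune_batch([vqa_example], [robot_example], batch_size=4)

Because both data sources end up as plain text targets, the same next-token training objective covers image descriptions and robot actions alike.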


To summarize: "A visual-language model (VLM) pre-trained on web-scale data learns from RT-1 robotics data to become RT-2, a visual-language-action (VLA) model that can control a robot."

Robot actions represented as a sequence of integers
Representation of an action string used in RT-2 training: a sequence of robot action token numbers.
 
So how does the robotic trajectory data align with the large vision-language model? The researchers represent robot actions as text tokens in the model's output, and these tokens are treated in exactly the same way as language tokens.

For example, a robot action can be represented as a string of eight integers, each ranging from 1 to 256, such as "1 128 91 241 5 101 127 217". The first token is a flag indicating whether to end or continue the current episode, the middle six tokens encode the change in the robot's position and rotation, and the final token indicates the extension level of the robot's gripper.
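As a concrete (and purely illustrative) sketch, the discretization and its inverse might look like the Python below; the value ranges and bin count are assumptions made for the example, not RT-2's actual calibration.

    # Discretize a continuous robot action into eight integer tokens
    # and decode them back. Ranges are illustrative assumptions.
    RANGES = [
        (0.0, 1.0),                                         # terminate flag
        (-0.5, 0.5), (-0.5, 0.5), (-0.5, 0.5),              # Δ translation (m)
        (-180.0, 180.0), (-180.0, 180.0), (-180.0, 180.0),  # Δ rotation (deg)
        (0.0, 1.0),                                         # gripper extension
    ]

    def encode_action(values, n_bins=256):
        """Map 8 continuous values to integer tokens in [1, n_bins]."""
        tokens = []
        for v, (lo, hi) in zip(values, RANGES):
            frac = min(max((v - lo) / (hi - lo), 0.0), 1.0)
            tokens.append(1 + int(frac * (n_bins - 1)))
        return tokens

    def decode_action(tokens, n_bins=256):
        """Invert encode_action, mapping tokens back to continuous values."""
        return [lo + (t - 1) / (n_bins - 1) * (hi - lo)
                for t, (lo, hi) in zip(tokens, RANGES)]

    print(encode_action([1.0, 0.1, 0.2, 0.0, 10.0, 25.0, -7.0, 0.8]))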
RT-2 architecture overview. Image: Google DeepMind.

The final model can directly predict the actions the robot should perform based on the robot's camera images. Additionally, DeepMind explained that,

"During inference, the text tokens are de-tokenized into robot actions, enabling closed loop control"

This means that the text tokens produced during training and inference (commands the robot can understand and execute) are converted back (de-tokenized) into actual robot actions at inference time.
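Combined with the decoding sketch above, the full cycle might look like the following; vla_model, camera, and robot are hypothetical stand-ins, not DeepMind's actual API.

    # Minimal closed-loop sketch: observe, predict action tokens,
    # de-tokenize (decode_action from the sketch above), act, and
    # repeat until the model emits the terminate flag. All objects
    # here are hypothetical placeholders.

    def run_episode(vla_model, camera, robot, instruction, max_steps=100):
        for _ in range(max_steps):
            image = camera.capture()                            # observe
            token_text = vla_model.predict(image, instruction)  # e.g. "1 128 91 ..."
            tokens = [int(t) for t in token_text.split()]
            action = decode_action(tokens)                      # de-tokenize
            if action[0] > 0.5:                                 # terminate flag set
                break
            robot.apply(action[1:])                             # act, then re-observe

Each fresh camera frame informs the next prediction, closing the loop between perception and action.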

Closed-loop control is a method by which a robot achieves its goals by continuously monitoring and adjusting its own actions. In this way, the robot can enhance its control capabilities by drawing on the knowledge already held by the backbone model. Next, we will examine how RT-2 becomes smarter on the strength of its natural language processing capabilities.
Applying Chain of Thought (CoT) to the robot
Rollouts of RT-2 with chain-of-thought reasoning,
where RT-2 generates both a plan and an action.
 
The researchers also tried applying the Chain-of-Thought prompting method, used with large generative language models, to RT-2. The Chain-of-Thought approach shows the model the intermediate steps of reasoning through examples placed before the main question, improving answer accuracy and quality.

The researchers added a 'Plan' stage to the data format: the purpose of the robot's action is first described in natural language, and then the actual action tokens are presented. For example, the model makes cause and effect explicit by inserting plan information (Pick up a chocolate bar) between the input command (I'm hungry) and the action (1 128 124 136 121 158 111 255).
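As a small sketch of how such a plan-augmented training example might be serialized, consider the Python below; the field names and layout are illustrative assumptions, not the paper's exact format.

    # One plan-augmented example: a natural-language "Plan" sits between
    # the instruction and the action tokens, so the model learns to state
    # its intent before acting. Field names are illustrative.
    cot_example = {
        "instruction": "I'm hungry",
        "plan": "Pick up a chocolate bar",
        "action": "1 128 124 136 121 158 111 255",
    }

    def to_training_text(ex):
        return (f"Instruction: {ex['instruction']}\n"
                f"Plan: {ex['plan']}\n"
                f"Action: {ex['action']}")

    print(to_training_text(cot_example))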

It is expected that, if Chain-of-Thought is used properly, even higher-level reasoning and actions will become possible. Examples of such actions cited in the abstract introducing RT-2 include 'recommend an energy drink to a tired person' and 'try using a rock if there is no hammer nearby'.

Let me end this letter with a video of a robot moving a blue block as instructed, featuring RT-2 :)
Aug Week 1 AI News Clip
 
#1. Anthropic, Google, Microsoft, and OpenAI are launching the Frontier Model Forum #link

  • Anthropic, Google, Microsoft, and OpenAI are launching the Frontier Model Forum, an industry body focused on ensuring safe and responsible development of frontier AI models. 
#2. ChatGPT Android App Released #link

  • This official app is free, syncs your history across devices, and brings you the newest model improvements from OpenAI.
#3. Personalization is the cornerstone of Bard AI's Vision #link

  • Google's upcoming Android 14 is set to revolutionise smartphones with innovative AI features, including generative AI to offer a personalised and seamless user experience.
The Data for Smarter AI


Datumo is an all-in-one data platform for your smarter AI.


Datumo Inc.
📋 contact@datumo.com