Policies
What are the latest AI robotics models?
A policy is the brain of your robot. It tells the robot what to do in a given situation. Mathematically, it’s a function π that maps the current state s of the robot to an action a: a = π(s).
- the state usually includes the position of the robot, the camera and sensor feeds, and the text instructions.
- the action depends on the robot. For example: high-level instructions (“move left”, “move right”), a 6-DoF (degrees of freedom) Cartesian pose (x, y, z, rx, ry, rz), or the angles of the joints…
- the policy is the AI model that controls the robot. It can be as simple as a hard-coded rule or as complex as a deep neural network (see the sketch after this list).
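To make this concrete, here is a minimal sketch of a policy as a function from state to action. All types and names here are hypothetical, for illustration only:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical containers for illustration; real state/action spaces
# depend on your robot.
@dataclass
class State:
    joint_angles: List[float]  # current joint positions, in radians
    camera_image: bytes        # latest camera frame
    instruction: str           # text instruction, e.g. "pick up the red ball"

@dataclass
class Action:
    joint_targets: List[float]  # target joint positions, in radians

def policy(state: State) -> Action:
    """A policy maps a state to an action: a = pi(s).
    A hard-coded rule works; a trained neural network replaces this body."""
    if "left" in state.instruction:
        return Action(joint_targets=[a - 0.1 for a in state.joint_angles])
    return Action(joint_targets=state.joint_angles)  # default: hold position
```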
Recent breakthroughs have made it possible to leverage the transformer architecture and internet-scale data to train more advanced policies that differ radically from old-school robotics and reinforcement learning.
Vision-Language-Action Models (VLAs)
The latest paradigm in AI robotics since 2024 is Vision-Language-Action models (VLAs). They leverage Large Language Models (LLMs) to understand and act on human instructions.
- VLAs are particularly well-suited for robotics because they function as the robot’s brain: a single model handles perception, language, and control.
- VLAs process both images and text instructions to predict the next action.
- VLAs were trained on internet-scale data, so they have some common sense.
Unlike AI models that generate text (like ChatGPT), these models output actions, such as “move left”.
Essentially, with a VLA, you could prompt your robot to “pick up the red ball” and it would do so.
The phospho dev kit helps you learn and experiment with VLAs.
What are the latest architectures in AI robotics?
Since 2024, there have been several breakthroughs in AI robotics. Here are some of the latest ideas.
ACT (Action Chunking with Transformers)
ACT (Action Chunking with Transformers) is a popular repo that showcases how to use transformers for robotics. The model is trained to predict sequences of actions (“chunks”) based on the current state of the robot and the camera images. ACT is an efficient way to do imitation learning. Learn more.
How it works:
- You record episodes of your robot performing a task (e.g., picking up a Lego brick).
- The model learns from this data and distills a policy that reproduces the behavior (e.g., it will pick up the Lego brick no matter where it is placed); see the sketch after this list.
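The core idea, action chunking, is that the model predicts a whole sequence of future actions at once instead of a single step. Here is a conceptual sketch (the function names are hypothetical stand-ins, not the actual ACT code):

```python
import numpy as np

CHUNK_SIZE = 50  # the model predicts this many future actions at once

def predict_chunk(observation: np.ndarray) -> np.ndarray:
    """Stand-in for the trained transformer: maps an observation to a
    (CHUNK_SIZE, action_dim) array of future actions."""
    return np.zeros((CHUNK_SIZE, 6))  # placeholder output

def control_loop(get_observation, execute_action, steps: int = 500):
    """Execute a predicted chunk, then re-predict from the new observation."""
    t = 0
    while t < steps:
        chunk = predict_chunk(get_observation())
        for action in chunk:
            execute_action(action)
            t += 1
            if t >= steps:
                break
```

The real ACT additionally smooths overlapping chunks with temporal ensembling to avoid jerky motion.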
Why use ACT?
- Typically requires ~30 episodes for training
- Training can run on an RTX 3000-series GPU in less than 30 minutes.
- This is a great starting point to get your hands dirty with AI in robotics.
- You don’t need prompts to train the model.
Train ACT on Replicate
A few dozen episodes are enough to train ACT to reproduce human demonstrations.
OpenVLA
OpenVLA is a great repo that showcases a more advanced model designed for complex robotics tasks. The architecture of OpenVLA includes a Llama 2 7B model that receives a prompt describing the task. This gives the model some common sense and allows it to generalize to new tasks.
Key differences with ACT:
- Training such a model requires more data and computational power.
- Typically needs ~100 episodes for training
- Training takes a few hours on an NVIDIA A100 GPU.
For more details, check out NVIDIA’s blog post on OpenVLA and the arXiv paper.
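As a sketch of what querying OpenVLA looks like, here is an inference snippet adapted from the usage example in the OpenVLA repository (double-check the repo for the current API; the prompt format and `unnorm_key` below follow its README):

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Load the model and processor from the Hugging Face Hub
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda")

image = Image.open("camera_frame.png")  # current camera frame
prompt = "In: What action should the robot take to pick up the red ball?\nOut:"

# Returns a 7-DoF action: position deltas, rotation deltas, gripper
inputs = processor(prompt, image).to("cuda", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
```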
Diffusion Transformers
Diffusion transformers are a family of models based on the diffusion process. Instead of deterministically mapping states to actions, the model generates the most probable next action by iteratively denoising random noise, based on patterns learned from data. This is the same mechanism behind many image generation models (e.g., DALL-E, Stable Diffusion, Midjourney).
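A toy sketch of the denoising loop, with a hypothetical `denoise_step` standing in for the trained network (real implementations add noise schedules and richer conditioning):

```python
import numpy as np

def denoise_step(noisy_action: np.ndarray, observation, step: int) -> np.ndarray:
    """Stand-in for the trained diffusion transformer: predicts a slightly
    less noisy action, conditioned on the observation."""
    return noisy_action * 0.9  # placeholder update rule

def generate_action(observation, action_dim: int = 6, num_steps: int = 20) -> np.ndarray:
    """Start from pure Gaussian noise and iteratively denoise it into an action."""
    action = np.random.randn(action_dim)
    for step in reversed(range(num_steps)):
        action = denoise_step(action, observation, step)
    return action
```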
Why consider Diffusion Transformers?
- The current #1 robotics model on Hugging Face is a diffusion transformer called RDT-1b.
- Fine-tuning the model on your own data is expensive, but inference is fast.
What are the latest models in AI robotics?
Here are some of the latest models that combine ideas from ACT, OpenVLA, and Diffusion Transformers.
GR00T-N1 by NVIDIA
GR00T-N1 (Generalist Robot 00 Technology) is NVIDIA’s foundation model for robots. It’s a performant model, trained on lots of data, which makes it ideal for fine-tuning. The model weights are available on Hugging Face.
GR00T-N1 combines a VLA for language understanding with a diffusion transformer for fine-grained control. For details, see the paper on arXiv.
Key features:
- Processes natural language instructions, camera feeds, and sensor data to generate actions.
- Based on denoising in the action space, similar to a diffusion transformer (see the dataflow sketch after this list).
- Trained on massive datasets of human movements, 3D environments, and AI-generated data.
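To illustrate how the two modules connect, here is a hypothetical dataflow sketch (not NVIDIA’s actual API; both functions are placeholders):

```python
import numpy as np

def vla_backbone(instruction: str, image: np.ndarray) -> np.ndarray:
    """Placeholder for the VLA: embeds the instruction and image into a
    conditioning vector for the action head."""
    return np.zeros(512)

def diffusion_action_head(conditioning: np.ndarray, num_steps: int = 10) -> np.ndarray:
    """Placeholder for the diffusion head: denoises random noise into a
    fine-grained action, conditioned on the VLA embedding."""
    action = np.random.randn(6)
    for _ in range(num_steps):
        action = 0.9 * action + 0.1 * conditioning[:6]  # placeholder update
    return action

embedding = vla_backbone("pick up the red ball", np.zeros((224, 224, 3)))
action = diffusion_action_head(embedding)
```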
Why use GR00T-N1?
- Typically requires ~50 episodes for training.
- Supports prompting and zero-shot learning for tasks not explicitly seen during training.
- Small model size (2B parameters) for efficient fine-tuning and fast inference on NVIDIA Jetson devices.
Train GR00T-N1 on Replicate
The GR00T-N1 model is a promptable model by NVIDIA.
pi0 and pi0 FAST by Physical Intelligence
pi0, also written as π₀ or pi zero, is a flow-matching (diffusion-style) vision-language-action model (VLA) by Physical Intelligence. The weights of pi0 are open-sourced on Hugging Face. Learn more.
pi0 FAST, also written as π₀-FAST or pi zero FAST, is an autoregressive VLA, based on the FAST action tokenizer. Similar to how LLMs generate text token by token, pi0 FAST generates actions token by token. Learn more.
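Conceptually, an action tokenizer converts continuous actions into discrete tokens that an autoregressive model can predict one at a time. A toy sketch (the real FAST tokenizer compresses action chunks, reportedly with a DCT-based scheme, before discretizing; everything below is illustrative):

```python
import numpy as np

NUM_BINS = 256  # vocabulary size of discrete action tokens

def tokenize(actions: np.ndarray, low: float = -1.0, high: float = 1.0) -> np.ndarray:
    """Toy tokenizer: quantize continuous actions in [low, high] into integers."""
    clipped = np.clip(actions, low, high)
    return ((clipped - low) / (high - low) * (NUM_BINS - 1)).astype(int)

def detokenize(tokens: np.ndarray, low: float = -1.0, high: float = 1.0) -> np.ndarray:
    """Map integer tokens back to continuous actions."""
    return tokens / (NUM_BINS - 1) * (high - low) + low

action = np.array([0.1, -0.4, 0.8])  # a continuous action
tokens = tokenize(action)            # -> integers the model predicts one by one
recovered = detokenize(tokens)       # -> approximately [0.1, -0.4, 0.8]
```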
RT-2 and AutoRT by Google DeepMind
RT-2 is Google DeepMind’s twist on VLAs. It’s a closed-source model, very similar to OpenVLA, based on the PaLM architecture. The model is trained on a large dataset of human demonstrations. Learn more.
AutoRT is a framework by Google DeepMind designed for robot fleets and data collection. An LLM generates “to-do lists” for robots based on descriptions of the environment. The tasks are then executed by teleoperators, a scripted pick policy, or RT-2 (Google’s VLA). Learn more.
LeRobot Integration
LeRobot is a GitHub repo by Hugging Face that implements training scripts for various policies in a standardized way. Supported policies include the following (a usage sketch follows the list):
- act
- diffusion
- pi0
- tdmpc
- vqbet
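As an example, a pretrained policy can be loaded and queried in a few lines. Import paths, repo ids, and observation keys vary across LeRobot versions and datasets, so treat this as a sketch and check the LeRobot README:

```python
import torch
from lerobot.common.policies.act.modeling_act import ACTPolicy

# Load a pretrained ACT checkpoint from the Hugging Face Hub
# (the repo id below is illustrative).
policy = ACTPolicy.from_pretrained("lerobot/act_aloha_sim_transfer_cube_human")
policy.eval()

# Dummy observation batch; keys and shapes depend on the dataset
# the policy was trained on.
batch = {
    "observation.state": torch.zeros(1, 14),                # robot joint state
    "observation.images.top": torch.zeros(1, 3, 480, 640),  # camera frame
}
with torch.no_grad():
    action = policy.select_action(batch)
```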