FAST: Efficient Action Tokenization for Vision-Language-Action Models
- URL: http://arxiv.org/abs/2501.09747v1
- Date: Thu, 16 Jan 2025 18:57:04 GMT
- Title: FAST: Efficient Action Tokenization for Vision-Language-Action Models
- Authors: Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, Sergey Levine
- Abstract summary: We propose a new compression-based tokenization scheme for robot actions, based on the discrete cosine transform.
Based on FAST, we release FAST+, a universal robot action tokenizer, trained on 1M real robot action trajectories.
- Score: 98.15494168962563
- License:
- Abstract: Autoregressive sequence models, such as Transformer-based vision-language action (VLA) policies, can be tremendously effective for capturing complex and generalizable robotic behaviors. However, such models require us to choose a tokenization of our continuous action signals, which determines how the discrete symbols predicted by the model map to continuous robot actions. We find that current approaches for robot action tokenization, based on simple per-dimension, per-timestep binning schemes, typically perform poorly when learning dexterous skills from high-frequency robot data. To address this challenge, we propose a new compression-based tokenization scheme for robot actions, based on the discrete cosine transform. Our tokenization approach, Frequency-space Action Sequence Tokenization (FAST), enables us to train autoregressive VLAs for highly dexterous and high-frequency tasks where standard discretization methods fail completely. Based on FAST, we release FAST+, a universal robot action tokenizer, trained on 1M real robot action trajectories. It can be used as a black-box tokenizer for a wide range of robot action sequences, with diverse action spaces and control frequencies. Finally, we show that, when combined with the pi0 VLA, our method can scale to training on 10k hours of robot data and match the performance of diffusion VLAs, while reducing training time by up to 5x.
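The contrast the abstract draws between per-timestep binning and DCT-based compression can be illustrated with a small sketch. This is not the authors' implementation: the function names, the 256-bin baseline, the `scale` value, and the synthetic sine-wave action chunk are all assumptions chosen for illustration, and the sketch covers only a transform-and-round step for a single action dimension.

```python
import math

def dct2(x):
    """Orthonormal DCT-II of a 1-D sequence (pure-Python reference version)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        norm = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(norm * s)
    return out

def idct2(c):
    """Inverse of dct2 (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for t in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (t + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out

def bin_tokens(x, low=-1.0, high=1.0, n_bins=256):
    """Baseline: one token per timestep via uniform binning (assumed 256 bins)."""
    tokens = []
    for v in x:
        frac = (min(max(v, low), high) - low) / (high - low)
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens

def dct_tokens(x, scale=10.0):
    """Sketch: transform the chunk, then scale and round each coefficient."""
    return [round(scale * c) for c in dct2(x)]

def dct_detokens(tokens, scale=10.0):
    """Undo the scaling and invert the transform."""
    return idct2([q / scale for q in tokens])

# Hypothetical smooth action chunk: one dimension, 50 timesteps.
chunk = [0.5 * math.sin(2 * math.pi * t / 50) for t in range(50)]

binned = bin_tokens(chunk)        # always one token per timestep
coeffs = dct_tokens(chunk)        # many entries round to zero for smooth signals
nonzero = sum(1 for q in coeffs if q != 0)
recon = dct_detokens(coeffs)
max_err = max(abs(a - b) for a, b in zip(chunk, recon))

print(len(binned), nonzero, max_err)
```

For a smooth chunk like this one, binning always spends one token per timestep, while most of the rounded DCT coefficients are zero, so the sequence carries far fewer informative symbols. The `scale` parameter trades reconstruction error against the number of nonzero coefficients.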
Related papers
- RobotDiffuse: Motion Planning for Redundant Manipulator based on Diffusion Model [13.110235244912474]
Redundant manipulators offer enhanced kinematic performance and versatility.
Motion planning for these manipulators is challenging due to increased DOFs and complex, dynamic environments.
This paper introduces RobotDiffuse, a diffusion model-based approach for motion planning in redundant manipulators.
arXiv Detail & Related papers (2024-12-27T07:34:54Z)
- Coarse-to-fine Q-Network with Action Sequence for Data-Efficient Robot Learning [62.3886343725955]
We introduce Coarse-to-fine Q-Network with Action Sequence (CQN-AS), a novel value-based reinforcement learning algorithm.
We study our algorithm on 53 robotic tasks with sparse and dense rewards, as well as with and without demonstrations.
arXiv Detail & Related papers (2024-11-19T01:23:52Z)
- One-Step Diffusion Policy: Fast Visuomotor Policies via Diffusion Distillation [80.71541671907426]
One-Step Diffusion Policy (OneDP) is a novel approach that distills knowledge from pre-trained diffusion policies into a single-step action generator.
OneDP significantly accelerates response times for robotic control tasks.
arXiv Detail & Related papers (2024-10-28T17:54:31Z)
- Latent Action Pretraining from Videos [156.88613023078778]
We introduce Latent Action Pretraining for general Action models (LAPA).
LAPA is an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels.
We propose a method to learn from internet-scale videos that do not have robot action labels.
arXiv Detail & Related papers (2024-10-15T16:28:09Z)
- Autoregressive Action Sequence Learning for Robotic Manipulation [32.9580007141312]
Existing autoregressive architectures generate end-effector waypoints sequentially as word tokens in language modeling.
We extend causal transformers' single-token prediction to support predicting a variable number of tokens in a single step.
We propose the Autoregressive Policy architecture, which solves manipulation tasks by generating hybrid action sequences.
arXiv Detail & Related papers (2024-10-04T04:07:15Z)
- Affordance-based Robot Manipulation with Flow Matching [6.863932324631107]
We present a framework for assistive robot manipulation.
We tackle two challenges: first, efficiently adapting large-scale models to downstream scene affordance understanding tasks, and second, effectively learning robot action trajectories by grounding the visual affordance model.
We learn robot action trajectories guided by affordances in a supervised flow matching method.
arXiv Detail & Related papers (2024-09-02T09:11:28Z)
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations.
First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets.
We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z)
- Robot Learning with Sensorimotor Pre-training [98.7755895548928]
We present a self-supervised sensorimotor pre-training approach for robotics.
Our model, called RPT, is a Transformer that operates on sequences of sensorimotor tokens.
We find that sensorimotor pre-training consistently outperforms training from scratch, has favorable scaling properties, and enables transfer across different tasks, environments, and robots.
arXiv Detail & Related papers (2023-06-16T17:58:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.