Related papers: Autoregressive Action Sequence Learning for Robotic Manipulation

Autoregressive Action Sequence Learning for Robotic Manipulation

URL: http://arxiv.org/abs/2410.03132v3
Date: Mon, 18 Nov 2024 02:06:46 GMT
Title: Autoregressive Action Sequence Learning for Robotic Manipulation
Authors: Xinyu Zhang, Yuhan Liu, Haonan Chang, Liam Schramm, Abdeslam Boularias,
Abstract summary: Existing autoregressive architectures generate end-effector waypoints sequentially as word tokens in language modeling. We extend causal transformers' single-token prediction to support predicting a variable number of tokens in a single step. We propose the Autoregressive Policy architecture, which solves manipulation tasks by generating hybrid action sequences.
Score: 32.9580007141312
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Designing a universal policy architecture that performs well across diverse robots and task configurations remains a key challenge. In this work, we address this by representing robot actions as sequential data and generating actions through autoregressive sequence modeling. Existing autoregressive architectures generate end-effector waypoints sequentially as word tokens in language modeling, which are limited to low-frequency control tasks. Unlike language, robot actions are heterogeneous and often include continuous values -- such as joint positions, 2D pixel coordinates, and end-effector poses -- which are not easily suited for language-based modeling. Based on this insight, we introduce a straightforward enhancement: we extend causal transformers' single-token prediction to support predicting a variable number of tokens in a single step through our Chunking Causal Transformer (CCT). This enhancement enables robust performance across diverse tasks of various control frequencies, greater efficiency by having fewer autoregression steps, and lead to a hybrid action sequence design by mixing different types of actions and using a different chunk size for each action type. Based on CCT, we propose the Autoregressive Policy (ARP) architecture, which solves manipulation tasks by generating hybrid action sequences. We evaluate ARP across diverse robotic manipulation environments, including Push-T, ALOHA, and RLBench, and show that ARP, as a universal architecture, outperforms the environment-specific state-of-the-art in all tested benchmarks, while being more efficient in computation and parameter sizes. Videos of our real robot demonstrations, all source code and the pretrained models of ARP can be found at http://github.com/mlzxy/arp.

Related papers

Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [56.424032454461695]
We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences. Dita employs in-context conditioning -- enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations. Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces.
arXiv Detail & Related papers (2025-03-25T15:19:56Z)
FAST: Efficient Action Tokenization for Vision-Language-Action Models [98.15494168962563]
We propose a new compression-based tokenization scheme for robot actions, based on the discrete cosine transform. Based on FAST, we release FAST+, a universal robot action tokenizer, trained on 1M real robot action trajectories.
arXiv Detail & Related papers (2025-01-16T18:57:04Z)
Coarse-to-fine Q-Network with Action Sequence for Data-Efficient Robot Learning [62.3886343725955]
We introduce Coarse-to-fine Q-Network with Action Sequence (CQN-AS), a novel value-based reinforcement learning algorithm. We study our algorithm on 53 robotic tasks with sparse and dense rewards, as well as with and without demonstrations.
arXiv Detail & Related papers (2024-11-19T01:23:52Z)
Diffusion Transformer Policy [48.50988753948537]
Diffusion Transformer Policy pretrained on diverse robot data can generalize to different embodiments. The proposed approach achieves state-of-the-art performance with only a single third-view camera stream in the Calvin novel task setting.
arXiv Detail & Related papers (2024-10-21T12:43:54Z)
PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation [68.17081518640934]
We propose a PrIrmitive-driVen waypOinT-aware world model for Robotic manipulation (PIVOT-R) PIVOT-R consists of a Waypoint-aware World Model (WAWM) and a lightweight action prediction module. Our PIVOT-R outperforms state-of-the-art open-source models on the SeaWave benchmark, achieving an average relative improvement of 19.45% across four levels of instruction tasks.
arXiv Detail & Related papers (2024-10-14T11:30:18Z)
Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers [41.069074375686164]
We propose Heterogeneous Pre-trained Transformers (HPT), which pre-train a trunk of a policy neural network to learn a task and embodiment shared representation. We conduct experiments to investigate the scaling behaviors of training objectives, to the extent of 52 datasets. HPTs outperform several baselines and enhance the fine-tuned policy performance by over 20% on unseen tasks.
arXiv Detail & Related papers (2024-09-30T17:39:41Z)
ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation [31.211870350260703]
Keypoint Constraints (ReKep) is a visually-grounded representation for constraints in robotic manipulation. ReKep is expressed as Python functions mapping a set of 3D keypoints the environment to a numerical cost. We present system implementations on a wheeled single-arm platform and a stationary dual-arm platform.
arXiv Detail & Related papers (2024-09-03T06:45:22Z)
IRASim: A Fine-Grained World Model for Robot Manipulation [24.591694756757278]
We present IRASim, a novel world model capable of generating videos with fine-grained robot-object interaction details.<n>We train a diffusion transformer and introduce a novel frame-level action-conditioning module within each transformer block to explicitly model and strengthen the action-frame alignment.
arXiv Detail & Related papers (2024-06-20T17:50:16Z)
SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation [62.58480650443393]
Segment Anything (SAM) is a vision-foundation model for generalizable scene understanding and sequence imitation. We develop a novel multi-channel heatmap that enables the prediction of the action sequence in a single pass.
arXiv Detail & Related papers (2024-05-30T00:32:51Z)
RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation [77.41969287400977]
This paper presents textbfRobotScript, a platform for a deployable robot manipulation pipeline powered by code generation. We also present a benchmark for a code generation benchmark for robot manipulation tasks in free-form natural language. We demonstrate the adaptability of our code generation framework across multiple robot embodiments, including the Franka and UR5 robot arms.
arXiv Detail & Related papers (2024-02-22T15:12:00Z)
Robot Learning with Sensorimotor Pre-training [98.7755895548928]
We present a self-supervised sensorimotor pre-training approach for robotics. Our model, called RPT, is a Transformer that operates on sequences of sensorimotor tokens. We find that sensorimotor pre-training consistently outperforms training from scratch, has favorable scaling properties, and enables transfer across different tasks, environments, and robots.
arXiv Detail & Related papers (2023-06-16T17:58:10Z)
PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pre-Training [25.50131893785007]
This work introduces a paradigm for pre-training a general purpose representation that can serve as a starting point for multiple tasks on a given robot. We present the Perception-Action Causal Transformer (PACT), a generative transformer-based architecture that aims to build representations directly from robot data in a self-supervised fashion. We show that finetuning small task-specific networks on top of the larger pretrained model results in significantly better performance compared to training a single model from scratch for all tasks simultaneously.
arXiv Detail & Related papers (2022-09-22T16:20:17Z)
OSCAR: Data-Driven Operational Space Control for Adaptive and Robust Robot Manipulation [50.59541802645156]
Operational Space Control (OSC) has been used as an effective task-space controller for manipulation. We propose OSC for Adaptation and Robustness (OSCAR), a data-driven variant of OSC that compensates for modeling errors. We evaluate our method on a variety of simulated manipulation problems, and find substantial improvements over an array of controller baselines.
arXiv Detail & Related papers (2021-10-02T01:21:38Z)
Deep Imitation Learning for Bimanual Robotic Manipulation [70.56142804957187]
We present a deep imitation learning framework for robotic bimanual manipulation. A core challenge is to generalize the manipulation skills to objects in different locations. We propose to (i) decompose the multi-modal dynamics into elemental movement primitives, (ii) parameterize each primitive using a recurrent graph neural network to capture interactions, and (iii) integrate a high-level planner that composes primitives sequentially and a low-level controller to combine primitive dynamics and inverse kinematics control.
arXiv Detail & Related papers (2020-10-11T01:40:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.