Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation
- URL: http://arxiv.org/abs/2506.09990v1
- Date: Wed, 11 Jun 2025 17:59:13 GMT
- Title: Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation
- Authors: Wenbo Zhang, Tianrun Hu, Yanyuan Qiao, Hanbo Zhang, Yuchu Qin, Yang Li, Jiajun Liu, Tao Kong, Lingqiao Liu, Xiao Ma,
- Abstract summary: Chain-of-Action (CoA) is a visuo-motor policy paradigm built upon Trajectory Autoregressive Modeling. CoA generates an entire trajectory through explicit backward reasoning with task-specific goals. CoA achieves state-of-the-art performance across 60 RLBench tasks and 8 real-world manipulation tasks.
- Score: 37.748111048944274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Chain-of-Action (CoA), a novel visuo-motor policy paradigm built upon Trajectory Autoregressive Modeling. Unlike conventional approaches that predict the next action(s) in a forward manner, CoA generates an entire trajectory by explicit backward reasoning with task-specific goals through an action-level Chain-of-Thought (CoT) process. This process is unified within a single autoregressive structure: (1) the first token corresponds to a stable keyframe action that encodes the task-specific goals; and (2) subsequent action tokens are generated autoregressively, conditioned on the initial keyframe and previously predicted actions. This backward action reasoning enforces a global-to-local structure, allowing each local action to be tightly constrained by the final goal. To further realize the action reasoning structure, CoA incorporates four complementary designs: continuous action token representation; dynamic stopping for variable-length trajectory generation; reverse temporal ensemble; and multi-token prediction to balance action chunk modeling with global structure. As a result, CoA achieves strong spatial generalization while preserving the flexibility and simplicity of a visuo-motor policy. Empirically, CoA achieves state-of-the-art performance across 60 RLBench tasks and 8 real-world manipulation tasks.
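To make the keyframe-first generation scheme described in the abstract concrete, below is a minimal, illustrative sketch of backward autoregressive trajectory generation with dynamic stopping. All names (`CoASketch`, `keyframe_head`, `step_head`, `stop_head`), network choices, dimensions, and the stopping threshold are assumptions for illustration only and do not reflect the authors' implementation.

```python
# Minimal sketch of CoA-style backward autoregressive trajectory generation.
# Everything here (architecture, dimensions, stopping rule) is an assumption,
# not the authors' released code.
import torch
import torch.nn as nn


class CoASketch(nn.Module):
    """Toy visuo-motor policy: predict a goal keyframe first, then reason backward."""

    def __init__(self, obs_dim: int = 64, act_dim: int = 7, hidden: int = 128):
        super().__init__()
        # Predicts the first token: a stable keyframe action encoding the task goal.
        self.keyframe_head = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, act_dim)
        )
        # Predicts the next (earlier-in-time) continuous action token,
        # conditioned on the observation, the keyframe, and the previous token.
        self.step_head = nn.Sequential(
            nn.Linear(obs_dim + 2 * act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )
        # Scalar stop score for dynamic, variable-length trajectory generation.
        self.stop_head = nn.Linear(obs_dim + 2 * act_dim, 1)

    @torch.no_grad()
    def generate(self, obs: torch.Tensor, max_len: int = 50) -> torch.Tensor:
        keyframe = self.keyframe_head(obs)            # first token = goal keyframe
        tokens = [keyframe]
        prev = keyframe
        for _ in range(max_len - 1):
            ctx = torch.cat([obs, keyframe, prev], dim=-1)
            nxt = self.step_head(ctx)                 # continuous action token
            tokens.append(nxt)
            prev = nxt
            if torch.sigmoid(self.stop_head(ctx)).item() > 0.9:
                break                                 # dynamic stopping
        # Generation runs from the goal keyframe backward toward the current state;
        # reversing it into execution order is an assumption of this sketch.
        return torch.stack(tokens[::-1], dim=0)


if __name__ == "__main__":
    policy = CoASketch()
    trajectory = policy.generate(torch.randn(64))
    print(trajectory.shape)  # (T, 7): variable-length end-effector trajectory
```

The design point the sketch tries to capture is the global-to-local constraint: every intermediate action token is conditioned on the keyframe (the goal), so local steps cannot drift away from the final objective.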
Related papers
- ActionSink: Toward Precise Robot Manipulation with Dynamic Integration of Action Flow [93.00917887667234]
This paper introduces a novel robot manipulation framework, ActionSink, to pave the way toward precise action estimation. As the name suggests, ActionSink reformulates robot actions as action-caused optical flows extracted from videos, called "action flow". The framework outperforms the prior SOTA on the LIBERO benchmark by a 7.9% success rate and obtains nearly an 8% accuracy gain on the challenging long-horizon visual task LIBERO-Long.
arXiv Detail & Related papers (2025-08-05T08:46:17Z) - CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation [67.1520483301709]
CronusVLA is a unified framework that extends single-frame VLA models to the multi-frame paradigm through an efficient post-training stage. CronusVLA achieves state-of-the-art performance on SimplerEnv with a 70.9% success rate and a 12.7% improvement over OpenVLA on LIBERO.
arXiv Detail & Related papers (2025-06-24T17:30:27Z) - ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer [58.49950218437718]
We present ReCoM, an efficient framework for generating high-fidelity and generalizable human body motions synchronized with speech. The core innovation lies in the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) core architecture. To enhance model robustness, we incorporate the proposed DER strategy, which equips the model with dual capabilities of noise resistance and cross-domain generalization.
arXiv Detail & Related papers (2025-03-27T16:39:40Z) - Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [56.424032454461695]
We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences. Dita employs in-context conditioning, enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations. Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces.
arXiv Detail & Related papers (2025-03-25T15:19:56Z) - Generating Multimodal Driving Scenes via Next-Scene Prediction [24.84840824118813]
Generative models in Autonomous Driving (AD) enable diverse scene creation, yet existing methods fall short by capturing only a limited range of modalities. We introduce a multimodal generation framework that incorporates four major data modalities, including the novel addition of a map modality. Our framework effectively generates complex, realistic driving scenes over extended sequences, ensuring multimodal consistency and offering fine-grained control over scene elements.
arXiv Detail & Related papers (2025-03-19T07:20:16Z) - HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model [54.64088247291416]
A fundamental objective of manipulation policy design is to endow robots with the ability to comprehend human instructions, reason about scene cues, and execute generalized actions in dynamic environments. Recent autoregressive vision-language-action (VLA) methods inherit common-sense reasoning capabilities from vision-language models (VLMs) for next action-token prediction. We introduce HybridVLA, a unified framework that absorbs the continuous nature of diffusion-based actions and the contextual reasoning of autoregression.
arXiv Detail & Related papers (2025-03-13T17:59:52Z) - CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction [28.761494362934087]
Coarse-to-Fine AutoRegressive Policy (CARP) is a novel paradigm for visuomotor policy learning. It redefines the autoregressive action generation process as a coarse-to-fine, next-scale approach. CARP achieves competitive success rates, with up to a 10% improvement, and delivers 10x faster inference compared to state-of-the-art policies.
arXiv Detail & Related papers (2024-12-09T18:59:18Z) - Diffusion Transformer Policy [48.50988753948537]
We propose a large multi-modal diffusion transformer, dubbed Diffusion Transformer Policy, to model continuous end-effector actions. By leveraging the scaling capability of transformers, the proposed approach can effectively model continuous end-effector actions across large, diverse robot datasets.
arXiv Detail & Related papers (2024-10-21T12:43:54Z) - FCA-RAC: First Cycle Annotated Repetitive Action Counting [30.253568218869237]
We propose a framework called First Cycle Annotated Repetitive Action Counting (FCA-RAC)
FCA-RAC contains 4 parts: 1) a labeling technique that annotates each training video with the start and end of the first action cycle, along with the total action count.
This technique enables the model to capture the correlation between the initial action cycle and subsequent actions.
arXiv Detail & Related papers (2024-06-18T01:12:43Z) - POTLoc: Pseudo-Label Oriented Transformer for Point-Supervised Temporal Action Localization [26.506893363676678]
This paper proposes POTLoc, a Pseudo-label Oriented Transformer for weakly-supervised action localization.
POTLoc is designed to identify and track continuous action structures via a self-training strategy.
It outperforms the state-of-the-art point-supervised methods on THUMOS'14 and ActivityNet-v1.2 datasets.
arXiv Detail & Related papers (2023-10-20T15:28:06Z) - Cross-modal Consensus Network for Weakly Supervised Temporal Action
Localization [74.34699679568818]
Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to localize action instances in the given video with video-level categorical supervision.
We propose a cross-modal consensus network (CO2-Net) to tackle this problem.
arXiv Detail & Related papers (2021-07-27T04:21:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.