HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model
- URL: http://arxiv.org/abs/2503.10631v2
- Date: Mon, 17 Mar 2025 08:44:28 GMT
- Title: HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model
- Authors: Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, Chengkai Hou, Mengdi Zhao, KC alex Zhou, Pheng-Ann Heng, Shanghang Zhang
- Abstract summary: We introduce HybridVLA, a unified framework that seamlessly integrates autoregressive and diffusion policies within a single large language model. With this collaborative training recipe, we find that these two forms of action prediction not only reinforce each other but also exhibit varying performance across different tasks. In experiments, HybridVLA outperforms previous state-of-the-art VLA methods across various simulation and real-world tasks.
- Score: 54.64088247291416
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in vision-language models (VLMs) for common-sense reasoning have led to the development of vision-language-action (VLA) models, enabling robots to perform generalized manipulation. Although existing autoregressive VLA methods leverage large-scale pretrained knowledge, they disrupt the continuity of actions. Meanwhile, some VLA methods incorporate an additional diffusion head to predict continuous actions, relying solely on VLM-extracted features, which limits their reasoning capabilities. In this paper, we introduce HybridVLA, a unified framework that seamlessly integrates the strengths of both autoregressive and diffusion policies within a single large language model, rather than simply connecting them. To bridge the generation gap, a collaborative training recipe is proposed that injects the diffusion modeling directly into the next-token prediction. With this recipe, we find that these two forms of action prediction not only reinforce each other but also exhibit varying performance across different tasks. Therefore, we design a collaborative action ensemble mechanism that adaptively fuses these two predictions, leading to more robust control. In experiments, HybridVLA outperforms previous state-of-the-art VLA methods across various simulation and real-world tasks, including both single-arm and dual-arm robots, while demonstrating stable manipulation in previously unseen configurations.
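To make the two mechanisms in the abstract concrete, here is a minimal, self-contained PyTorch sketch of the general pattern: one shared backbone trained with both a next-token (cross-entropy) objective over discretized actions and a diffusion-style denoising (MSE) objective over continuous actions, with the two resulting predictions fused at inference time. The module sizes, the linear noise schedule, and the confidence-weighted fusion rule are illustrative assumptions, not the paper's actual training recipe or collaborative ensemble mechanism.
```python
# Minimal conceptual sketch (not the authors' code) of joint autoregressive +
# diffusion action prediction with a shared backbone, plus a simple fused
# inference rule. Sizes, noise schedule, and fusion heuristic are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

ACTION_DIM = 7        # e.g. 6-DoF end-effector pose + gripper (assumed)
NUM_BINS = 256        # discretization used by the autoregressive branch (assumed)
HIDDEN = 512          # stand-in for the VLM/LLM hidden size


class HybridActionModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in for the shared vision-language backbone (the paper uses an LLM).
        self.backbone = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        # Autoregressive branch: one categorical distribution per action dimension.
        self.ar_head = nn.Linear(HIDDEN, ACTION_DIM * NUM_BINS)
        # Diffusion branch: predicts the noise added to a continuous action.
        self.denoise_head = nn.Sequential(
            nn.Linear(HIDDEN + ACTION_DIM + 1, HIDDEN), nn.SiLU(),
            nn.Linear(HIDDEN, ACTION_DIM),
        )

    def forward(self, obs_feats, noisy_action, t):
        # obs_feats: (B, T, HIDDEN) pre-embedded vision-language features.
        feat = self.backbone(obs_feats)[0][:, -1]                  # (B, HIDDEN)
        ar_logits = self.ar_head(feat).view(-1, ACTION_DIM, NUM_BINS)
        noise_pred = self.denoise_head(
            torch.cat([feat, noisy_action, t[:, None]], dim=-1))
        return ar_logits, noise_pred


def collaborative_loss(model, obs_feats, action):
    """Joint objective: token cross-entropy + denoising MSE on the same batch."""
    B = action.shape[0]
    # Discretize the continuous action in [-1, 1] for the autoregressive branch.
    bins = ((action.clamp(-1, 1) + 1) / 2 * (NUM_BINS - 1)).round().long()
    # Corrupt the action for the diffusion branch (simple linear schedule).
    t = torch.rand(B)
    noise = torch.randn_like(action)
    noisy = (1 - t)[:, None] * action + t[:, None] * noise
    ar_logits, noise_pred = model(obs_feats, noisy, t)
    ce = F.cross_entropy(ar_logits.reshape(-1, NUM_BINS), bins.reshape(-1))
    mse = F.mse_loss(noise_pred, noise)
    return ce + mse


@torch.no_grad()
def fused_action(model, obs_feats, steps=10):
    """Run both branches and fuse them (confidence weighting is a placeholder)."""
    B = obs_feats.shape[0]
    # Diffusion branch: iteratively denoise from Gaussian noise.
    a = torch.randn(B, ACTION_DIM)
    for k in range(steps, 0, -1):
        t = torch.full((B,), k / steps)
        _, noise_pred = model(obs_feats, a, t)
        a = a - noise_pred / steps            # crude Euler-style update
    # Autoregressive branch: decode discrete bins back to continuous values.
    ar_logits, _ = model(obs_feats, a, torch.zeros(B))
    probs = ar_logits.softmax(-1)
    a_ar = probs.argmax(-1).float() / (NUM_BINS - 1) * 2 - 1
    # Adaptive fusion: weight the AR estimate by its mean token confidence.
    conf = probs.max(-1).values.mean(-1, keepdim=True)
    return conf * a_ar + (1 - conf) * a
```
During training one would minimize `collaborative_loss(model, obs_feats, action)` and at control time query `fused_action(model, obs_feats)`; HybridVLA itself produces both predictions from a single large language model and fuses them adaptively across tasks, which the placeholder confidence weighting above only approximates.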
Related papers
- CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [89.44024245194315]
We introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs).
We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens.
Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks.
arXiv Detail & Related papers (2025-03-27T22:23:04Z)
- Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [56.424032454461695]
We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences.
Dita employs in-context conditioning, enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations.
Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces.
arXiv Detail & Related papers (2025-03-25T15:19:56Z)
- Enhanced Continual Learning of Vision-Language Models with Model Fusion [16.764069327701186]
Vision-Language Models (VLMs) represent a breakthrough in artificial intelligence.
VLMs are susceptible to catastrophic forgetting when sequentially fine-tuned on multiple downstream tasks.
We propose Continual Decoupling-Unifying (ConDU), a novel approach that introduces model fusion into continual learning.
arXiv Detail & Related papers (2025-03-12T15:48:13Z)
- Accelerating Vision-Language-Action Model Integrated with Action Chunking via Parallel Decoding [24.1236728596359]
Vision-Language-Action (VLA) models demonstrate remarkable potential for generalizable robotic manipulation.
We propose PD-VLA, the first parallel decoding framework for VLA models integrated with action chunking.
Our framework reformulates autoregressive decoding as a nonlinear system solved by parallel fixed-point iterations (see the decoding sketch after this list).
arXiv Detail & Related papers (2025-03-04T06:12:08Z)
- Doubly-Universal Adversarial Perturbations: Deceiving Vision-Language Models Across Both Images and Text with a Single Perturbation [15.883062174902093]
Large Vision-Language Models (VLMs) have demonstrated remarkable performance across multimodal tasks by integrating vision encoders with large language models (LLMs).
We introduce a novel UAP specifically designed for VLMs: the Doubly-Universal Adversarial Perturbation (Doubly-UAP).
arXiv Detail & Related papers (2024-12-11T05:23:34Z)
- ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer [95.80384464922147]
Continuous visual generation requires the full-sequence diffusion-based approach.
We present ACDiT, an Autoregressive blockwise Conditional Diffusion Transformer.
We demonstrate that ACDiT can be seamlessly used in visual understanding tasks despite being trained on the diffusion objective.
arXiv Detail & Related papers (2024-12-10T18:13:20Z)
- Vision-Language-Action Model and Diffusion Policy Switching Enables Dexterous Control of an Anthropomorphic Hand [2.7036595757881323]
We propose a hybrid control method that combines the relative advantages of a fine-tuned Vision-Language-Action model and diffusion models.
We demonstrate that this model-switching approach results in an over 80% success rate, compared to under 40% when using only a VLA model.
arXiv Detail & Related papers (2024-10-17T20:49:45Z)
- Feedback-based Modal Mutual Search for Attacking Vision-Language Pre-training Models [8.943713711458633]
We propose a new attack paradigm called Feedback-based Modal Mutual Search (FMMS).
FMMS aims to push away the matched image-text pairs while randomly drawing mismatched pairs closer in feature space.
This is the first work to exploit target model feedback to explore multi-modality adversarial boundaries.
arXiv Detail & Related papers (2024-08-27T02:31:39Z)
- InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion [53.90516061351706]
We present InterHandGen, a novel framework that learns the generative prior of two-hand interaction.
For sampling, we combine anti-penetration and synthesis-free guidance to enable plausible generation.
Our method significantly outperforms baseline generative models in terms of plausibility and diversity.
arXiv Detail & Related papers (2024-03-26T06:35:55Z)
- Variance-Preserving-Based Interpolation Diffusion Models for Speech Enhancement [53.2171981279647]
We present a framework that encapsulates both the variance-preserving (VP) and variance-exploding (VE) based diffusion methods.
To improve performance and ease model training, we analyze the common difficulties encountered in diffusion models.
We evaluate our model against several methods using a public benchmark to showcase the effectiveness of our approach.
arXiv Detail & Related papers (2023-06-14T14:22:22Z)
- David helps Goliath: Inference-Time Collaboration Between Small Specialized and Large General Diffusion LMs [49.822063966687175]
Diffusion-based language models are emerging as a promising alternative to autoregressive LMs.
We propose methods to scale a recently proposed diffusion model, SSD-LM, from 0.4B to 13B parameters.
We show that SSD-2 facilitates novel ensembles with 100x smaller models that can be customized and deployed by individual users.
arXiv Detail & Related papers (2023-05-24T06:22:14Z)
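The PD-VLA entry above reformulates autoregressive decoding as a nonlinear system solved by parallel fixed-point iterations. The sketch below illustrates the generic Jacobi-style version of that idea for a fixed-length action-token chunk; the causal `model` interface, the greedy argmax update, and the random initialization are assumptions made for illustration, not PD-VLA's implementation.
```python
# Generic Jacobi-style parallel decoding of a fixed-length token chunk.
# `model` is assumed to be a causal decoder returning (B, T, vocab) logits,
# where the logits at position i score the token at position i + 1.
import torch


@torch.no_grad()
def jacobi_decode_chunk(model, context, chunk_len, vocab_size, max_iters=32):
    """Decode `chunk_len` tokens at once via parallel fixed-point iteration.

    Greedy sequential decoding defines y_k = argmax p(. | context, y_<k).
    Viewing the whole chunk y as one unknown vector, this is a fixed-point
    equation y = F(y); a Jacobi iteration refreshes every position from the
    previous guess in a single forward pass and stops when nothing changes.
    """
    B, ctx_len = context.shape
    y = torch.randint(vocab_size, (B, chunk_len))       # arbitrary initial guess
    for _ in range(max_iters):
        logits = model(torch.cat([context, y], dim=1))  # one causal forward pass
        # Position k of the chunk is re-predicted from context plus the
        # previous guess of y_<k, so all positions update simultaneously.
        y_new = logits[:, ctx_len - 1 : ctx_len - 1 + chunk_len].argmax(-1)
        if torch.equal(y_new, y):                       # fixed point reached
            break
        y = y_new
    return y
```
Once a prefix of the chunk matches the sequential greedy result it never changes again, so the loop converges in at most `chunk_len` passes and often far fewer, which is where a parallel-decoding speedup over token-by-token generation can come from.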