Hybrid Training for Vision-Language-Action Models
- URL: http://arxiv.org/abs/2510.00600v1
- Date: Wed, 01 Oct 2025 07:27:15 GMT
- Title: Hybrid Training for Vision-Language-Action Models
- Authors: Pietro Mazzaglia, Cansu Sancaktar, Markus Peschl, Daniel Dijkman,
- Abstract summary: We propose a framework that enables Vision-Language-Action models to learn from thoughts and benefit from associated performance gains. By learning to conditionally predict a diverse set of outputs, HyT supports flexibility at inference time, enabling the model to either predict actions directly, generate thoughts or follow instructions.
- Score: 10.451756630631609
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Using Large Language Models to produce intermediate thoughts, a.k.a. Chain-of-thought (CoT), before providing an answer has been a successful recipe for solving complex language tasks. In robotics, similar embodied CoT strategies, generating thoughts before actions, have also been shown to lead to improved performance when using Vision-Language-Action models (VLAs). As these techniques increase the length of the model's generated outputs to include the thoughts, the inference time is negatively affected. Delaying an agent's actions in real-world executions, as in robotic manipulation settings, strongly affects the usability of a method, as tasks require long sequences of actions. However, is the generation of long chains-of-thought a strong prerequisite for achieving performance improvements? In this work, we explore the idea of Hybrid Training (HyT), a framework that enables VLAs to learn from thoughts and benefit from the associated performance gains, while enabling the possibility to leave out CoT generation during inference. Furthermore, by learning to conditionally predict a diverse set of outputs, HyT supports flexibility at inference time, enabling the model to either predict actions directly, generate thoughts or follow instructions. We evaluate the proposed method in a series of simulated benchmarks and real-world experiments.
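The abstract describes HyT as learning to conditionally predict several kinds of output from the same demonstrations. The sketch below is one possible way to picture that idea and is not the authors' implementation: the mode tags (`<act>`, `<think>`, `<follow>`), the `Demo` fields, the uniform sampling weights, and the reading of "follow instructions" as "take a supplied thought as input" are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Demo:
    """One demonstration step (hypothetical fields, not from the paper)."""
    instruction: str
    thought: str
    action: str


# Hypothetical conditioning modes; the abstract does not say how they are encoded.
MODES = ("act", "think", "follow")


def build_example(demo: Demo, mode: str) -> tuple[str, str]:
    """Return a (prompt, target) pair for one conditioning mode."""
    if mode == "act":
        # Predict the action directly; no thought is generated.
        return f"<act> {demo.instruction}", demo.action
    if mode == "think":
        # Generate the thought first, then the action.
        return f"<think> {demo.instruction}", f"{demo.thought} {demo.action}"
    if mode == "follow":
        # The thought is supplied as input; only the action is predicted.
        return f"<follow> {demo.instruction} {demo.thought}", demo.action
    raise ValueError(f"unknown mode: {mode}")


def sample_batch(demos, rng, weights=(1.0, 1.0, 1.0)):
    """Assign each demonstration a randomly sampled mode (assumed mixing scheme)."""
    modes = rng.choices(MODES, weights=weights, k=len(demos))
    return [build_example(d, m) for d, m in zip(demos, modes)]


if __name__ == "__main__":
    demo = Demo("pick up the cup", "The cup is left of the plate.", "move_left grasp")
    for prompt, target in sample_batch([demo] * 3, random.Random(0)):
        print(prompt, "->", target)
```

Under these assumptions, a model trained on such a mixture could be prompted with the `<act>` tag at inference time to skip thought generation, which is the latency benefit the abstract points to.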
Related papers
- Synergizing Understanding and Generation with Interleaved Analyzing-Drafting Thinking [154.2388970262703]
Unified Vision-Language Models (UVLMs) aim to advance multimodal learning by supporting both understanding and generation within a single framework. We introduce the interleaved Analyzing-Drafting problem-solving loop (AD-Loop), a new think paradigm that alternates between analytic and drafting operations. By interleaving textual thoughts with visual thoughts, AD-Loop enables models to iteratively refine both comprehension and outputs, fostering genuine synergy.
arXiv Detail & Related papers (2026-02-24T23:26:09Z)
- Generative Actor Critic [74.04971271003869]
Generative Actor Critic (GAC) is a novel framework that decouples sequential decision-making by reframing policy evaluation as learning a generative model of the joint distribution over trajectories and returns. Experiments on Gym-MuJoCo and Maze2D benchmarks demonstrate GAC's strong offline performance and significantly enhanced offline-to-online improvement compared to state-of-the-art methods.
arXiv Detail & Related papers (2025-12-25T06:31:11Z)
- Do What You Say: Steering Vision-Language-Action Models via Runtime Reasoning-Action Alignment Verification [17.948161564138033]
Reasoning Vision Language Action (VLA) models improve robotic instruction-following by generating step-by-step textual plans before low-level actions. But even with a correct textual plan, the generated actions can still miss the intended outcomes in the plan, especially in out-of-distribution scenarios. We formalize this phenomenon as a lack of embodied CoT faithfulness, and introduce a training-free, runtime policy steering method for reasoning-action alignment.
arXiv Detail & Related papers (2025-10-18T00:38:45Z)
- FOSSIL: Harnessing Feedback on Suboptimal Samples for Data-Efficient Generalisation with Imitation Learning for Embodied Vision-and-Language Tasks [45.65159253753118]
This work explores how agents trained with imitation learning can learn robust representations from both optimal and suboptimal demonstrations. We provide language feedback embeddings as part of the input sequence into a Transformer-based policy. We test our approach on a range of embodied Vision-and-Language tasks in our custom BabyAI-XGen environment.
arXiv Detail & Related papers (2025-10-13T11:55:21Z)
- FastGRPO: Accelerating Policy Optimization via Concurrency-aware Speculative Decoding and Online Draft Learning [11.68914161151634]
Group relative policy optimization (GRPO) has demonstrated significant potential in improving the reasoning capabilities of large language models. We propose a speculative decoding framework that adjusts the drafting and verification strategy according to real-time concurrency levels. We show that the proposed method achieves end-to-end speedups of 2.35x to 2.72x, significantly surpassing baseline approaches in efficiency.
arXiv Detail & Related papers (2025-09-26T02:48:41Z)
- ProxyThinker: Test-Time Guidance through Small Visual Reasoners [23.443166029380027]
We propose ProxyThinker, an inference-time technique that enables large models to inherit the visual reasoning capabilities from small, slow-thinking visual reasoners without any training. By subtracting the output of base models from those of RFT reasoners, ProxyThinker elicits the slow-thinking reasoning demonstrated by the emerged behaviors such as self-verification and self-correction. Our implementation efficiently coordinates multiple language models with parallelism techniques and achieves up to 38× faster inference compared to previous decoding-time methods.
arXiv Detail & Related papers (2025-05-30T17:59:43Z)
- CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [89.44024245194315]
We introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs). We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks.
arXiv Detail & Related papers (2025-03-27T22:23:04Z)
- Exploring Training and Inference Scaling Laws in Generative Retrieval [50.82554729023865]
Generative retrieval reformulates retrieval as an autoregressive generation task, where large language models generate target documents directly from a query. We systematically investigate training and inference scaling laws in generative retrieval, exploring how model size, training data scale, and inference-time compute jointly influence performance.
arXiv Detail & Related papers (2025-03-24T17:59:03Z)
- Warmup Generations: A Task-Agnostic Approach for Guiding Sequence-to-Sequence Learning with Unsupervised Initial State Generation [34.55224347308013]
Traditional supervised fine-tuning (SFT) strategies for sequence-to-sequence tasks often train models to directly generate the target output. We introduce a task-agnostic framework that enables models to generate intermediate "warmup" sequences. We show that our approach outperforms traditional SFT methods, and offers a scalable and flexible solution for sequence-to-sequence tasks.
arXiv Detail & Related papers (2025-02-17T20:23:42Z)
- Fine-Tuning on Diverse Reasoning Chains Drives Within-Inference CoT Refinement in LLMs [63.36637269634553]
We introduce a novel approach where LLMs are fine-tuned to generate a sequence of Diverse Chains of Thought (DCoT) within a single inference step. We show that fine-tuning on DCoT improves performance over the CoT baseline across model families and scales. Our work is also significant because both quantitative analyses and manual evaluations reveal the observed gains stem from the models' ability to refine an initial reasoning chain.
arXiv Detail & Related papers (2024-07-03T15:01:18Z)
- DETAIL: Task DEmonsTration Attribution for Interpretable In-context Learning [75.68193159293425]
In-context learning (ICL) allows transformer-based language models to learn a specific task with a few "task demonstrations" without updating their parameters. We propose an influence function-based attribution technique, DETAIL, that addresses the specific characteristics of ICL. We experimentally prove the wide applicability of DETAIL by showing our attribution scores obtained on white-box models are transferable to black-box models in improving model performance.
arXiv Detail & Related papers (2024-05-22T15:52:52Z)
- Few-shot Subgoal Planning with Language Models [58.11102061150875]
We show that language priors encoded in pre-trained language models allow us to infer fine-grained subgoal sequences. In contrast to recent methods which make strong assumptions about subgoal supervision, our experiments show that language models can infer detailed subgoal sequences without any fine-tuning.
arXiv Detail & Related papers (2022-05-28T01:03:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.