Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
- URL: http://arxiv.org/abs/2601.09708v1
- Date: Wed, 14 Jan 2026 18:59:59 GMT
- Title: Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
- Authors: Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz, Yu-Chiang Frank Wang, Fu-En Yang
- Abstract summary: We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3% reduced inference latency.
- Score: 97.29507133345766
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided trajectory-alignment objective that transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3% reduced inference latency over state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.
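The abstract describes two coupled training signals: distilling the teacher's explicit CoT into a few latent tokens, and a preference term that aligns the resulting manipulation trajectories. The paper does not give the objective in closed form; below is a minimal PyTorch-style sketch of how such a combined loss could look, where the mean-pooling, the Bradley-Terry preference form, and the weighting are all assumptions for illustration.

```python
# Hypothetical sketch of Fast-ThinkAct-style training: compress a long
# explicit teacher CoT into K latent tokens and add a preference term
# over manipulation trajectories. Shapes, pooling, and the weighting
# are assumptions, not details taken from the paper.
import torch.nn.functional as F

def latent_distill_loss(student_latents, teacher_hidden):
    """Align the pooled student latent plan with the pooled teacher CoT."""
    # student_latents: (B, K, D) with K much smaller than the CoT length T
    # teacher_hidden:  (B, T, D) hidden states over the explicit teacher CoT
    s = student_latents.mean(dim=1)           # (B, D) compact plan summary
    t = teacher_hidden.mean(dim=1).detach()   # (B, D) teacher stays frozen
    return 1.0 - F.cosine_similarity(s, t, dim=-1).mean()

def trajectory_preference_loss(score_preferred, score_rejected):
    """Bradley-Terry term: plans whose rollouts better track the reference
    trajectory should score higher than plans that drift from it."""
    return -F.logsigmoid(score_preferred - score_rejected).mean()

def total_loss(student_latents, teacher_hidden, s_pref, s_rej, beta=0.5):
    return (latent_distill_loss(student_latents, teacher_hidden)
            + beta * trajectory_preference_loss(s_pref, s_rej))
```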
Related papers
- ATA: Bridging Implicit Reasoning with Attention-Guided and Action-Guided Inference for Vision-Language Action Models [23.724460067995395]
Vision-Language-Action (VLA) models rely on current observations, including images, language instructions, and robot states, to predict actions and complete tasks. We propose ATA, a training-free framework that introduces implicit reasoning into VLA inference through complementary attention-guided strategies. ATA is a plug-and-play implicit reasoning approach for VLA models, lightweight yet effective.
arXiv Detail & Related papers (2026-03-02T05:56:03Z)
- Saliency-Aware Multi-Route Thinking: Revisiting Vision-Language Reasoning [50.62037276161025]
Vision-language models (VLMs) aim to reason by jointly leveraging visual and textual modalities. A key obstacle is that visual inputs are typically provided only once at the start of generation. We propose Saliency-Aware Principle (SAP) selection.
arXiv Detail & Related papers (2026-02-18T18:49:56Z)
- FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation [11.18316873483782]
Vision-and-Language Navigation (VLN) requires an embodied agent to jointly understand multimodal instructions and visual-spatial context. Recent works demonstrate the potential of Chain-of-Thought (CoT) reasoning for improving interpretability and long-horizon planning. We propose FantasyVLN, a unified implicit reasoning framework that preserves the benefits of CoT reasoning without explicit token overhead.
arXiv Detail & Related papers (2026-01-20T13:54:10Z)
- Learning Affordances at Inference-Time for Vision-Language-Action Models [50.93181349331096]
In robotics, Vision-Language-Action models (VLAs) offer a promising path towards solving complex control tasks. We introduce Learning from Inference-Time Execution (LITEN), which connects a VLA low-level policy to a high-level VLM that conditions on past experiences. Our approach iterates between a reasoning phase that generates and executes plans for the low-level VLA, and an assessment phase that reflects on the resulting execution.
arXiv Detail & Related papers (2025-10-22T16:43:29Z)
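The LITEN abstract spells out an alternation between a reasoning phase and an assessment phase. A minimal sketch of that loop follows; the interfaces (propose_plan, execute, reflect) and the iteration budget are hypothetical, not taken from the paper.

```python
# Hypothetical LITEN-style inference-time loop: a high-level VLM proposes
# a plan conditioned on accumulated experience, the low-level VLA executes
# it, and a reflection on the outcome is appended to memory for the next
# round. All object interfaces here are assumptions.
def inference_time_execution(vlm, vla, env, task, max_iters=5):
    memory = []  # past (plan, outcome, reflection) triples
    outcome = None
    for _ in range(max_iters):
        plan = vlm.propose_plan(task, env.observe(), memory)  # reasoning phase
        outcome = vla.execute(plan, env)                      # low-level rollout
        if outcome.success:
            break
        reflection = vlm.reflect(plan, outcome)               # assessment phase
        memory.append((plan, outcome, reflection))
    return outcome
```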
- Fast Thinking for Large Language Models [67.7238685892317]
We introduce Latent Codebooks for Fast Thinking, a framework that uses concise CoT sketches only during training to learn a codebook of discrete strategy priors. At inference, the model conditions on a handful of continuous thinking switches distilled from the codebook in a single pass, enabling strategy-level guidance without producing explicit reasoning tokens.
arXiv Detail & Related papers (2025-09-28T04:19:48Z)
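As described, the codebook is queried once per input to produce a handful of continuous "thinking switches" instead of decoded reasoning tokens. One plausible reading of that mechanism, sketched with an assumed soft top-k lookup and assumed sizes (the paper may use a different retrieval rule):

```python
# Hypothetical codebook-of-strategy-priors module: a pooled prompt encoding
# attends over a learned table of discrete strategies, and the top-k weighted
# code vectors are returned as continuous "thinking switches" to prepend to
# the model input. num_codes, dim, and num_switches are assumed values.
import torch
import torch.nn as nn

class StrategyCodebook(nn.Module):
    def __init__(self, num_codes=64, dim=768, num_switches=4):
        super().__init__()
        self.codes = nn.Embedding(num_codes, dim)  # discrete strategy priors
        self.query = nn.Linear(dim, dim)
        self.num_switches = num_switches

    def forward(self, prompt_summary):
        # prompt_summary: (B, D) pooled encoding of the input prompt.
        scores = self.query(prompt_summary) @ self.codes.weight.T  # (B, num_codes)
        attn = torch.softmax(scores, dim=-1)
        top = attn.topk(self.num_switches, dim=-1)
        switches = self.codes(top.indices) * top.values.unsqueeze(-1)
        return switches  # (B, num_switches, D), consumed in a single pass
```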
- ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning [47.27336786187929]
Vision-language-action (VLA) reasoning tasks require agents to interpret multimodal instructions, perform long-horizon planning, and act adaptively in dynamic environments. Existing approaches typically train VLA models in an end-to-end fashion, directly mapping inputs to actions without explicit reasoning. We propose ThinkAct, a dual-system framework that bridges high-level reasoning with low-level action execution via reinforced visual latent planning.
arXiv Detail & Related papers (2025-07-22T17:59:46Z)
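A dual-system framework of this kind typically decouples a slow deliberative planner from a fast reactive controller. A hedged sketch of such a rollout is given below; the replanning interval and every module interface are assumptions rather than details from the paper.

```python
# Hypothetical dual-system rollout: a slow reasoning module refreshes a
# visual latent plan every `replan_every` steps, while a fast action policy
# conditions on the cached plan at every control step. Interfaces assumed.
def dual_system_rollout(reasoner, policy, env, instruction,
                        horizon=200, replan_every=20):
    obs = env.reset()
    latent_plan = None
    for t in range(horizon):
        if t % replan_every == 0:               # slow system: deliberate planning
            latent_plan = reasoner.plan(obs, instruction)
        action = policy.act(obs, latent_plan)   # fast system: reactive control
        obs, done = env.step(action)
        if done:
            break
    return obs
```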
- Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing [61.98556945939045]
We propose a framework to learn planning-based reasoning through Direct Preference Optimization (DPO) on collected trajectories.
Our results on challenging logical reasoning benchmarks demonstrate the effectiveness of our learning framework.
arXiv Detail & Related papers (2024-02-01T15:18:33Z)
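For reference, the standard DPO loss that the last entry builds on, written as a short PyTorch-style sketch over whole reasoning trajectories; summing token log-probs to the trajectory level and beta=0.1 are assumptions, though the sigmoid-of-margins form is the published DPO objective.

```python
# Direct Preference Optimization over collected trajectories. Inputs are
# summed token log-probabilities of the chosen/rejected trajectory under
# the trained policy and a frozen reference policy.
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    chosen_margin = logp_chosen - ref_chosen        # implicit reward, winner
    rejected_margin = logp_rejected - ref_rejected  # implicit reward, loser
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```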