Unified Embodied VLM Reasoning with Robotic Action via Autoregressive Discretized Pre-training
- URL: http://arxiv.org/abs/2512.24125v2
- Date: Thu, 01 Jan 2026 17:42:44 GMT
- Title: Unified Embodied VLM Reasoning with Robotic Action via Autoregressive Discretized Pre-training
- Authors: Yi Liu, Sukai Wang, Dafeng Wei, Xiaowei Cai, Linqing Zhong, Jiange Yang, Guanghui Ren, Jinyu Zhang, Maoqing Yao, Chuankang Li, Xindong He, Liliang Chen, Jianlan Luo,
- Abstract summary: General-purpose robotic systems must achieve broad generalization and high-precision action execution. We introduce the Embodied Reasoning Intelligence Quotient (ERIQ), a large-scale embodied reasoning benchmark in robotic manipulation. We propose FACT, a flow-matching-based action tokenizer that converts continuous control into discrete sequences.
- Score: 16.28589738595606
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: General-purpose robotic systems operating in open-world environments must achieve both broad generalization and high-precision action execution, a combination that remains challenging for existing Vision-Language-Action (VLA) models. While large Vision-Language Models (VLMs) improve semantic generalization, insufficient embodied reasoning leads to brittle behavior, and conversely, strong reasoning alone is inadequate without precise control. To provide a decoupled and quantitative assessment of this bottleneck, we introduce Embodied Reasoning Intelligence Quotient (ERIQ), a large-scale embodied reasoning benchmark in robotic manipulation, comprising 6K+ question-answer pairs across four reasoning dimensions. By decoupling reasoning from execution, ERIQ enables systematic evaluation and reveals a strong positive correlation between embodied reasoning capability and end-to-end VLA generalization. To bridge the gap from reasoning to precise execution, we propose FACT, a flow-matching-based action tokenizer that converts continuous control into discrete sequences while preserving high-fidelity trajectory reconstruction. The resulting GenieReasoner jointly optimizes reasoning and action in a unified space, outperforming both continuous-action and prior discrete-action baselines in real-world tasks. Together, ERIQ and FACT provide a principled framework for diagnosing and overcoming the reasoning-precision trade-off, advancing robust, general-purpose robotic manipulation. Project page: https://geniereasoner.github.io/GenieReasoner/
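The abstract does not spell out how FACT is built, but the stated recipe (continuous control in, discrete token sequences out, high-fidelity reconstruction back) suggests a tokenizer of roughly the following shape. This is a minimal sketch under assumptions: the encoder, the VQ-style codebook, the velocity network, and every size and name below are illustrative, not the paper's architecture.

```python
# Minimal sketch, NOT the paper's architecture: encoder -> VQ tokens ->
# flow-matching decoder. Sizes, names, and the codebook design are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionTokenizerSketch(nn.Module):
    def __init__(self, action_dim=7, chunk_len=16, n_tokens=4,
                 codebook_size=512, hidden=256):
        super().__init__()
        d_in = action_dim * chunk_len
        # Encoder: continuous action chunk -> n_tokens latent vectors.
        self.encoder = nn.Sequential(nn.Linear(d_in, hidden), nn.GELU(),
                                     nn.Linear(hidden, n_tokens * hidden))
        # Codebook for nearest-neighbour quantization (VQ-style; assumed).
        self.codebook = nn.Embedding(codebook_size, hidden)
        # Velocity net: flow-matching field conditioned on tokens and time t.
        self.velocity = nn.Sequential(
            nn.Linear(d_in + n_tokens * hidden + 1, hidden), nn.GELU(),
            nn.Linear(hidden, d_in))
        self.n_tokens, self.hidden = n_tokens, hidden

    def quantize(self, z):
        # Nearest codebook entry per latent, with a straight-through gradient.
        d = (z.unsqueeze(2) - self.codebook.weight).pow(2).sum(-1)  # (B,n,K)
        ids = d.argmin(-1)
        q = self.codebook(ids)
        return ids, z + (q - z).detach()

    def loss(self, actions):
        # actions: (B, chunk_len, action_dim) continuous control chunk.
        flat = actions.flatten(1)
        z = self.encoder(flat).view(-1, self.n_tokens, self.hidden)
        ids, q = self.quantize(z)
        # Flow matching: interpolate noise -> data, regress the velocity.
        t = torch.rand(flat.size(0), 1, device=flat.device)
        noise = torch.randn_like(flat)
        x_t = (1 - t) * noise + t * flat
        v_pred = self.velocity(torch.cat([x_t, q.flatten(1), t], dim=-1))
        fm = F.mse_loss(v_pred, flat - noise)
        commit = F.mse_loss(z, q.detach())  # VQ commitment term (assumed)
        return fm + 0.25 * commit, ids

tok = ActionTokenizerSketch()
loss, ids = tok.loss(torch.randn(8, 16, 7))  # one step on a random batch
loss.backward()
```

At inference, a VLM such as GenieReasoner would emit the token ids autoregressively; integrating the conditioned velocity field from noise back to a trajectory (for example, with a few Euler steps) recovers the continuous action chunk, which is where the claimed high-fidelity reconstruction matters.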
Related papers
- Act, Think or Abstain: Complexity-Aware Adaptive Inference for Vision-Language-Action Models [7.802379200026965]
We propose an adaptive framework that dynamically routes VLA execution based on the complexity of the perceived state. Our approach transforms the VLA's vision-language backbone into an active detection tool by projecting latent embeddings into an ensemble of parametric and non-parametric estimators.
arXiv Detail & Related papers (2026-03-05T13:14:41Z)
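As a rough illustration of the routing idea summarized above, here is a hypothetical complexity scorer that combines a parametric head with a non-parametric nearest-neighbor estimate over latent embeddings; the ensemble rule, the thresholds, and all names are assumptions rather than the paper's design.

```python
# Hypothetical sketch of complexity-aware routing over a VLA's latent state.
# The estimator choices, weights, and thresholds are illustrative assumptions.
import numpy as np

class ComplexityRouter:
    def __init__(self, train_feats, w, b, t_act=0.3, t_abstain=0.8):
        self.train_feats = train_feats          # (N, D) seen-state embeddings
        self.w, self.b = w, b                   # parametric (linear) scorer
        self.t_act, self.t_abstain = t_act, t_abstain

    def score(self, z):
        # Parametric estimate: sigmoid of a learned linear projection.
        p = 1.0 / (1.0 + np.exp(-(z @ self.w + self.b)))
        # Non-parametric estimate: distance to the nearest training embedding,
        # squashed to (0, 1); far-from-data states count as complex.
        d = np.linalg.norm(self.train_feats - z, axis=1).min()
        n = 1.0 - np.exp(-d)
        return 0.5 * (p + n)                    # simple ensemble average

    def route(self, z):
        s = self.score(z)
        if s < self.t_act:
            return "act"        # fast reactive policy
        if s < self.t_abstain:
            return "think"      # invoke deliberate reasoning first
        return "abstain"        # defer / ask for help

# Usage with random stand-ins for embeddings and scorer weights.
rng = np.random.default_rng(0)
router = ComplexityRouter(rng.normal(size=(100, 8)), rng.normal(size=8), 0.0)
print(router.route(rng.normal(size=8)))
```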
- ALIVE: Awakening LLM Reasoning via Adversarial Learning and Instructive Verbal Evaluation [4.265094703231012]
We introduce ALIVE (Adversarial Learning with Instructive Verbal Evaluation), a hands-free alignment framework. By coupling adversarial learning with instructive verbal feedback, ALIVE enables models to internalize evaluative criteria directly from raw corpora. With identical data and compute, ALIVE achieves markedly improved cross-domain generalization and higher self-correction rates.
arXiv Detail & Related papers (2026-02-05T09:20:23Z)
- From Human Intention to Action Prediction: A Comprehensive Benchmark for Intention-driven End-to-End Autonomous Driving [67.23302649816466]
Current autonomous driving systems operate at a level of intelligence akin to following simple steering commands. We introduce Intention-Drive, the first comprehensive benchmark designed to evaluate the ability to translate high-level human intent into safe and precise driving actions.
arXiv Detail & Related papers (2025-12-13T11:59:51Z)
- Mind to Hand: Purposeful Robotic Control via Embodied Reasoning [12.275897522668858]
We introduce Lumo-1, a model that unifies robot reasoning ("mind") with robot action ("hand"). Our approach builds upon the general multi-modal reasoning capabilities of pre-trained vision-language models (VLMs). We integrate reinforcement learning to further refine reasoning-action consistency and close the loop between semantic inference and motor control.
arXiv Detail & Related papers (2025-12-09T13:19:37Z)
- DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action [62.70893433854428]
We propose DualVLA, which enhances action performance through carefully designed post-training while still preserving reasoning capability. Experiments show that DualVLA achieves an average success rate of 61.0 in SimplerEnv and an average score of 65.4 across eight competitive multimodal benchmarks.
arXiv Detail & Related papers (2025-11-27T06:03:53Z)
- DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models [51.76664843721462]
DeepThinkVLA is a new architecture for Vision-Language-Action models. It generates sequential chain-of-thought (CoT) reasoning with causal attention, then switches to bidirectional attention for fast decoding of action vectors. It achieves a 97.0% success rate on the LIBERO benchmark.
arXiv Detail & Related papers (2025-10-31T05:26:16Z)
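The attention switch that the DeepThinkVLA summary describes can be pictured as a hybrid mask: causal over the chain-of-thought prefix, bidirectional within the trailing action block so the whole action vector decodes in one parallel pass. A toy construction, with the token layout and sizes assumed:

```python
# Illustrative mask for the hybrid attention pattern in the summary above:
# causal over CoT tokens, bidirectional within the trailing action block.
import torch

def hybrid_attention_mask(n_cot: int, n_act: int) -> torch.Tensor:
    """True = attention allowed. Layout: [CoT tokens | action tokens]."""
    n = n_cot + n_act
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))  # causal baseline
    # Action tokens attend to each other bidirectionally (and, via the
    # causal baseline, to every earlier CoT token).
    mask[n_cot:, n_cot:] = True
    return mask

m = hybrid_attention_mask(n_cot=4, n_act=3)
print(m.int())
# Rows 0-3 (CoT) stay strictly causal; rows 4-6 (action) see the full
# action block plus all reasoning tokens, enabling parallel decoding.
```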
- From <Answer> to <Think>: Multidimensional Supervision of Reasoning Process for LLM Optimization [62.07990937720985]
The Dimension-level Reward Model (DRM) is a new supervision framework for Large Language Models. DRM evaluates the quality of a reasoning process along three fundamental, complementary, and interpretable dimensions. Experimental results show that DRM provides effective supervision signals, guides the optimization of LLMs, and enhances their reasoning ability.
arXiv Detail & Related papers (2025-10-13T14:29:15Z)
- IntentionVLA: Generalizable and Efficient Embodied Intention Reasoning for Human-Robot Interaction [51.130510883952546]
Vision-Language-Action (VLA) models leverage pretrained vision-language models (VLMs) to couple perception with robotic control. We propose IntentionVLA, a VLA framework with a curriculum training paradigm and an efficient inference mechanism. Our method first leverages carefully designed reasoning data that combine intention inference, spatial grounding, and compact embodied reasoning.
arXiv Detail & Related papers (2025-10-09T04:49:46Z) - Perception-Consistency Multimodal Large Language Models Reasoning via Caption-Regularized Policy Optimization [72.30168853571216]
Multimodal large language models excel at tasks that integrate visual perception with symbolic reasoning. CapPO integrates two key mechanisms: (1) a caption-based consistency regularization, which minimizes the divergence between responses conditioned on raw images and those conditioned on captions, and (2) a KL-weighted advantage estimation scheme, which adaptively scales reinforcement signals to strengthen perceptually consistent trajectories.
arXiv Detail & Related papers (2025-09-26T04:32:26Z)
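The two CapPO mechanisms quoted above can be rendered as a toy loss computation; the direction of the KL divergence, the exponential weighting, and all tensor shapes are assumptions made for illustration, not the paper's exact formulation.

```python
# Toy rendering of the two CapPO mechanisms as stated in the summary;
# the tensors stand in for real model outputs and all scales are assumed.
import torch
import torch.nn.functional as F

def cappo_terms(logits_img, logits_cap, advantages, beta=0.1):
    """
    logits_img: (B, T, V) policy logits conditioned on the raw image
    logits_cap: (B, T, V) policy logits conditioned on the image caption
    advantages: (B,) per-sample RL advantages
    """
    logp_img = F.log_softmax(logits_img, dim=-1)
    p_cap = F.softmax(logits_cap, dim=-1)
    # (1) Consistency regularizer: KL(caption-conditioned || image-conditioned)
    # averaged over tokens; small when the two views of the state agree.
    kl_tok = F.kl_div(logp_img, p_cap, reduction="none").sum(-1)  # (B, T)
    kl = kl_tok.mean(-1)                                          # (B,)
    consistency_loss = kl.mean()
    # (2) KL-weighted advantages: down-weight reinforcement on samples whose
    # image- and caption-conditioned behavior diverge (perceptually suspect).
    weighted_adv = advantages * torch.exp(-beta * kl.detach())
    return consistency_loss, weighted_adv

c, a = cappo_terms(torch.randn(2, 5, 10), torch.randn(2, 5, 10),
                   torch.randn(2))
```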
- Sycophancy Mitigation Through Reinforcement Learning with Uncertainty-Aware Adaptive Reasoning Trajectories [58.988535279557546]
We introduce SMART (Sycophancy Mitigation through Adaptive Reasoning Trajectories). We show that SMART significantly reduces sycophantic behavior while preserving strong performance on out-of-distribution inputs.
arXiv Detail & Related papers (2025-09-20T17:09:14Z)
- Adaptive Termination for Multi-round Parallel Reasoning: An Universal Semantic Entropy-Guided Framework [12.361554676966552]
Recent advances in large language models (LLMs) have accelerated progress toward artificial general intelligence. We aim to design a flexible test-time collaborative inference framework that exploits the complementary strengths of both sequential and parallel reasoning paradigms.
arXiv Detail & Related papers (2025-07-09T13:28:35Z)
- Robotic Control via Embodied Chain-of-Thought Reasoning [86.6680905262442]
A key limitation of learned robot control policies is their inability to generalize outside their training data. Recent work on vision-language-action models (VLAs) has shown that using large, internet pre-trained vision-language models can substantially improve robustness and generalization. We introduce Embodied Chain-of-Thought Reasoning (ECoT) for VLAs, in which we train VLAs to perform multiple steps of reasoning about plans, sub-tasks, motions, and visually grounded features before predicting the robot action.
arXiv Detail & Related papers (2024-07-11T17:31:01Z)
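The ECoT training target can be pictured as one autoregressive sequence in which structured reasoning fields precede the action tokens; the field names and the action encoding below are assumptions based on the summary, not the paper's exact format.

```python
# Hypothetical ECoT-style training target: reasoning fields are serialized
# before the action so the policy must "think" token-by-token first.
# Field names and the discretized action encoding are illustrative assumptions.
def ecot_target(plan, subtask, motion, grounding, action):
    fields = [
        f"PLAN: {plan}",
        f"SUBTASK: {subtask}",
        f"MOVE: {motion}",
        f"VISIBLE OBJECTS: {grounding}",
        f"ACTION: {' '.join(str(a) for a in action)}",
    ]
    return "\n".join(fields)

print(ecot_target(
    plan="pick up the mug and place it on the shelf",
    subtask="grasp the mug handle",
    motion="move gripper left and down, then close",
    grounding="mug at (112, 88), shelf at (240, 40)",
    action=[3, 127, 90, 255, 0, 12, 64],   # e.g. discretized action bins
))
```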
This list is automatically generated from the titles and abstracts of the papers on this site.