Mind to Hand: Purposeful Robotic Control via Embodied Reasoning
- URL: http://arxiv.org/abs/2512.08580v2
- Date: Wed, 10 Dec 2025 12:05:30 GMT
- Title: Mind to Hand: Purposeful Robotic Control via Embodied Reasoning
- Authors: Peijun Tang, Shangjin Xie, Binyan Sun, Baifu Huang, Kuncheng Luo, Haotian Yang, Weiqi Jin, Jianan Wang,
- Abstract summary: We introduce Lumo-1, a model that unifies robot reasoning ("mind") with robot action ("hand")<n>Our approach builds upon the general multi-modal reasoning capabilities of pre-trained vision-language models (VLMs)<n>We integrate reinforcement learning to further refine reasoning-action consistency and close the loop between semantic inference and motor control.
- Score: 12.275897522668858
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans act with context and intention, with reasoning playing a central role. While internet-scale data has enabled broad reasoning capabilities in AI systems, grounding these abilities in physical action remains a major challenge. We introduce Lumo-1, a generalist vision-language-action (VLA) model that unifies robot reasoning ("mind") with robot action ("hand"). Our approach builds upon the general multi-modal reasoning capabilities of pre-trained vision-language models (VLMs), progressively extending them to embodied reasoning and action prediction, and ultimately towards structured reasoning and reasoning-action alignment. This results in a three-stage pre-training pipeline: (1) Continued VLM pre-training on curated vision-language data to enhance embodied reasoning skills such as planning, spatial understanding, and trajectory prediction; (2) Co-training on cross-embodiment robot data alongside vision-language data; and (3) Action training with reasoning process on trajectories collected on Astribot S1, a bimanual mobile manipulator with human-like dexterity and agility. Finally, we integrate reinforcement learning to further refine reasoning-action consistency and close the loop between semantic inference and motor control. Extensive experiments demonstrate that Lumo-1 achieves significant performance improvements in embodied vision-language reasoning, a critical component for generalist robotic control. Real-world evaluations further show that Lumo-1 surpasses strong baselines across a wide range of challenging robotic tasks, with strong generalization to novel objects and environments, excelling particularly in long-horizon tasks and responding to human-natural instructions that require reasoning over strategy, concepts and space.
Related papers
- Mechanistic Finetuning of Vision-Language-Action Models via Few-Shot Demonstrations [76.79742393097358]
Vision-Language Action (VLAs) models promise to extend the remarkable success of vision-language models (VLMs) to robotics.<n>Existing fine-tuning methods lack specificity, adapting the same set of parameters regardless of a task's visual, linguistic, and physical characteristics.<n>Inspired by functional specificity in neuroscience, we hypothesize that it is more effective to finetune sparse model representations specific to a given task.
arXiv Detail & Related papers (2025-11-27T18:50:21Z) - RoboGPT-R1: Enhancing Robot Planning with Reinforcement Learning [6.12099996406339]
We propose RoboGPT-R1, a two-stage fine-tuning framework for embodied planning.<n>In this framework, supervised training acquires foundational knowledge through expert sequences, followed by RL to address the model's shortcomings in visual-spatial understanding and reasoning.<n>The reasoning model, trained on Qwen2.5-VL-3B, significantly outperforms the larger-scale model, GPT-4o-mini, by 21.33% and surpasses other work trained on Qwen2.5-VL-7B by 20.33% on the EmbodiedBench benchmark.
arXiv Detail & Related papers (2025-10-16T16:04:35Z) - Executable Analytic Concepts as the Missing Link Between VLM Insight and Precise Manipulation [70.8381970762877]
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in semantic reasoning and task planning.<n>We introduce GRACE, a novel framework that grounds VLM-based reasoning through executable analytic concepts.<n>G GRACE provides a unified and interpretable interface between high-level instruction understanding and low-level robot control.
arXiv Detail & Related papers (2025-10-09T09:08:33Z) - IntentionVLA: Generalizable and Efficient Embodied Intention Reasoning for Human-Robot Interaction [51.130510883952546]
Vision-Language-Action (VLA) models leverage pretrained vision-language models (VLMs) to couple perception with robotic control.<n>We propose textbfIntentionVLA, a VLA framework with a curriculum training paradigm and an efficient inference mechanism.<n>Our proposed method first leverages carefully designed reasoning data that combine intention inference, spatial grounding, and compact embodied reasoning.
arXiv Detail & Related papers (2025-10-09T04:49:46Z) - Enhancing Generalization in Vision-Language-Action Models by Preserving Pretrained Representations [26.678553477485362]
We present a framework that better preserves pretrained features while adapting them for robot manipulation.<n>Our approach introduces three components: (i) a dual-encoder design with one frozen vision to retain pretrained features and another trainable for task adaptation, (ii) a string-based action tokenizer that casts continuous actions into character sequences aligned with the model's pretraining domain, and (iii) a co-training strategy that combines robot demonstrations with vision-language datasets emphasizing spatial reasoning and affordances.
arXiv Detail & Related papers (2025-09-14T20:08:56Z) - Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics [55.05920313034645]
We introduce Robot-R1, a novel framework that leverages reinforcement learning to enhance embodied reasoning specifically for robot control.<n>Inspired by the DeepSeek-R1 learning approach, Robot-R1 samples reasoning-based responses and reinforces those that lead to more accurate predictions.<n>Our experiments show that models trained with Robot-R1 outperform SFT methods on embodied reasoning tasks.
arXiv Detail & Related papers (2025-05-29T16:41:12Z) - Robotic Control via Embodied Chain-of-Thought Reasoning [86.6680905262442]
Key limitation of learned robot control policies is their inability to generalize outside their training data.<n>Recent works on vision-language-action models (VLAs) have shown that the use of large, internet pre-trained vision-language models can substantially improve their robustness and generalization ability.<n>We introduce Embodied Chain-of-Thought Reasoning (ECoT) for VLAs, in which we train VLAs to perform multiple steps of reasoning about plans, sub-tasks, motions, and visually grounded features before predicting the robot action.
arXiv Detail & Related papers (2024-07-11T17:31:01Z) - RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic
Control [140.48218261864153]
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control.
Our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training.
arXiv Detail & Related papers (2023-07-28T21:18:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.