Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation
- URL: http://arxiv.org/abs/2508.13998v1
- Date: Tue, 19 Aug 2025 16:50:01 GMT
- Title: Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation
- Authors: Yifu Yuan, Haiqin Cui, Yaoting Huang, Yibin Chen, Fei Ni, Zibin Dong, Pengyi Li, Yan Zheng, Jianye Hao,
- Abstract summary: We introduce Embodied-R1, a 3B Vision-Language Model (VLM) specifically designed for embodied reasoning and pointing. We use a wide range of embodied and general visual reasoning datasets as sources to construct a large-scale dataset, Embodied-Points-200K. Embodied-R1 achieves state-of-the-art performance on 11 embodied spatial and pointing benchmarks.
- Score: 36.57297063636042
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generalization in embodied AI is hindered by the "seeing-to-doing gap," which stems from data scarcity and embodiment heterogeneity. To address this, we pioneer "pointing" as a unified, embodiment-agnostic intermediate representation, defining four core embodied pointing abilities that bridge high-level vision-language comprehension with low-level action primitives. We introduce Embodied-R1, a 3B Vision-Language Model (VLM) specifically designed for embodied reasoning and pointing. We use a wide range of embodied and general visual reasoning datasets as sources to construct a large-scale dataset, Embodied-Points-200K, which supports key embodied pointing capabilities. We then train Embodied-R1 using a two-stage Reinforced Fine-tuning (RFT) curriculum with a specialized multi-task reward design. Embodied-R1 achieves state-of-the-art performance on 11 embodied spatial and pointing benchmarks. Critically, it demonstrates robust zero-shot generalization by achieving a 56.2% success rate in SimplerEnv and 87.5% across 8 real-world XArm tasks without any task-specific fine-tuning, representing a 62% improvement over strong baselines. Furthermore, the model exhibits high robustness against diverse visual disturbances. Our work shows that a pointing-centric representation, combined with an RFT training paradigm, offers an effective and generalizable pathway to closing the perception-action gap in robotics.
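The abstract does not spell out the multi-task reward design, but pointing predictions are attractive for RFT precisely because they are cheap to verify against ground truth. A minimal sketch of one such verifiable reward, assuming a binary in-region check (the function name and mask format are illustrative, not the paper's API):

```python
def pointing_reward(pred_point, gt_mask):
    """Binary verifiable reward for a pointing prediction: 1.0 if the
    predicted (x, y) pixel falls inside the ground-truth target region,
    else 0.0. gt_mask is a row-major H x W grid of booleans."""
    x, y = int(round(pred_point[0])), int(round(pred_point[1]))
    h, w = len(gt_mask), len(gt_mask[0])
    if not (0 <= x < w and 0 <= y < h):
        return 0.0  # points outside the image earn no reward
    return 1.0 if gt_mask[y][x] else 0.0

# toy 4x4 mask whose target occupies the top-left 2x2 block
mask = [[c < 2 and r < 2 for c in range(4)] for r in range(4)]
```

Because the reward is checked mechanically rather than scored by a learned model, it gives RL fine-tuning a signal that is hard to game.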
Related papers
- Universal Pose Pretraining for Generalizable Vision-Language-Action Policies [83.39008378156647]
Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency. We propose Pose-VLA, a decoupled paradigm that separates VLA training into a pre-training phase for extracting universal 3D spatial priors. Our framework follows a two-stage pre-training pipeline, establishing fundamental spatial grounding via poses followed by motion alignment.
arXiv Detail & Related papers (2026-02-23T11:00:08Z) - Rethinking Video Generation Model for the Embodied World [26.174517437895616]
RBench is designed to evaluate robot-oriented video generation across five task domains and four distinct embodiments. Evaluation of 25 representative models highlights significant deficiencies in generating physically realistic robot behaviors. We introduce a refined four-stage data pipeline, resulting in RoVid-X, the largest open-source robotic dataset for video generation, with 4 million annotated video clips.
arXiv Detail & Related papers (2026-01-21T18:59:18Z) - SSVP: Synergistic Semantic-Visual Prompting for Industrial Zero-Shot Anomaly Detection [55.54007781679915]
We propose Synergistic Semantic-Visual Prompting (SSVP), which efficiently fuses diverse visual encodings to elevate the model's fine-grained perception. SSVP achieves state-of-the-art performance with 93.0% Image-AUROC and 92.2% Pixel-AUROC on MVTec-AD, significantly outperforming existing zero-shot approaches.
arXiv Detail & Related papers (2026-01-14T04:42:19Z) - SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models [73.19077622773075]
We present a comprehensive methodology for building spatial intelligence progressively. We introduce SpatialLadder-26k, a multimodal dataset containing 26,610 samples spanning object localization, single-image, multi-view, and video spatial reasoning tasks. We design a three-stage progressive training framework that establishes spatial perception through object localization, develops spatial understanding through multi-dimensional spatial tasks, and strengthens complex reasoning via reinforcement learning with verifiable rewards.
arXiv Detail & Related papers (2025-10-09T17:50:54Z) - Affordance-R1: Reinforcement Learning for Generalizable Affordance Reasoning in Multimodal Large Language Model [23.56313087226691]
Affordance grounding focuses on predicting the specific regions of objects that are associated with the actions to be performed by robots. Existing models often neglect the affordances shared among different objects because they lack Chain-of-Thought (CoT) reasoning abilities. We propose Affordance-R1, the first unified affordance grounding framework that integrates cognitive CoT-guided Group Relative Policy Optimization (GRPO).
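GRPO, which Affordance-R1 builds on, replaces the learned value baseline of PPO-style methods with a group-relative one: each sampled response's reward is normalized against the mean and standard deviation of its own group of samples. A minimal sketch of that normalization step (not the paper's implementation):

```python
def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: normalize each sampled
    response's reward by the group's mean and std, removing the need
    for a separately trained value network."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    if std == 0.0:
        std = 1.0  # degenerate group: all rewards equal
    return [(r - mean) / std for r in rewards]

# four sampled responses, two rewarded and two not
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Responses better than their group average get positive advantage and are reinforced; worse-than-average responses are suppressed.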
arXiv Detail & Related papers (2025-08-08T10:39:04Z) - Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy [26.455112415445146]
We introduce Skywork-Reward-V2, a suite of eight reward models ranging from 0.6B to 8B parameters, trained on a subset of 26 million preference pairs from SynPref-40M. We demonstrate that Skywork-Reward-V2 is versatile across a wide range of capabilities, including alignment with human preferences, objective correctness, safety, resistance to stylistic biases, and best-of-N scaling.
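Best-of-N scaling, one of the capabilities Skywork-Reward-V2 is evaluated on, simply scores N sampled responses with a reward model and keeps the top one. A minimal sketch (the length-based toy reward is a stand-in for a learned reward model, not anything from the paper):

```python
def best_of_n(candidates, reward_fn):
    """Best-of-N selection: score every sampled candidate with a
    reward model and return the top-scoring one with its score."""
    scores = [reward_fn(c) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]

# toy stand-in for a learned reward model: prefer longer answers
answer, score = best_of_n(["a", "abc", "ab"], reward_fn=len)
```

The quality of the reward model is the whole game here: a biased scorer makes best-of-N amplify the bias rather than the answer quality.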
arXiv Detail & Related papers (2025-07-02T04:40:29Z) - Beginning with You: Perceptual-Initialization Improves Vision-Language Representation and Alignment [2.3735961220736423]
We introduce Perceptual-Initialization (PI), a paradigm shift in visual representation learning. Our approach demonstrates significant zero-shot performance improvements without task-specific fine-tuning. Our work shows that "beginning with you", starting with human perception, provides a stronger foundation for general-purpose vision-language intelligence.
arXiv Detail & Related papers (2025-05-20T11:04:14Z) - From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation [35.79160868966466]
We propose FSD (From Seeing to Doing), a novel vision-language model that generates intermediate representations through spatial relationship reasoning. Our approach combines a hierarchical data pipeline for training with a self-consistency mechanism that aligns spatial coordinates with visual signals. We show that FSD achieves a 40.6% success rate in SimplerEnv and a 72% success rate across 8 real-world tasks, outperforming the strongest baseline by 30%.
arXiv Detail & Related papers (2025-05-13T13:20:46Z) - Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs [65.93003087656754]
VisFactor is a benchmark that digitizes 20 vision-centric subtests from a well-established cognitive psychology assessment. We evaluate 20 frontier Multimodal Large Language Models (MLLMs) from the GPT, Gemini, Claude, LLaMA, Qwen, and SEED families. The best-performing model achieves a score of only 25.19 out of 100, with consistent failures on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination.
arXiv Detail & Related papers (2025-02-23T04:21:32Z) - Identity-Seeking Self-Supervised Representation Learning for Generalizable Person Re-identification [55.1738496692892]
Prior domain-generalizable (DG) ReID methods employ limited labeled data for training due to the high cost of annotation.
We propose an Identity-seeking Self-supervised Representation learning (ISR) method.
ISR constructs positive pairs from inter-frame images by modeling the instance association as a maximum-weight bipartite matching problem.
ISR achieves 87.0% Rank-1 on Market-1501 and 56.4% Rank-1 on MSMT17, outperforming the best supervised domain-generalizable method by 5.0% and 19.5%, respectively.
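ISR's positive pairs come from casting inter-frame instance association as maximum-weight bipartite matching: each detection in one frame is paired with the detection in the next frame that maximizes total similarity. A brute-force sketch for small square similarity matrices (at scale one would use the Hungarian algorithm, e.g. `scipy.optimize.linear_sum_assignment`; the toy matrix below is illustrative):

```python
from itertools import permutations

def max_weight_matching(sim):
    """Brute-force maximum-weight bipartite matching over an n x n
    similarity matrix sim[i][j] between instances in two frames.
    Returns the (i, j) pairs that maximize total similarity."""
    n = len(sim)
    best_pairs, best_score = None, float("-inf")
    for perm in permutations(range(n)):
        score = sum(sim[i][perm[i]] for i in range(n))
        if score > best_score:
            best_score = score
            best_pairs = [(i, perm[i]) for i in range(n)]
    return best_pairs

# toy 3x3 similarity matrix between detections in frame t and t+1
sim = [[0.9, 0.1, 0.0],
       [0.2, 0.8, 0.3],
       [0.0, 0.4, 0.7]]
pairs = max_weight_matching(sim)  # positive pairs for contrastive training
```

Enforcing a one-to-one matching, rather than greedy nearest-neighbor pairing, prevents two detections from claiming the same identity across frames.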
arXiv Detail & Related papers (2023-08-17T09:46:27Z) - Scaling Data Generation in Vision-and-Language Navigation [116.95534559103788]
We propose an effective paradigm for generating large-scale data for learning.
We use 1200+ photo-realistic environments from the HM3D and Gibson datasets and synthesize 4.9 million instruction-trajectory pairs.
Thanks to our large-scale dataset, the performance of an existing agent can be pushed (+11% absolute over the previous SoTA) to a new best of an 80% single-run success rate on the R2R test split by simple imitation learning.
arXiv Detail & Related papers (2023-07-28T16:03:28Z) - On Exploring Pose Estimation as an Auxiliary Learning Task for Visible-Infrared Person Re-identification [66.58450185833479]
In this paper, we exploit Pose Estimation as an auxiliary learning task to assist the VI-ReID task in an end-to-end framework.
By jointly training these two tasks in a mutually beneficial manner, our model learns higher-quality modality-shared and ID-related features.
Experimental results on two benchmark VI-ReID datasets show that the proposed method consistently improves state-of-the-art methods by significant margins.
arXiv Detail & Related papers (2022-01-11T09:44:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.