RoboGPT-R1: Enhancing Robot Planning with Reinforcement Learning
- URL: http://arxiv.org/abs/2510.14828v2
- Date: Wed, 22 Oct 2025 13:03:47 GMT
- Title: RoboGPT-R1: Enhancing Robot Planning with Reinforcement Learning
- Authors: Jinrui Liu, Bingyan Nie, Boyu Li, Yaran Chen, Yuze Wang, Shunsen He, Haoran Li,
- Abstract summary: We propose RoboGPT-R1, a two-stage fine-tuning framework for embodied planning. In this framework, supervised training acquires foundational knowledge through expert sequences, followed by RL to address the model's shortcomings in visual-spatial understanding and reasoning. The reasoning model, trained on Qwen2.5-VL-3B, significantly outperforms the larger-scale model, GPT-4o-mini, by 21.33% and surpasses other work trained on Qwen2.5-VL-7B by 20.33% on the EmbodiedBench benchmark.
- Score: 6.12099996406339
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Improving the reasoning capabilities of embodied agents is crucial for robots to successfully complete complex human instructions in long-horizon manipulation tasks. Despite the success of large language models and vision language models based on Supervised Fine-Tuning (SFT) in planning tasks, they continue to face challenges in performing long-horizon manipulation tasks in complex real-world environments, owing to their restricted common sense and reasoning capabilities. Considering that aligning general-purpose vision language models to robotic planning tasks via supervised fine-tuning suffers from poor generalization and insufficient physical understanding, we propose RoboGPT-R1, a two-stage fine-tuning framework for embodied planning. In this framework, supervised training acquires foundational knowledge through expert sequences, followed by RL to address the model's shortcomings in visual-spatial understanding and reasoning. To achieve physical understanding and action sequence consistency in multi-step reasoning tasks, we design a rule-based reward function that simultaneously considers long-horizon performance and action constraints in the environment. The reasoning model, trained on Qwen2.5-VL-3B, significantly outperforms the larger-scale model, GPT-4o-mini, by 21.33% and surpasses other work trained on Qwen2.5-VL-7B by 20.33% on the EmbodiedBench benchmark.
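For illustration, the reward design described in the abstract can be read as a weighted combination of a long-horizon consistency term and an action-feasibility term. The Python sketch below is a hypothetical reconstruction, not the paper's code: the function names, weights, and matching rules are assumptions chosen only to make the idea concrete.

```python
# Hypothetical sketch of a rule-based planning reward in the spirit of the
# abstract: it jointly scores long-horizon plan quality (prefix overlap with
# an expert action sequence) and per-step action feasibility in the environment.
# Names, weights, and scoring rules are assumptions, not taken from the paper.

from typing import List, Set


def action_constraint_score(plan: List[str], valid_actions: Set[str]) -> float:
    """Fraction of predicted steps that are executable actions in the environment."""
    if not plan:
        return 0.0
    return sum(a in valid_actions for a in plan) / len(plan)


def long_horizon_score(plan: List[str], expert_plan: List[str]) -> float:
    """Reward the longest prefix of steps matching the expert sequence,
    favoring consistency across multi-step plans rather than isolated hits."""
    matched = 0
    for pred, ref in zip(plan, expert_plan):
        if pred != ref:
            break
        matched += 1
    return matched / max(len(expert_plan), 1)


def rule_based_reward(plan: List[str],
                      expert_plan: List[str],
                      valid_actions: Set[str],
                      w_horizon: float = 0.7,
                      w_constraint: float = 0.3) -> float:
    """Weighted combination of long-horizon consistency and action validity."""
    return (w_horizon * long_horizon_score(plan, expert_plan)
            + w_constraint * action_constraint_score(plan, valid_actions))


if __name__ == "__main__":
    expert = ["open drawer", "pick up mug", "place mug on table"]
    predicted = ["open drawer", "pick up mug", "fly to moon"]
    valid = {"open drawer", "pick up mug", "place mug on table", "close drawer"}
    print(rule_based_reward(predicted, expert, valid))  # partial credit: ~0.67
```

Under this assumed design, a plan that matches the expert sequence step by step and only uses executable actions receives the full reward, while hallucinated or infeasible actions reduce it, which is one plausible way to encode both long-horizon performance and environment action constraints in a single scalar signal.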
Related papers
- MagicAgent: Towards Generalized Agent Planning [73.21129030631421]
We present MagicAgent, a series of foundation models specifically designed for generalized agent planning. We introduce a lightweight and scalable synthetic data framework that generates high-quality trajectories across diverse planning tasks. We show that MagicAgent-32B and MagicAgent-30B-A3B achieve superior performance across diverse open-source benchmarks.
arXiv Detail & Related papers (2026-02-22T01:39:16Z)
- Mind to Hand: Purposeful Robotic Control via Embodied Reasoning [12.275897522668858]
We introduce Lumo-1, a model that unifies robot reasoning ("mind") with robot action ("hand"). Our approach builds upon the general multi-modal reasoning capabilities of pre-trained vision-language models (VLMs). We integrate reinforcement learning to further refine reasoning-action consistency and close the loop between semantic inference and motor control.
arXiv Detail & Related papers (2025-12-09T13:19:37Z)
- Robix: A Unified Model for Robot Interaction, Reasoning and Planning [28.191138548365203]
Robix is a unified model that integrates robot reasoning, task planning, and natural language interaction within a single vision-language architecture. Robix dynamically generates atomic commands for the low-level controller and verbal responses for human interaction.
arXiv Detail & Related papers (2025-09-01T03:53:47Z)
- Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics [55.05920313034645]
We introduce Robot-R1, a novel framework that leverages reinforcement learning to enhance embodied reasoning specifically for robot control. Inspired by the DeepSeek-R1 learning approach, Robot-R1 samples reasoning-based responses and reinforces those that lead to more accurate predictions. Our experiments show that models trained with Robot-R1 outperform SFT methods on embodied reasoning tasks.
arXiv Detail & Related papers (2025-05-29T16:41:12Z)
- REMAC: Self-Reflective and Self-Evolving Multi-Agent Collaboration for Long-Horizon Robot Manipulation [57.628771707989166]
We propose an adaptive multi-agent planning framework, termed REMAC, that enables efficient, scene-agnostic multi-robot long-horizon task planning and execution. REMAC incorporates two key modules: a self-reflection module that performs pre-condition and post-condition checks in the loop to evaluate progress and refine plans, and a self-evolvement module that dynamically adapts plans based on scene-specific reasoning.
arXiv Detail & Related papers (2025-03-28T03:51:40Z)
- Towards Autonomous Reinforcement Learning for Real-World Robotic Manipulation with Large Language Models [5.2364456910271935]
We propose an unsupervised pipeline to generate reward functions from natural language task descriptions. The rewards are used to train RL agents in simulated environments, where we formalize the reward generation process to enhance feasibility. Our approach is validated through extensive simulated experiments on single-arm and bi-manual manipulation tasks using an ABB YuMi collaborative robot.
arXiv Detail & Related papers (2025-03-06T10:08:44Z)
- Grounding Language Models in Autonomous Loco-manipulation Tasks [3.8363685417355557]
We propose a novel framework that learns, selects, and plans behaviors based on tasks in different scenarios.
We leverage the planning and reasoning features of the large language model (LLM), constructing a hierarchical task graph.
Experiments in simulation and the real world using the CENTAURO robot show that the language-model-based planner can efficiently adapt to new loco-manipulation tasks.
arXiv Detail & Related papers (2024-09-02T15:27:48Z)
- Robotic Control via Embodied Chain-of-Thought Reasoning [86.6680905262442]
A key limitation of learned robot control policies is their inability to generalize outside their training data. Recent works on vision-language-action models (VLAs) have shown that the use of large, internet pre-trained vision-language models can substantially improve their robustness and generalization ability. We introduce Embodied Chain-of-Thought Reasoning (ECoT) for VLAs, in which we train VLAs to perform multiple steps of reasoning about plans, sub-tasks, motions, and visually grounded features before predicting the robot action.
arXiv Detail & Related papers (2024-07-11T17:31:01Z)
- SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation [62.58480650443393]
SAM-E leverages Segment Anything (SAM), a vision foundation model, for generalizable scene understanding and sequence imitation.
We develop a novel multi-channel heatmap that enables the prediction of the action sequence in a single pass.
arXiv Detail & Related papers (2024-05-30T00:32:51Z)
- EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought [95.37585041654535]
Embodied AI is capable of planning and executing action sequences for robots to accomplish long-horizon tasks in physical environments.
In this work, we introduce EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI.
Experiments show the effectiveness of EmbodiedGPT on embodied tasks, including embodied planning, embodied control, visual captioning, and visual question answering.
arXiv Detail & Related papers (2023-05-24T11:04:30Z)