Related papers: Phoenix: A Motion-based Self-Reflection Framework for Fine-grained Robotic Action Correction

URL: http://arxiv.org/abs/2504.14588v1
Date: Sun, 20 Apr 2025 12:30:43 GMT
Title: Phoenix: A Motion-based Self-Reflection Framework for Fine-grained Robotic Action Correction
Authors: Wenke Xia, Ruoxuan Feng, Dong Wang, Di Hu
Abstract summary: Building a generalizable self-correction system is crucial for robots to recover from failures. We build the Phoenix framework, which leverages motion instruction as a bridge to connect high-level semantic reflection with low-level robotic action correction. Experiments conducted in both the RoboMimic simulation and real-world scenarios prove the superior generalization and robustness of our framework.
Score: 10.38090975412416
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Building a generalizable self-correction system is crucial for robots to recover from failures. Despite advancements in Multimodal Large Language Models (MLLMs) that empower robots with semantic reflection ability for failure, translating semantic reflection into how to correct fine-grained robotic actions remains a significant challenge. To address this gap, we build the Phoenix framework, which leverages motion instruction as a bridge to connect high-level semantic reflection with low-level robotic action correction. In this motion-based self-reflection framework, we start with a dual-process motion adjustment mechanism with MLLMs to translate the semantic reflection into coarse-grained motion instruction adjustment. To leverage this motion instruction for guiding how to correct fine-grained robotic actions, a multi-task motion-conditioned diffusion policy is proposed to integrate visual observations for high-frequency robotic action correction. By combining these two models, we could shift the demand for generalization capability from the low-level manipulation policy to the MLLMs-driven motion adjustment model and facilitate precise, fine-grained robotic action correction. Utilizing this framework, we further develop a lifelong learning method to automatically improve the model's capability from interactions with dynamic environments. The experiments conducted in both the RoboMimic simulation and real-world scenarios prove the superior generalization and robustness of our framework across a variety of manipulation tasks. Our code is released at https://github.com/GeWu-Lab/Motion-based-Self-Reflection-Framework.

Related papers

Trajectory Adaptation using Large Language Models [0.8704964543257245]
Adapting robot trajectories based on human instructions as per new situations is essential for achieving more intuitive and scalable human-robot interactions. This work proposes a flexible language-based framework to adapt generic robotic trajectories produced by off-the-shelf motion planners. We utilize pre-trained LLMs to adapt trajectory waypoints by generating code as a policy for dense robot manipulation.
arXiv Detail & Related papers (2025-04-17T08:48:23Z)
RoboAct-CLIP: Video-Driven Pre-training of Atomic Action Understanding for Robotics [22.007302996282085]
This paper presents a temporal-decoupling fine-tuning strategy based on Contrastive Language-Image Pretraining (CLIP) architecture. Results in simulated environments demonstrate that the RoboAct-CLIP pretrained model achieves a 12% higher success rate than baseline Visual Language Models.
arXiv Detail & Related papers (2025-04-02T19:02:08Z)
ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer [58.49950218437718]
We present ReCoM, an efficient framework for generating high-fidelity and generalizable human body motions synchronized with speech. The core innovation lies in the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) core architecture. To enhance model robustness, we incorporate the proposed DER strategy, which equips the model with dual capabilities of noise resistance and cross-domain generalization.
arXiv Detail & Related papers (2025-03-27T16:39:40Z)
Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation [90.00687889213991]
Solving complex long-horizon robotic manipulation problems requires sophisticated high-level planning capabilities. Vision-language models (VLMs) pretrained on Internet data could in principle offer a framework for tackling such problems. In this paper, we introduce a novel test-time framework that enhances VLMs' physical reasoning capabilities for multi-stage manipulation tasks.
arXiv Detail & Related papers (2025-02-23T20:42:15Z)
Pre-Trained Video Generative Models as World Simulators [59.546627730477454]
We propose Dynamic World Simulation (DWS) to transform pre-trained video generative models into controllable world simulators. To achieve precise alignment between conditioned actions and generated visual changes, we introduce a lightweight, universal action-conditioned module. Experiments demonstrate that DWS can be versatilely applied to both diffusion and autoregressive transformer models.
arXiv Detail & Related papers (2025-02-10T14:49:09Z)
IRASim: A Fine-Grained World Model for Robot Manipulation [24.591694756757278]
We present IRASim, a novel world model capable of generating videos with fine-grained robot-object interaction details. We train a diffusion transformer and introduce a novel frame-level action-conditioning module within each transformer block to explicitly model and strengthen the action-frame alignment.
arXiv Detail & Related papers (2024-06-20T17:50:16Z)
Guided Decoding for Robot On-line Motion Generation and Adaption [44.959409835754634]
We present a novel motion generation approach for robot arms with high degrees of freedom in complex settings that can adapt online to obstacles or new via points. We train a transformer architecture, based on a conditional variational autoencoder, on a large dataset of simulated trajectories used as demonstrations. We show that our model successfully generates motion from different initial and target points and that it is capable of generating trajectories that navigate complex tasks across different robotic platforms.
arXiv Detail & Related papers (2024-03-22T14:32:27Z)
Language to Rewards for Robotic Skill Synthesis [37.21434094015743]
We introduce a new paradigm that harnesses large language models (LLMs) to define reward parameters that can be optimized to accomplish a variety of robotic tasks. Using reward as the intermediate interface generated by LLMs, we can effectively bridge the gap between high-level language instructions or corrections and low-level robot actions.
arXiv Detail & Related papers (2023-06-14T17:27:10Z)
Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation mask generated by internet-scale foundation models. Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning. Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z)
Interactive Character Control with Auto-Regressive Motion Diffusion Models [18.727066177880708]
We propose A-MDM (Auto-regressive Motion Diffusion Model) for real-time motion synthesis. Our conditional diffusion model takes an initial pose as input, and auto-regressively generates successive motion frames conditioned on the previous frame. We introduce a suite of techniques for incorporating interactive controls into A-MDM, such as task-oriented sampling, in-painting, and hierarchical reinforcement learning.
arXiv Detail & Related papers (2023-06-01T07:48:34Z)
Causal Policy Gradient for Whole-Body Mobile Manipulation [39.3461626518495]
We introduce Causal MoMa, a new reinforcement learning framework to train policies for typical MoMa tasks. We evaluate the performance of Causal MoMa on three types of simulated robots across different MoMa tasks.
arXiv Detail & Related papers (2023-05-04T23:23:47Z)
Nonprehensile Riemannian Motion Predictive Control [57.295751294224765]
We introduce a novel Real-to-Sim reward analysis technique to reliably imagine and predict the outcome of taking possible actions for a real robotic platform. We produce a closed-loop controller to reactively push objects in a continuous action space. We observe that RMPC is robust in cluttered as well as occluded environments and outperforms the baselines.
arXiv Detail & Related papers (2021-11-15T18:50:04Z)
V-MAO: Generative Modeling for Multi-Arm Manipulation of Articulated Objects [51.79035249464852]
We present a framework for learning multi-arm manipulation of articulated objects. Our framework includes a variational generative model that learns contact point distribution over object rigid parts for each robot arm.
arXiv Detail & Related papers (2021-11-07T02:31:09Z)
ReLMoGen: Leveraging Motion Generation in Reinforcement Learning for Mobile Manipulation [99.2543521972137]
ReLMoGen is a framework that combines a learned policy to predict subgoals and a motion generator to plan and execute the motion needed to reach these subgoals. Our method is benchmarked on a diverse set of seven robotics tasks in photo-realistic simulation environments. ReLMoGen shows outstanding transferability between different motion generators at test time, indicating a great potential to transfer to real robots.
arXiv Detail & Related papers (2020-08-18T08:05:15Z)

This list is automatically generated from the titles and abstracts of the papers on this site.