FuguReport

Referring-Aware Visuomotor Policy Learning for Closed-Loop Manipulation

Authors Jiahua Ma, Yiran Qin, Xin Wen, Yixiong Li, Yuyu Sun, Yulan Guo, Liang Lin, Ruimao Zhang
Affiliations Sun Yat-Sen University / University of Oxford
Categories Method / Visuomotor Control / Policy learning with referential input, Application / Robotic Manipulation / Closed-loop manipulation tasks, Evaluation / Imitation Learning / Training with perturbed demonstrations
License CC BY 4.0

Abstract Overview

This paper introduces ReV (Referring-Aware Visuomotor Policy), a closed-loop imitation learning framework for robotic manipulation that incorporates sparse 3D referring points provided by a human or a high-level planner during execution. The architecture uses coupled diffusion heads: a Global Diffusion Head (GDH) generates temporally sparse but globally consistent action anchors, while a Local Diffusion Head (LDH) interpolates them into fine-grained executable trajectories via a learnable, temporal-position-dependent strategy. A temporal-position prediction module localizes where the referring point falls along the trajectory timeline, and a masked trajectory-steering mechanism enforces passage near the referred point during denoising. Training relies solely on expert demonstrations augmented with targeted perturbations (seventh-order polynomial spline blending of perturbed actions), requiring no additional correction datasets or post-hoc fine-tuning.
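The perturbation augmentation can be illustrated with a small sketch. The summary states only that perturbed actions are blended in with a seventh-order polynomial spline. The bump-shaped weighting, the function names `smoothstep7` and `blend_perturbation`, and the `center` and `half_width` parameters below are assumptions made for illustration, not the authors' implementation.

```python
def smoothstep7(t: float) -> float:
    """Seventh-order smoothstep on [0, 1].

    Equals 0 at t <= 0 and 1 at t >= 1; its first three derivatives
    vanish at both ends, so the blend starts and stops smoothly.
    """
    t = min(1.0, max(0.0, t))
    return t**4 * (35.0 - 84.0 * t + 70.0 * t**2 - 20.0 * t**3)


def blend_perturbation(expert, offset, center, half_width):
    """Displace `expert` by `offset` at index `center` (hypothetical scheme).

    The displacement tapers to zero `half_width` steps away from `center`,
    so the perturbed trajectory rejoins the expert demonstration.
    """
    perturbed = []
    for i, point in enumerate(expert):
        # Weight is 1 at the centre and 0 at or beyond half_width steps away.
        w = smoothstep7(1.0 - abs(i - center) / half_width)
        perturbed.append(tuple(p + w * o for p, o in zip(point, offset)))
    return perturbed


# Toy demonstration: a straight line along x, pushed sideways around step 5.
expert = [(0.1 * i, 0.0, 0.0) for i in range(11)]
perturbed = blend_perturbation(expert, offset=(0.0, 0.05, 0.0), center=5, half_width=3)
```

In this toy setup the trajectory leaves the expert path, reaches the full offset at the chosen step, and returns to the demonstration, which is the kind of deviate-and-recover sample such augmentation is meant to provide.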

Novelty

The primary novelty is a referring-aware imitation learning framework that enables a manipulation policy to react online to sparse external 3D via-points without requiring recovery data or post-hoc fine-tuning. Architecturally, the paper introduces coupled diffusion heads (GDH for sparse global anchors, LDH for temporal-position-conditioned dense interpolation) combined with a temporal-position prediction module and masked trajectory-steering strategy, enabling coarse-to-fine trajectory replanning under point-level spatial guidance.
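The coarse-to-fine data flow can be sketched without any learned components. In the sketch below, the sparse anchors stand in for GDH output, a predicted temporal position `tau` in [0, 1] selects which anchor to overwrite with the referring point, and plain linear interpolation stands in for the learned LDH. All names (`pin_anchor`, `densify`, `tau`, `steps_per_segment`) are hypothetical, and the deterministic placeholders only mimic the structure of the pipeline, not the diffusion-based method itself.

```python
def pin_anchor(anchors, referring_point, tau):
    """Overwrite the anchor nearest to normalized time `tau` with the referring point."""
    index = round(tau * (len(anchors) - 1))
    steered = list(anchors)
    steered[index] = tuple(referring_point)
    return steered, index


def densify(anchors, steps_per_segment):
    """Linearly interpolate between consecutive anchors.

    Placeholder for a learned local head: each segment is expanded into
    `steps_per_segment` points, and the final anchor is appended once.
    """
    dense = []
    for start, end in zip(anchors, anchors[1:]):
        for k in range(steps_per_segment):
            a = k / steps_per_segment
            dense.append(tuple((1 - a) * s + a * e for s, e in zip(start, end)))
    dense.append(tuple(anchors[-1]))
    return dense


# Five sparse anchors along x; the referring point sits off the nominal path.
anchors = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
steered, index = pin_anchor(anchors, referring_point=(2.0, 0.5, 0.0), tau=0.5)
dense = densify(steered, steps_per_segment=4)
```

Pinning one sparse anchor and regenerating only the dense segments around it is what makes the replanning coarse-to-fine: the global plan changes at a single point, and the local stage smooths the transition.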

Results

On four modified simulated via-point tasks, ReV achieves 100% region penetration on every task, with task success rates of 91%, 100%, 50%, and 92%. It substantially outperforms the baselines (ACT, DP3, CDP, OCTO, MPD), which largely fail to follow the provided referring points. The coupled diffusion heads architecture also improves task success rates over ACT, DP3, and CDP on 13 tasks drawn from the Adroit, DexArt, MetaWorld, and RoboFactory benchmarks. In five real-world referring-aware tasks, ReV passes through the referred region in 30 of 30 trials on every task, with task-success counts of 20/30, 21/30, 15/30, 18/30, and 12/30, outperforming the ACT and DP baselines.
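The two reported quantities, region penetration and task success, can be made concrete with a short sketch. The summary does not define region penetration. The check below assumes a trial counts as penetrating when any trajectory point lies within a fixed radius of the referring point; `penetrates` and `radius` are hypothetical names. The real-world success counts are taken from the results above and converted to percentages.

```python
import math


def penetrates(trajectory, referring_point, radius):
    """True if any point of the trajectory lies within `radius` of the referring point.

    Assumed definition for illustration; the paper may use a different criterion.
    """
    return any(math.dist(p, referring_point) <= radius for p in trajectory)


def rate(flags):
    """Fraction of trials whose flag is True."""
    return sum(flags) / len(flags)


# Toy trajectory passing close to one referring point and far from another.
trajectory = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0)]
near = penetrates(trajectory, (0.5, 0.02, 0.0), radius=0.05)
far = penetrates(trajectory, (0.5, 0.20, 0.0), radius=0.05)

# Real-world task-success counts reported above, each out of 30 trials.
real_world_success = [20, 21, 15, 18, 12]
success_percent = [round(100 * count / 30, 1) for count in real_world_success]
```

Reporting penetration separately from success matters here: a policy can pass through the referred region on every trial and still fail the downstream task, which is the pattern the real-world counts show.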

Key Points

  1. ReV uses a temporal-position prediction module and a masked trajectory-steering strategy to incorporate sparse external 3D referring points into closed-loop manipulation trajectories, achieving 100% region penetration on all four simulated via-point tasks with task success rates between 50% and 100%.
  2. The coupled diffusion heads architecture separates long-horizon global planning (GDH for sparse action anchors) from short-horizon dense trajectory generation (LDH with temporal-position-conditioned interpolation), improving task success rates over ACT, DP3, and CDP on 13 simulated tasks from the Adroit, DexArt, MetaWorld, and RoboFactory benchmarks.
  3. The framework is trained solely from expert demonstrations augmented with targeted perturbations, requiring no correction datasets or fine-tuning, and demonstrates robustness to out-of-distribution referring points with graceful performance degradation as deviation increases.

This page was created using generative AI such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successor versions. No guarantee can be made regarding its contents.