Human-in-the-loop Online Rejection Sampling for Robotic Manipulation
- URL: http://arxiv.org/abs/2510.26406v1
- Date: Thu, 30 Oct 2025 11:53:08 GMT
- Title: Human-in-the-loop Online Rejection Sampling for Robotic Manipulation
- Authors: Guanxing Lu, Rui Zhao, Haitao Lin, He Zhang, Yansong Tang,
- Abstract summary: Hi-ORS stabilizes value estimation by filtering out negatively rewarded samples during online fine-tuning.<n>Hi-ORS fine-tunes a pi-base policy to master contact-rich manipulation in just 1.5 hours of real-world training.
- Score: 55.99788088622936
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning (RL) is widely used to produce robust robotic manipulation policies, but fine-tuning vision-language-action (VLA) models with RL can be unstable due to inaccurate value estimates and sparse supervision at intermediate steps. In contrast, imitation learning (IL) is easy to train but often underperforms due to its offline nature. In this paper, we propose Hi-ORS, a simple yet effective post-training method that utilizes rejection sampling to achieve both training stability and high robustness. Hi-ORS stabilizes value estimation by filtering out negatively rewarded samples during online fine-tuning, and adopts a reward-weighted supervised training objective to provide dense intermediate-step supervision. For systematic study, we develop an asynchronous inference-training framework that supports flexible online human-in-the-loop corrections, which serve as explicit guidance for learning error-recovery behaviors. Across three real-world tasks and two embodiments, Hi-ORS fine-tunes a pi-base policy to master contact-rich manipulation in just 1.5 hours of real-world training, outperforming RL and IL baselines by a substantial margin in both effectiveness and efficiency. Notably, the fine-tuned policy exhibits strong test-time scalability by reliably executing complex error-recovery behaviors to achieve better performance.
Related papers
- Actor-Curator: Co-adaptive Curriculum Learning via Policy-Improvement Bandits for RL Post-Training [63.34044358216334]
ACTOR-CURATOR is a scalable and fully automated curriculum learning framework for reinforcement learning post-training of large language models.<n> Empirically, ACTOR-CURATOR consistently outperforms uniform sampling and strong curriculum baselines.
arXiv Detail & Related papers (2026-02-24T04:19:48Z) - AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering [52.67783579040657]
AceGRPO is a machine learning system that prioritizes tasks at the agent's learning frontier to maximize learning efficiency.<n>Our trained Ace-30B model achieves a 100% valid submission rate on MLE-Bench-Lite, approaches the performance of proprietary frontier models, and outperforms larger open-source baselines.
arXiv Detail & Related papers (2026-02-08T10:55:03Z) - CRL-VLA: Continual Vision-Language-Action Learning [40.18167835795084]
Continual Reinforcement Learning is a promising pathway for deploying VLA models in lifelong robotic scenarios.<n>We introduce CRL-VLA, a framework for continual post-training of VLA models with rigorous theoretical bounds.<n>We derive a unified performance bound linking the stability-plasticity trade-off to goal-conditioned advantage magnitude, scaled by policy divergence.
arXiv Detail & Related papers (2026-02-03T12:09:53Z) - Failure-Aware RL: Reliable Offline-to-Online Reinforcement Learning with Self-Recovery for Real-World Manipulation [48.26705293834693]
Failure-Aware Offline-to-Online Reinforcement Learning (FARL) is a new paradigm minimizing failures during real-world reinforcement learning.<n>We propose an algorithm that integrates a world-model-based safety critic and a recovery policy trained offline to prevent failures during online exploration.
arXiv Detail & Related papers (2026-01-12T18:53:11Z) - RobustVLA: Robustness-Aware Reinforcement Post-Training for Vision-Language-Action Models [33.503927352666096]
Vision-Language-Action (VLA) models fail to generalize reliably in out-of-distribution deployments.<n>We introduce RobustVLA, a lightweight online RL post-training method designed to explicitly enhance the resilience of VLA models.<n>Our results highlight the importance of robustness-aware RL post-training as a key step toward improving the principled reliability and robustness of VLA models.
arXiv Detail & Related papers (2025-11-03T08:30:48Z) - Adversarial Fine-tuning in Offline-to-Online Reinforcement Learning for Robust Robot Control [12.961180148172199]
This study introduces an offline-to-online framework that trains policies on clean data and then performs adversarial fine-tuning.<n>A performance-aware curriculum adjusts the perturbation probability during training via an exponential-moving-average signal.<n>Experiments on continuous-control locomotion tasks demonstrate that the proposed method consistently improves robustness over offline-only baselines.
arXiv Detail & Related papers (2025-10-15T09:45:24Z) - Agentic Reinforcement Learning with Implicit Step Rewards [92.26560379363492]
Large language models (LLMs) are increasingly developed as autonomous agents using reinforcement learning (agentic RL)<n>We introduce implicit step rewards for agentic RL (iStar), a general credit-assignment strategy that integrates seamlessly with standard RL algorithms.<n>We evaluate our method on three challenging agent benchmarks, including WebShop and VisualSokoban, as well as open-ended social interactions with unverifiable rewards in SOTOPIA.
arXiv Detail & Related papers (2025-09-23T16:15:42Z) - Towards High Data Efficiency in Reinforcement Learning with Verifiable Reward [54.708851958671794]
We propose a Data-Efficient Policy Optimization pipeline that combines optimized strategies for both offline and online data selection.<n>In offline phase, we curate a high-quality subset of training samples based on diversity, influence, and appropriate difficulty.<n>During online RLVR training, we introduce a sample-level explorability metric to dynamically filter samples with low exploration potential.
arXiv Detail & Related papers (2025-09-01T10:04:20Z) - IN-RIL: Interleaved Reinforcement and Imitation Learning for Policy Fine-Tuning [25.642307880136332]
Imitation learning (IL) and reinforcement learning (RL) each offer distinct advantages for robotics policy learning.<n>While existing robot learning approaches using IL-based pre-training followed by RL-based fine-tuning are promising, this two-step learning paradigm often suffers from instability and poor sample efficiency during the RL fine-tuning phase.<n>In this work, we introduce IN-RIL, INterleaved Reinforcement learning and Imitation Learning, for policy fine-tuning.
arXiv Detail & Related papers (2025-05-15T16:01:21Z) - Offline Robotic World Model: Learning Robotic Policies without a Physics Simulator [50.191655141020505]
Reinforcement Learning (RL) has demonstrated impressive capabilities in robotic control but remains challenging due to high sample complexity, safety concerns, and the sim-to-real gap.<n>We introduce Offline Robotic World Model (RWM-O), a model-based approach that explicitly estimates uncertainty to improve policy learning without reliance on a physics simulator.
arXiv Detail & Related papers (2025-04-23T12:58:15Z) - Learning from Suboptimal Data in Continuous Control via Auto-Regressive Soft Q-Network [23.481553466650453]
We propose Auto-Regressive Soft Q-learning (ARSQ), a value-based RL algorithm that models Q-values in a coarse-to-fine, auto-regressive manner.<n>ARSQ decomposes the continuous action space into discrete spaces in a coarse-to-fine hierarchy, enhancing sample efficiency for fine-grained continuous control tasks.<n>It auto-regressively predicts dimensional action advantages within each decision step, enabling more effective decision-making in continuous control tasks.
arXiv Detail & Related papers (2025-02-01T03:04:53Z) - Training Language Models to Self-Correct via Reinforcement Learning [98.35197671595343]
Self-correction has been found to be largely ineffective in modern large language models (LLMs)
We develop a multi-turn online reinforcement learning approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data.
We find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.
arXiv Detail & Related papers (2024-09-19T17:16:21Z) - Robust Deep Reinforcement Learning with Adaptive Adversarial Perturbations in Action Space [3.639580365066386]
We propose an adaptive adversarial coefficient framework to adjust the effect of the adversarial perturbation during training.
The appealing feature of our method is that it is simple to deploy in real-world applications and does not require accessing the simulator in advance.
The experiments in MuJoCo show that our method can improve the training stability and learn a robust policy when migrated to different test environments.
arXiv Detail & Related papers (2024-05-20T12:31:11Z) - Aligning Language Models with Offline Learning from Human Feedback [5.539080592071948]
We propose an offline learning from human feedback framework to align language models without interacting with environments.
Specifically, we explore filtering alignment (FA), reward-weighted regression (RWR), and conditional alignment (CA) to align language models to human preferences.
arXiv Detail & Related papers (2023-08-23T10:41:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.