Reliable and Scalable Robot Policy Evaluation with Imperfect Simulators
- URL: http://arxiv.org/abs/2510.04354v1
- Date: Sun, 05 Oct 2025 20:37:53 GMT
- Title: Reliable and Scalable Robot Policy Evaluation with Imperfect Simulators
- Authors: Apurva Badithela, David Snyder, Lihan Zha, Joseph Mikhail, Matthew O'Kelly, Anushri Dixit, Anirudha Majumdar
- Abstract summary: SureSim is a framework to augment large-scale simulation with relatively small-scale real-world testing. We leverage non-asymptotic mean estimation algorithms to provide confidence intervals on mean policy performance. Our approach saves over 20-25% of hardware evaluation effort to achieve similar bounds on policy performance.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Rapid progress in imitation learning, foundation models, and large-scale datasets has led to robot manipulation policies that generalize to a wide range of tasks and environments. However, rigorous evaluation of these policies remains a challenge. In practice, robot policies are typically evaluated on a small number of hardware trials without any statistical assurances. We present SureSim, a framework to augment large-scale simulation with relatively small-scale real-world testing to provide reliable inferences on the real-world performance of a policy. Our key idea is to formalize the problem of combining real and simulation evaluations as a prediction-powered inference problem, in which a small number of paired real and simulation evaluations are used to rectify bias in large-scale simulation. We then leverage non-asymptotic mean estimation algorithms to provide confidence intervals on mean policy performance. Using physics-based simulation, we evaluate both diffusion policy and multi-task fine-tuned \(\pi_0\) on a joint distribution of objects and initial conditions, and find that our approach saves over \(20\)-\(25\%\) of hardware evaluation effort while achieving similar bounds on policy performance.
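As a rough illustration of the prediction-powered inference idea described in the abstract, the sketch below rectifies a large-scale simulation mean with a small paired real/sim sample and wraps it in a Hoeffding-style non-asymptotic interval. Variable names and the choice of Hoeffding bounds are illustrative assumptions; the paper's actual estimator and concentration bounds may differ.

```python
import numpy as np

def ppi_mean_ci(sim_large, real_paired, sim_paired, delta=0.05):
    """Prediction-powered estimate of mean real-world performance with a
    non-asymptotic (Hoeffding-style) confidence interval.

    sim_large:   simulator scores on a large trial set, values in [0, 1]
    real_paired: real-world scores from paired trials, values in [0, 1]
    sim_paired:  simulator scores on the same paired trials, in [0, 1]
    delta:       total failure probability of the interval
    """
    N, n = len(sim_large), len(real_paired)
    # Rectifier: mean bias of the simulator on the paired sample.
    rectifier = np.mean(np.asarray(real_paired) - np.asarray(sim_paired))
    estimate = float(np.mean(sim_large) + rectifier)
    # Hoeffding half-widths: sim scores have range 1, paired differences
    # have range 2; split delta equally between the two terms.
    w_sim = np.sqrt(np.log(4 / delta) / (2 * N))
    w_rect = 2 * np.sqrt(np.log(4 / delta) / (2 * n))
    half_width = float(w_sim + w_rect)
    return estimate, (estimate - half_width, estimate + half_width)
```

Because the large simulation sample shrinks the first term's width as \(1/\sqrt{N}\), the hardware budget only has to pay for the rectifier term, which is the intuition behind the reported savings.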
Related papers
- ARTIS: Agentic Risk-Aware Test-Time Scaling via Iterative Simulation [72.78362530982109]
ARTIS, Agentic Risk-Aware Test-Time Scaling via Iterative Simulation, is a framework that decouples exploration from commitment. We show that naive LLM-based simulators struggle to capture rare but high-impact failure modes. We introduce a risk-aware tool simulator that emphasizes fidelity on failure-inducing actions.
arXiv Detail & Related papers (2026-02-02T06:33:22Z)
- PolaRiS: Scalable Real-to-Sim Evaluations for Generalist Robot Policies [88.78188489161028]
We introduce Policy Evaluation and Environment Reconstruction in Simulation (PolaRiS), a scalable real-to-sim framework for high-fidelity simulated robot evaluation. We show that PolaRiS evaluations correlate much more strongly with real-world generalist policy performance than existing simulated benchmarks.
arXiv Detail & Related papers (2025-12-18T18:49:41Z)
- RobotArena $\infty$: Scalable Robot Benchmarking via Real-to-Sim Translation [47.79800816696372]
Real-world testing of manipulation policies is labor-intensive at scale and difficult to reproduce. Existing simulation benchmarks are similarly limited, as they train and test policies within the same synthetic domains. In this paper, we introduce a new benchmarking framework that overcomes these challenges by shifting VLA evaluation into large-scale simulated augmented environments.
arXiv Detail & Related papers (2025-10-27T17:41:38Z)
- Generalizable Domain Adaptation for Sim-and-Real Policy Co-Training [21.855770200309674]
We propose a unified sim-and-real co-training framework for learning generalizable manipulation policies. We show it can leverage abundant simulation data to achieve up to a 30% improvement in the real-world success rate.
arXiv Detail & Related papers (2025-09-23T04:32:53Z)
- Pseudo-Simulation for Autonomous Driving [66.1981253104508]
Existing evaluation paradigms for Autonomous Vehicles (AVs) face critical limitations. Real-world evaluation is often challenging due to safety concerns and a lack of realism. Open-loop evaluation relies on metrics that generally overlook compounding errors.
arXiv Detail & Related papers (2025-06-04T17:57:53Z)
- NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking [65.24988062003096]
We present NAVSIM, a framework for benchmarking vision-based driving policies.
Our simulation is non-reactive, i.e., the evaluated policy and environment do not influence each other.
NAVSIM enabled a new competition held at CVPR 2024, where 143 teams submitted 463 entries, resulting in several new insights.
arXiv Detail & Related papers (2024-06-21T17:59:02Z)
- Evaluating Real-World Robot Manipulation Policies in Simulation [91.55267186958892]
Control and visual disparities between real and simulated environments are key challenges for reliable simulated evaluation.
We propose approaches for mitigating these gaps without needing to craft full-fidelity digital twins of real-world environments.
We create SIMPLER, a collection of simulated environments for manipulation policy evaluation on common real robot setups.
arXiv Detail & Related papers (2024-05-09T17:30:16Z)
- How Generalizable Is My Behavior Cloning Policy? A Statistical Approach to Trustworthy Performance Evaluation [17.638831964639834]
Behavior cloning policies are increasingly successful at solving complex tasks by learning from human demonstrations.
We present a framework that provides a tight lower-bound on robot performance in an arbitrary environment.
In experiments we evaluate policies for visuomotor manipulation in both simulation and hardware.
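The lower-bound idea mentioned above can be sketched with a one-sided Hoeffding bound on binary success outcomes. This is an illustrative stand-in (the function name and the specific bound are assumptions; the paper may use tighter finite-sample bounds), but it shows the flavor of a non-asymptotic performance guarantee.

```python
import math

def success_rate_lower_bound(successes, trials, delta=0.05):
    """One-sided Hoeffding lower bound on the true success probability.

    With probability at least 1 - delta over the sampling of trials,
    the true mean success rate is at least the returned value.
    """
    p_hat = successes / trials
    return max(0.0, p_hat - math.sqrt(math.log(1 / delta) / (2 * trials)))
```

For example, 45 successes in 50 trials yields a 95%-confidence lower bound of roughly 0.73, well below the 0.90 empirical rate, which is why small hardware evaluations alone give weak assurances.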
arXiv Detail & Related papers (2024-05-08T22:00:35Z)
- Marginalized Importance Sampling for Off-Environment Policy Evaluation [13.824507564510503]
Reinforcement Learning (RL) methods are typically sample-inefficient, making it challenging to train and deploy RL policies on real-world robots.
This paper proposes a new approach to evaluate the real-world performance of agent policies prior to deploying them in the real world.
Our approach incorporates a simulator along with real-world offline data to evaluate the performance of any policy.
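A minimal sketch of the reweighting idea behind such off-environment evaluation: simulator rollout returns are importance-weighted by a density ratio between real and simulated dynamics. The function name and per-rollout density inputs are hypothetical; in the paper this ratio would be estimated from real-world offline data rather than given.

```python
import numpy as np

def off_env_value_estimate(sim_returns, real_density, sim_density):
    """Self-normalized importance-sampling estimate of real-world return
    from simulator rollouts.

    real_density / sim_density: per-rollout likelihoods of each trajectory
    under the real and simulated dynamics (hypothetical placeholders).
    """
    w = np.asarray(real_density, dtype=float) / np.asarray(sim_density, dtype=float)
    w = w / w.mean()  # self-normalize to reduce variance
    return float(np.mean(w * np.asarray(sim_returns, dtype=float)))
```

When the two densities agree everywhere, the weights are uniform and the estimate reduces to the plain simulator mean; rollouts that are more likely under real dynamics are upweighted.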
arXiv Detail & Related papers (2023-09-04T20:52:04Z)
- Robust Visual Sim-to-Real Transfer for Robotic Manipulation [79.66851068682779]
Learning visuomotor policies in simulation is much safer and cheaper than in the real world.
However, due to discrepancies between the simulated and real data, simulator-trained policies often fail when transferred to real robots.
One common approach to bridging the visual sim-to-real domain gap is domain randomization (DR).
arXiv Detail & Related papers (2023-07-28T05:47:24Z)
- Reactive Long Horizon Task Execution via Visual Skill and Precondition Models [59.76233967614774]
We describe an approach for sim-to-real training that can accomplish unseen robotic tasks using models learned in simulation to ground components of a simple task planner.
We show an increase in success rate from 91.6% to 98% in simulation and from 10% to 80% in the real world compared with naive baselines.
arXiv Detail & Related papers (2020-11-17T15:24:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.