Reliable and Scalable Robot Policy Evaluation with Imperfect Simulators
- URL: http://arxiv.org/abs/2510.04354v1
- Date: Sun, 05 Oct 2025 20:37:53 GMT
- Title: Reliable and Scalable Robot Policy Evaluation with Imperfect Simulators
- Authors: Apurva Badithela, David Snyder, Lihan Zha, Joseph Mikhail, Matthew O'Kelly, Anushri Dixit, Anirudha Majumdar
- Abstract summary: SureSim is a framework to augment large-scale simulation with relatively small-scale real-world testing. We leverage non-asymptotic mean estimation algorithms to provide confidence intervals on mean policy performance. Our approach saves over 20-25% of hardware evaluation effort to achieve similar bounds on policy performance.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Rapid progress in imitation learning, foundation models, and large-scale datasets has led to robot manipulation policies that generalize to a wide range of tasks and environments. However, rigorous evaluation of these policies remains a challenge. In practice, robot policies are typically evaluated on a small number of hardware trials without any statistical assurances. We present SureSim, a framework to augment large-scale simulation with relatively small-scale real-world testing to provide reliable inferences on the real-world performance of a policy. Our key idea is to formalize the problem of combining real and simulation evaluations as a prediction-powered inference problem, in which a small number of paired real and simulation evaluations are used to rectify bias in large-scale simulation. We then leverage non-asymptotic mean estimation algorithms to provide confidence intervals on mean policy performance. Using physics-based simulation, we evaluate both diffusion policy and multi-task fine-tuned \(\pi_0\) on a joint distribution of objects and initial conditions, and find that our approach saves over \(20\)-\(25\%\) of hardware evaluation effort while achieving similar bounds on policy performance.
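As a rough illustration of the prediction-powered inference idea described in the abstract, the sketch below rectifies a large-scale simulation mean with a small paired real/sim sample and wraps it in a Hoeffding-style non-asymptotic interval. Variable names and the choice of Hoeffding bounds are illustrative assumptions; the paper's actual estimator and concentration bounds may differ.

```python
import numpy as np

def ppi_mean_ci(sim_large, real_paired, sim_paired, delta=0.05):
    """Prediction-powered estimate of mean real-world performance with a
    non-asymptotic (Hoeffding-style) confidence interval.

    sim_large:   simulator scores on a large trial set, values in [0, 1]
    real_paired: real-world scores from paired trials, values in [0, 1]
    sim_paired:  simulator scores on the same paired trials, in [0, 1]
    delta:       total failure probability of the interval
    """
    N, n = len(sim_large), len(real_paired)
    # Rectifier: mean bias of the simulator on the paired sample.
    rectifier = np.mean(np.asarray(real_paired) - np.asarray(sim_paired))
    estimate = float(np.mean(sim_large) + rectifier)
    # Hoeffding half-widths: sim scores have range 1, paired differences
    # have range 2; split delta equally between the two terms.
    w_sim = np.sqrt(np.log(4 / delta) / (2 * N))
    w_rect = 2 * np.sqrt(np.log(4 / delta) / (2 * n))
    half_width = float(w_sim + w_rect)
    return estimate, (estimate - half_width, estimate + half_width)
```

Because the large simulation sample shrinks the first term's width as \(1/\sqrt{N}\), the hardware budget only has to pay for the rectifier term, which is the intuition behind the reported savings.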
Related papers
- ARTIS: Agentic Risk-Aware Test-Time Scaling via Iterative Simulation [72.78362530982109]
ARTIS, Agentic Risk-Aware Test-Time Scaling via Iterative Simulation, is a framework that decouples exploration from commitment. We show that naive LLM-based simulators struggle to capture rare but high-impact failure modes. We introduce a risk-aware tool simulator that emphasizes fidelity on failure-inducing actions.
arXiv Detail & Related papers (2026-02-02T06:33:22Z)
- PolaRiS: Scalable Real-to-Sim Evaluations for Generalist Robot Policies [88.78188489161028]
We introduce Policy Evaluation and Environment Reconstruction in Simulation (PolaRiS), a scalable real-to-sim framework for high-fidelity simulated robot evaluation. We show that PolaRiS evaluations correlate much more strongly with real-world generalist policy performance than existing simulated benchmarks.
arXiv Detail & Related papers (2025-12-18T18:49:41Z)
- RobotArena $\infty$: Scalable Robot Benchmarking via Real-to-Sim Translation [47.79800816696372]
Real-world testing of manipulation policies is labor-intensive at scale and difficult to reproduce. Existing simulation benchmarks are similarly limited, as they train and test policies within the same synthetic domains. In this paper, we introduce a new benchmarking framework that overcomes these challenges by shifting VLA evaluation into large-scale simulated augmented environments.
arXiv Detail & Related papers (2025-10-27T17:41:38Z)
- Generalizable Domain Adaptation for Sim-and-Real Policy Co-Training [21.855770200309674]
We propose a unified sim-and-real co-training framework for learning generalizable manipulation policies. We show it can leverage abundant simulation data to achieve up to a 30% improvement in the real-world success rate.
arXiv Detail & Related papers (2025-09-23T04:32:53Z)
- Pseudo-Simulation for Autonomous Driving [66.1981253104508]
Existing evaluation paradigms for Autonomous Vehicles (AVs) face critical limitations. Real-world evaluation is often challenging due to safety concerns and a lack of realism. Open-loop evaluation relies on metrics that generally overlook compounding errors.
arXiv Detail & Related papers (2025-06-04T17:57:53Z)
- NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking [65.24988062003096]
We present NAVSIM, a framework for benchmarking vision-based driving policies.
Our simulation is non-reactive, i.e., the evaluated policy and environment do not influence each other.
NAVSIM enabled a new competition held at CVPR 2024, where 143 teams submitted 463 entries, resulting in several new insights.
arXiv Detail & Related papers (2024-06-21T17:59:02Z)
- Evaluating Real-World Robot Manipulation Policies in Simulation [91.55267186958892]
Control and visual disparities between real and simulated environments are key challenges for reliable simulated evaluation.
We propose approaches for mitigating these gaps without needing to craft full-fidelity digital twins of real-world environments.
We create SIMPLER, a collection of simulated environments for manipulation policy evaluation on common real robot setups.
arXiv Detail & Related papers (2024-05-09T17:30:16Z)
- How Generalizable Is My Behavior Cloning Policy? A Statistical Approach to Trustworthy Performance Evaluation [17.638831964639834]
Behavior cloning policies are increasingly successful at solving complex tasks by learning from human demonstrations.
We present a framework that provides a tight lower-bound on robot performance in an arbitrary environment.
In experiments we evaluate policies for visuomotor manipulation in both simulation and hardware.
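The lower-bound idea mentioned above can be sketched with a one-sided Hoeffding bound on binary success outcomes. This is an illustrative stand-in (the function name and the specific bound are assumptions; the paper may use tighter finite-sample bounds), but it shows the flavor of a non-asymptotic performance guarantee.

```python
import math

def success_rate_lower_bound(successes, trials, delta=0.05):
    """One-sided Hoeffding lower bound on the true success probability.

    With probability at least 1 - delta over the sampling of trials,
    the true mean success rate is at least the returned value.
    """
    p_hat = successes / trials
    return max(0.0, p_hat - math.sqrt(math.log(1 / delta) / (2 * trials)))
```

For example, 45 successes in 50 trials yields a 95%-confidence lower bound of roughly 0.73, well below the 0.90 empirical rate, which is why small hardware evaluations alone give weak assurances.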
arXiv Detail & Related papers (2024-05-08T22:00:35Z)
- Marginalized Importance Sampling for Off-Environment Policy Evaluation [13.824507564510503]
Reinforcement Learning (RL) methods are typically sample-inefficient, making it challenging to train and deploy RL policies on real-world robots.
This paper proposes a new approach to evaluate the real-world performance of agent policies prior to deploying them in the real world.
Our approach incorporates a simulator along with real-world offline data to evaluate the performance of any policy.
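A minimal sketch of the reweighting idea behind such off-environment evaluation: simulator rollout returns are importance-weighted by a density ratio between real and simulated dynamics. The function name and per-rollout density inputs are hypothetical; in the paper this ratio would be estimated from real-world offline data rather than given.

```python
import numpy as np

def off_env_value_estimate(sim_returns, real_density, sim_density):
    """Self-normalized importance-sampling estimate of real-world return
    from simulator rollouts.

    real_density / sim_density: per-rollout likelihoods of each trajectory
    under the real and simulated dynamics (hypothetical placeholders).
    """
    w = np.asarray(real_density, dtype=float) / np.asarray(sim_density, dtype=float)
    w = w / w.mean()  # self-normalize to reduce variance
    return float(np.mean(w * np.asarray(sim_returns, dtype=float)))
```

When the two densities agree everywhere, the weights are uniform and the estimate reduces to the plain simulator mean; rollouts that are more likely under real dynamics are upweighted.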
arXiv Detail & Related papers (2023-09-04T20:52:04Z)
- Robust Visual Sim-to-Real Transfer for Robotic Manipulation [79.66851068682779]
Learning visuomotor policies in simulation is much safer and cheaper than in the real world.
However, due to discrepancies between the simulated and real data, simulator-trained policies often fail when transferred to real robots.
One common approach to bridging the visual sim-to-real domain gap is domain randomization (DR).
arXiv Detail & Related papers (2023-07-28T05:47:24Z)
- Reactive Long Horizon Task Execution via Visual Skill and Precondition Models [59.76233967614774]
We describe an approach for sim-to-real training that can accomplish unseen robotic tasks using models learned in simulation to ground components of a simple task planner.
We show an increase in success rate from 91.6% to 98% in simulation and from 10% to 80% in the real world compared with naive baselines.
arXiv Detail & Related papers (2020-11-17T15:24:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.