Related papers: AutoEval: Autonomous Evaluation of Generalist Robot Manipulation Policies in the Real World

AutoEval: Autonomous Evaluation of Generalist Robot Manipulation Policies in the Real World

URL: http://arxiv.org/abs/2503.24278v2
Date: Wed, 02 Apr 2025 20:24:22 GMT
Title: AutoEval: Autonomous Evaluation of Generalist Robot Manipulation Policies in the Real World
Authors: Zhiyuan Zhou, Pranav Atreya, You Liang Tan, Karl Pertsch, Sergey Levine,
Abstract summary: AutoEval is a system to autonomously evaluate robot policies around the clock with minimal human intervention.<n>We show that AutoEval can nearly fully eliminate human involvement in the evaluation process.<n>We provide public access to multiple AutoEval scenes in the popular BridgeData robot setup with WidowX robot arms.
Score: 45.70178627573973
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Scalable and reproducible policy evaluation has been a long-standing challenge in robot learning. Evaluations are critical to assess progress and build better policies, but evaluation in the real world, especially at a scale that would provide statistically reliable results, is costly in terms of human time and hard to obtain. Evaluation of increasingly generalist robot policies requires an increasingly diverse repertoire of evaluation environments, making the evaluation bottleneck even more pronounced. To make real-world evaluation of robotic policies more practical, we propose AutoEval, a system to autonomously evaluate generalist robot policies around the clock with minimal human intervention. Users interact with AutoEval by submitting evaluation jobs to the AutoEval queue, much like how software jobs are submitted with a cluster scheduling system, and AutoEval will schedule the policies for evaluation within a framework supplying automatic success detection and automatic scene resets. We show that AutoEval can nearly fully eliminate human involvement in the evaluation process, permitting around the clock evaluations, and the evaluation results correspond closely to ground truth evaluations conducted by hand. To facilitate the evaluation of generalist policies in the robotics community, we provide public access to multiple AutoEval scenes in the popular BridgeData robot setup with WidowX robot arms. In the future, we hope that AutoEval scenes can be set up across institutions to form a diverse and distributed evaluation network.

Related papers

PolaRiS: Scalable Real-to-Sim Evaluations for Generalist Robot Policies [88.78188489161028]
We introduce Policy Evaluation and Environment Reconstruction in Simulation (PolaRiS)<n>PolaRiS is a scalable real-to-sim framework for high-fidelity simulated robot evaluation.<n>We show that PolaRiS evaluations provide a much stronger correlation to real world generalist policy performance than existing simulated benchmarks.
arXiv Detail & Related papers (2025-12-18T18:49:41Z)
RobotArena $\infty$: Scalable Robot Benchmarking via Real-to-Sim Translation [47.79800816696372]
Real-world testing of manipulation policies is labor-intensive at scale, and difficult to reproduce.<n>Existing simulation benchmarks are similarly limited, as they train and test policies within the same synthetic domains.<n>In this paper, we introduce a new benchmarking framework that overcomes these challenges by shifting VLA evaluation into large-scale simulated augmented environments.
arXiv Detail & Related papers (2025-10-27T17:41:38Z)
RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies [125.35572632340602]
We propose RoboArena, a new approach for scalable evaluation of generalist robot policies in the real world.<n>Instead of standardizing evaluations around fixed tasks, environments, or locations, we propose to crowd-source evaluations across a distributed network of evaluators.<n>We instantiate our approach across a network of evaluators at seven academic institutions using the DROID robot platform.
arXiv Detail & Related papers (2025-06-22T18:13:31Z)
Evaluating Robot Policies in a World Model [54.874926065292904]
We investigate World-model-based Policy Evaluation (WPE)<n>WPE achieves high fidelity in mimicing robot arm movements as in real videos.<n>We show that WPE can serve as a starting point for evaluating robot policies before real-world deployment.
arXiv Detail & Related papers (2025-05-31T15:51:56Z)
WorldEval: World Model as Real-World Robot Policies Evaluator [13.899692171641066]
A key challenge is generating accurate policy videos from world models that faithfully reflect the robot actions.<n>We propose Policy2Vec, a simple yet effective approach to turn a video generation model into a world simulator that follows latent action to generate the robot video.<n>We then introduce WorldEval, an automated pipeline designed to evaluate real-world robot policies entirely online.
arXiv Detail & Related papers (2025-05-25T07:41:39Z)
AutoEval: A Practical Framework for Autonomous Evaluation of Mobile Agents [5.995751996623217]
We propose AutoEval, an evaluation framework which tests mobile agents without any manual effort.<n>Our approach designs a UI state change representation which can be used to automatically generate task reward signals.<n>We also evaluate state-of-the-art mobile agents using our framework, providing insights into their performance and limitations.
arXiv Detail & Related papers (2025-03-04T08:44:30Z)
Exploring the Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics [68.36528819227641]
This paper systematically quantifies the robustness of VLA-based robotic systems.<n>We introduce two untargeted attack objectives that leverage spatial foundations to destabilize robotic actions, and a targeted attack objective that manipulates the robotic trajectory.<n>We design an adversarial patch generation approach that places a small, colorful patch within the camera's view, effectively executing the attack in both digital and physical environments.
arXiv Detail & Related papers (2024-11-18T01:52:20Z)
WorldSimBench: Towards Video Generation Models as World Simulators [79.69709361730865]
We classify the functionalities of predictive models into a hierarchy and take the first step in evaluating World Simulators by proposing a dual evaluation framework called WorldSimBench. WorldSimBench includes Explicit Perceptual Evaluation and Implicit Manipulative Evaluation, encompassing human preference assessments from the visual perspective and action-level evaluations in embodied tasks. Our comprehensive evaluation offers key insights that can drive further innovation in video generation models, positioning World Simulators as a pivotal advancement toward embodied artificial intelligence.
arXiv Detail & Related papers (2024-10-23T17:56:11Z)
Generalized Robot Learning Framework [10.03174544844559]
We present a low-cost robot learning framework that is both easily reproducible and transferable to various robots and environments. We demonstrate that deployable imitation learning can be successfully applied even to industrial-grade robots.
arXiv Detail & Related papers (2024-09-18T15:34:31Z)
Evaluating Real-World Robot Manipulation Policies in Simulation [91.55267186958892]
Control and visual disparities between real and simulated environments are key challenges for reliable simulated evaluation. We propose approaches for mitigating these gaps without needing to craft full-fidelity digital twins of real-world environments. We create SIMPLER, a collection of simulated environments for manipulation policy evaluation on common real robot setups.
arXiv Detail & Related papers (2024-05-09T17:30:16Z)
Autonomous and Human-Driven Vehicles Interacting in a Roundabout: A Quantitative and Qualitative Evaluation [34.67306374722473]
We learn a policy to minimize traffic jams and to minimize pollution in a roundabout in Milan, Italy. We qualitatively evaluate the learned policy using a cutting-edge cockpit to assess its performance in near-real-world conditions. Our findings show that human-driven vehicles benefit from optimizing AVs dynamics.
arXiv Detail & Related papers (2023-09-15T09:02:16Z)
From Adversarial Arms Race to Model-centric Evaluation: Motivating a Unified Automatic Robustness Evaluation Framework [91.94389491920309]
Textual adversarial attacks can discover models' weaknesses by adding semantic-preserved but misleading perturbations to the inputs. The existing practice of robustness evaluation may exhibit issues of incomprehensive evaluation, impractical evaluation protocol, and invalid adversarial samples. We set up a unified automatic robustness evaluation framework, shifting towards model-centric evaluation to exploit the advantages of adversarial attacks.
arXiv Detail & Related papers (2023-05-29T14:55:20Z)
Crowdsourcing Evaluation of Saliency-based XAI Methods [18.18238526746074]
We propose a new human-based evaluation scheme using crowdsourcing to evaluate XAI methods. Our method is inspired by a human computation game, "Peek-a-boom" We evaluate the saliency maps of various XAI methods on two datasets with automated and crowd-based evaluation schemes.
arXiv Detail & Related papers (2021-06-27T17:37:53Z)
Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning. ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation. Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.