A2Eval: Agentic and Automated Evaluation for Embodied Brain
- URL: http://arxiv.org/abs/2602.01640v1
- Date: Mon, 02 Feb 2026 04:55:27 GMT
- Title: A2Eval: Agentic and Automated Evaluation for Embodied Brain
- Authors: Shuai Zhang, Jiayu Hu, Zijie Chen, Zeyuan Ding, Yi Zhang, Yingji Zhang, Ziyi Zhou, Junwei Liao, Shengjie Zhou, Yong Dai, Zhenzhong Lan, Xiaozhu Ju
- Abstract summary: Current embodied VLM evaluation relies on static, expert-defined, manually annotated benchmarks. Agentic Automatic Evaluation (A2Eval) is the first agentic framework that automates benchmark curation and evaluation through two collaborative agents. Evaluated across 10 benchmarks and 13 models, A2Eval compresses evaluation suites by 85%, reduces overall computational costs by 77%, and delivers a 4.6x speedup.
- Score: 26.357063836707223
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current embodied VLM evaluation relies on static, expert-defined, manually annotated benchmarks that exhibit severe redundancy and coverage imbalance. This labor-intensive paradigm drains computational and annotation resources, inflates costs, and distorts model rankings, ultimately stifling iterative development. To address this, we propose Agentic Automatic Evaluation (A2Eval), the first agentic framework that automates benchmark curation and evaluation through two collaborative agents. The Data Agent autonomously induces capability dimensions and assembles a balanced, compact evaluation suite, while the Eval Agent synthesizes and validates executable evaluation pipelines, enabling fully autonomous, high-fidelity assessment. Evaluated across 10 benchmarks and 13 models, A2Eval compresses evaluation suites by 85%, reduces overall computational costs by 77%, and delivers a 4.6x speedup while preserving evaluation quality. Crucially, A2Eval corrects systematic ranking biases, improves human alignment to Spearman's rho = 0.85, and maintains high ranking fidelity (Kendall's tau = 0.81), establishing a new standard for high-fidelity, low-cost embodied assessment. Our code and data will be made public soon.
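For readers unfamiliar with the ranking-fidelity metrics quoted above, the sketch below shows how Spearman's rho and Kendall's tau could be computed between the model ranking produced by a full evaluation suite and the one produced by a compressed suite. It is a minimal illustration, not the authors' released code; the model names, scores, and use of scipy.stats are assumptions made here for demonstration.

```python
# Minimal sketch (not the A2Eval implementation): quantify how well a
# compressed evaluation suite preserves the model ranking of the full suite,
# using the correlation metrics cited in the abstract.
from scipy.stats import spearmanr, kendalltau

# Hypothetical per-model scores on the full suite and a compressed suite.
full_suite_scores = {"model_a": 0.71, "model_b": 0.64, "model_c": 0.58, "model_d": 0.49}
compressed_scores = {"model_a": 0.69, "model_b": 0.66, "model_c": 0.55, "model_d": 0.50}

models = sorted(full_suite_scores)                     # fixed model order
full = [full_suite_scores[m] for m in models]
compact = [compressed_scores[m] for m in models]

rho, _ = spearmanr(full, compact)    # rank correlation between the two score lists
tau, _ = kendalltau(full, compact)   # pairwise ranking agreement (ranking fidelity)
print(f"Spearman's rho = {rho:.2f}, Kendall's tau = {tau:.2f}")
```

Values close to 1.0 indicate that the compressed suite orders models almost exactly as the full suite does, which is what the reported rho = 0.85 and tau = 0.81 convey.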
Related papers
- ReLE: A Scalable System and Structured Benchmark for Diagnosing Capability Anisotropy in Chinese LLMs [37.23311145049677]
We present ReLE, a scalable system designed to diagnose Capability Anisotropy. We evaluate 304 models across a Domain $\times$ Capability Symbolic matrix comprising 207,843 samples.
arXiv Detail & Related papers (2026-01-24T09:57:59Z) - AutoBench: Automating LLM Evaluation through Reciprocal Peer Assessment [12.9569411072262]
AutoBench is a fully automated and self-sustaining framework for evaluating Large Language Models (LLMs). This paper provides a rigorous scientific validation of the AutoBench methodology, originally developed as an open-source project by eZecute S.R.L.
arXiv Detail & Related papers (2025-10-26T09:20:39Z) - Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains [97.5573252172065]
We train a family of Automatic Reasoning Evaluators (FARE) with a simple iterative rejection-sampling supervised finetuning approach. FARE-8B challenges larger specialized RL-trained evaluators, and FARE-20B sets the new standard for open-source evaluators. As an inference-time reranker, FARE-20B achieves near-oracle performance on MATH.
arXiv Detail & Related papers (2025-10-20T17:52:06Z) - NAIPv2: Debiased Pairwise Learning for Efficient Paper Quality Estimation [58.30936615525824]
We present NAIPv2, a debiased and efficient framework for paper quality estimation. NAIPv2 employs pairwise learning within domain-year groups to reduce inconsistencies in reviewer ratings. It is trained on pairwise comparisons but enables efficient pointwise prediction at deployment.
arXiv Detail & Related papers (2025-09-29T17:59:23Z) - Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance? [2.010294990327175]
Current AI evaluation practices depend heavily on established benchmarks. This research addresses the urgent need to quantify this "benchmark-regulation gap". Our findings reveal a profound misalignment: the evaluation ecosystem dedicates the vast majority of its focus to a narrow set of behavioral propensities.
arXiv Detail & Related papers (2025-08-07T15:03:39Z) - The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models [53.12387628636912]
We propose an automatic evaluation framework that is validated against human annotations. This approach was originally developed for the TREC Question Answering (QA) Track in 2003. We observe strong agreement at the run level between scores derived from fully automatic nugget evaluation and human-based variants.
arXiv Detail & Related papers (2025-04-21T12:55:06Z) - AutoEval: A Practical Framework for Autonomous Evaluation of Mobile Agents [5.995751996623217]
We propose AutoEval, an evaluation framework which tests mobile agents without any manual effort. Our approach designs a UI state change representation which can be used to automatically generate task reward signals. We also evaluate state-of-the-art mobile agents using our framework, providing insights into their performance and limitations.
arXiv Detail & Related papers (2025-03-04T08:44:30Z) - Early-Exit and Instant Confidence Translation Quality Estimation [46.13074343863971]
We tackle two connected challenges: (1) reducing the cost of quality estimation at scale, and (2) developing an inexpensive uncertainty estimation method for quality estimation. To address the latter, we introduce Instant Confidence COMET, an uncertainty-aware quality estimation model that matches the performance of previous approaches at a fraction of their cost. We extend this to Early-Exit COMET, a quality estimation model that can compute quality scores and associated confidences at early model layers, allowing computations to exit early and reducing evaluation costs.
arXiv Detail & Related papers (2025-02-20T10:27:13Z) - Autonomous Evaluation and Refinement of Digital Agents [57.12281122337407]
We show that domain-general automatic evaluators can significantly improve the performance of agents for web navigation and device control.
We validate the performance of these models in several popular benchmarks for digital agents, finding between 74.4 and 92.9% agreement with oracle evaluation metrics.
arXiv Detail & Related papers (2024-04-09T17:25:47Z) - Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - From Adversarial Arms Race to Model-centric Evaluation: Motivating a Unified Automatic Robustness Evaluation Framework [91.94389491920309]
Textual adversarial attacks can discover models' weaknesses by adding semantic-preserved but misleading perturbations to the inputs.
The existing practice of robustness evaluation may exhibit issues of incomprehensive evaluation, impractical evaluation protocol, and invalid adversarial samples.
We set up a unified automatic robustness evaluation framework, shifting towards model-centric evaluation to exploit the advantages of adversarial attacks.
arXiv Detail & Related papers (2023-05-29T14:55:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.