Related papers: CUARewardBench: A Benchmark for Evaluating Reward Models on Computer-using Agent

URL: http://arxiv.org/abs/2510.18596v1
Date: Tue, 21 Oct 2025 12:53:40 GMT
Title: CUARewardBench: A Benchmark for Evaluating Reward Models on Computer-using Agent
Authors: Haojia Lin, Xiaoyu Tan, Yulei Qin, Zihan Xu, Yuchen Shi, Zongyi Li, Gang Li, Shaofei Cai, Siqi Cai, Chaoyou Fu, Ke Li, Xing Sun
Abstract summary: Computer-using agents (CUAs) enable task completion through natural interaction with operating systems and software interfaces. Reward models offer promising alternatives, but their effectiveness on CUA evaluation remains largely underexplored. We present CUARewardBench, comprising four key contributions.
Score: 46.41047559759938
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Computer-using agents (CUAs) enable task completion through natural interaction with operating systems and software interfaces. While script-based verifiers are widely adopted for evaluation, they suffer from limited scalability and inability to provide step-wise assessment. Reward models offer promising alternatives, but their effectiveness on CUA evaluation remains largely underexplored. To address this gap, we present CUARewardBench, comprising four key contributions: (1) First-ever Comprehensive CUA Reward Benchmark: We introduce the first benchmark for evaluating both outcome reward models (ORM) and process reward models (PRM) on CUA tasks, enabling systematic assessment across trajectory-level and step-level evaluation. (2) Diverse, Practical and Reliable Dataset: CUARewardBench encompasses trajectories from 10 software categories and 7 agent architectures with varying performance levels (25.9%-50.8% success rates). All trajectories are expertly annotated through carefully designed protocols, with rigorous quality control to ensure reliability and practical applicability. (3) Comprehensive Analysis and Insights: Through extensive experiments across 7 vision-language models and 3 prompt templates, we reveal critical limitations of current CUA RMs, including insufficient visual reasoning capabilities, knowledge deficiencies, and the superiority of general VLMs over specialized CUA models for reward evaluation. (4) Unanimous Prompt Ensemble (UPE): Based on the insights from our comprehensive analysis, we propose UPE, a novel ensemble method that significantly enhances reward model reliability through strict unanimous voting and strategic prompt-template configurations. UPE achieves 89.8% precision and 93.3% NPV for ORM, and 81.7% precision and 85.1% NPV for PRM, substantially outperforming single VLMs and traditional ensemble approaches.
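The abstract reports UPE results as precision and NPV under "strict unanimous voting". The following is a minimal sketch of how such an ensemble and its two metrics could be computed, not the authors' implementation. The names `unanimous_vote`, `evaluate`, and `Metrics` are hypothetical, and treating a non-unanimous vote as an abstention is an assumption: the abstract does not say how disagreements are handled.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


def unanimous_vote(votes: Sequence[bool]) -> Optional[bool]:
    """Return the shared verdict if every vote agrees, else None.

    Assumption: a disagreement means the ensemble abstains.
    """
    if not votes:
        return None
    first = votes[0]
    return first if all(v == first for v in votes) else None


@dataclass
class Metrics:
    precision: Optional[float]  # TP / (TP + FP)
    npv: Optional[float]        # TN / (TN + FN)
    coverage: float             # share of items with a unanimous verdict


def evaluate(all_votes: Sequence[Sequence[bool]],
             labels: Sequence[bool]) -> Metrics:
    """Score unanimous verdicts against ground-truth success labels."""
    tp = fp = tn = fn = decided = 0
    for votes, label in zip(all_votes, labels):
        verdict = unanimous_vote(votes)
        if verdict is None:
            continue  # abstained items count toward neither metric
        decided += 1
        if verdict and label:
            tp += 1
        elif verdict and not label:
            fp += 1
        elif not verdict and not label:
            tn += 1
        else:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else None
    npv = tn / (tn + fn) if (tn + fn) else None
    coverage = decided / len(labels) if labels else 0.0
    return Metrics(precision, npv, coverage)


# Toy data: one inner list per trajectory, one vote per prompt template.
votes = [
    [True, True, True],
    [True, True, True],
    [False, False, False],
    [True, False, True],
    [False, False, False],
]
labels = [True, False, False, True, False]
result = evaluate(votes, labels)
```

In this sketch, stricter voting raises precision and NPV only on the items where every voter agrees, so the share of decided items (`coverage`) is worth tracking alongside the two metrics.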

Related papers

SWE-RM: Execution-free Feedback For Software Engineering Agents [61.86380395896069]
Execution-based feedback is widely used in the development of coding agents through test-time scaling (TTS) and reinforcement learning (RL). In contrast, execution-free feedback from reward models can provide more fine-grained signals without depending on unit test cases. We introduce SWE-RM, an accurate and robust reward model adopting a mixture-of-experts architecture with 30B total parameters and 3B activated during inference.
arXiv Detail & Related papers (2025-12-26T08:26:18Z)
DrawingBench: Evaluating Spatial Reasoning and UI Interaction Capabilities of Large Language Models through Mouse-Based Drawing Tasks [10.977990951788422]
DrawingBench is a verification framework for evaluating the trustworthiness of agentic LLMs. Our framework comprises 250 diverse prompts across 20 categories and 4 difficulty levels. We evaluate four state-of-the-art LLMs across 1,000 tests.
arXiv Detail & Related papers (2025-12-01T01:18:21Z)
Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains [97.5573252172065]
We train a family of Automatic Reasoning Evaluators (FARE) with a simple iterative rejection-sampling supervised finetuning approach. FARE-8B challenges larger specialized RL-trained evaluators and FARE-20B sets the new standard for open-source evaluators. As inference-time rerankers, FARE-20B achieves near-oracle performance on MATH.
arXiv Detail & Related papers (2025-10-20T17:52:06Z)
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. We propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision.
arXiv Detail & Related papers (2025-06-19T08:49:13Z)
Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models [19.924422958846144]
We present Athena-PRM, a process reward model (PRM) designed to evaluate the reward score for each step in solving complex reasoning problems. Our Athena-PRM consistently achieves superior performance across multiple benchmarks and scenarios.
arXiv Detail & Related papers (2025-06-11T09:01:59Z)
Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback [59.078756231841574]
Critique-GRPO is an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. We show Critique-GRPO consistently outperforms supervised learning and RL-based fine-tuning methods across eight challenging mathematical, STEM, and general reasoning tasks.
arXiv Detail & Related papers (2025-06-03T17:39:02Z)
R-PRM: Reasoning-Driven Process Reward Modeling [53.06844294668382]
Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step. Existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy. We propose Reasoning-Driven Process Reward Modeling (R-PRM). R-PRM generates seed data from limited annotations, effectively bootstrapping our model's reasoning capabilities.
arXiv Detail & Related papers (2025-03-27T09:23:08Z)
AURORA: Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification [31.463529258956452]
We present AURORA, a novel framework for training universal process reward models (PRMs) using ensemble prompting and reverse verification. The framework employs a two-phase approach: First, it uses diverse prompting strategies and ensemble methods to perform automated annotation and evaluation of processes. To assess the framework's performance, we extend beyond the existing ProcessBench benchmark by introducing UniversalBench.
arXiv Detail & Related papers (2025-02-17T07:41:27Z)
PCA-Bench: Evaluating Multimodal Large Language Models in Perception-Cognition-Action Chain [37.448177723993346]
We present PCA-Bench, a benchmark for evaluating the integrated capabilities of Multimodal Large Language Models (MLLMs). Given task instructions and diverse contexts, the model is required to seamlessly integrate Perception, Cognition, and Action in a reasoning chain. We propose PCA-Eval, an automatic evaluation protocol, and assess 10 prevalent MLLMs.
arXiv Detail & Related papers (2024-02-21T07:09:58Z)
Tool-Augmented Reward Modeling [58.381678612409]
We propose a tool-augmented preference modeling approach, named Themis, to address limitations by empowering RMs with access to external environments. Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources. In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
arXiv Detail & Related papers (2023-10-02T09:47:40Z)
