Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents
- URL: http://arxiv.org/abs/2602.16246v2
- Date: Sun, 22 Feb 2026 05:28:43 GMT
- Title: Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents
- Authors: Yun-Shiuan Chuang, Chaitanya Kulkarni, Alec Chiu, Avinash Thangali, Zijie Pan, Shivani Shekhar, Yirou Ge, Yixi Li, Uma Kona, Linsey Pang, Prakhar Mehrotra
- Abstract summary: Large language model (LLM) agents operating via multi-turn dialogue and multi-step tool calling are increasingly used in production. Prior agentic benchmarks rely on fully deterministic backends, which are costly to build and iterate. We propose Proxy State-Based Evaluation, an LLM-driven simulation framework that preserves final state-based evaluation without a deterministic database.
- Score: 8.760287445955045
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Interactive large language model (LLM) agents operating via multi-turn dialogue and multi-step tool calling are increasingly used in production. Benchmarks for these agents must both reliably compare models and yield on-policy training data. Prior agentic benchmarks (e.g., tau-bench, tau2-bench, AppWorld) rely on fully deterministic backends, which are costly to build and iterate. We propose Proxy State-Based Evaluation, an LLM-driven simulation framework that preserves final state-based evaluation without a deterministic database. Specifically, a scenario specifies the user goal, user/system facts, expected final state, and expected agent behavior, and an LLM state tracker infers a structured proxy state from the full interaction trace. LLM judges then verify goal completion and detect tool/user hallucinations against scenario constraints. Empirically, our benchmark produces stable, model-differentiating rankings across families and inference-time reasoning efforts, and its on-/off-policy rollouts provide supervision that transfers to unseen scenarios. Careful scenario specification yields near-zero simulator hallucination rates as supported by ablation studies. The framework also supports sensitivity analyses over user personas. Human-LLM judge agreement exceeds 90%, indicating reliable automated evaluation. Overall, proxy state-based evaluation offers a practical, scalable alternative to deterministic agentic benchmarks for industrial LLM agents.
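The abstract's pipeline (a scenario specifying goal, facts, and expected final state; an LLM state tracker inferring a structured proxy state from the interaction trace; LLM judges checking the proxy state against the scenario) can be sketched in a few lines. The paper does not publish its API, so everything below is hypothetical: the field names, the trace event schema, and the deterministic stand-ins for the LLM state tracker and judge are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    # Hypothetical fields mirroring the paper's scenario specification.
    user_goal: str
    facts: dict
    expected_final_state: dict
    expected_behavior: list

def track_proxy_state(trace: list[dict]) -> dict:
    """Stand-in for the LLM state tracker: infer a structured proxy state
    from the full interaction trace. Here we simply fold the side effects
    of successful tool calls into a flat state dict."""
    state: dict = {}
    for event in trace:
        if event.get("type") == "tool_call" and event.get("status") == "ok":
            state.update(event.get("effects", {}))
    return state

def judge_goal_completion(proxy_state: dict, scenario: Scenario) -> bool:
    """Stand-in for the LLM judge: the goal counts as completed when every
    expected key/value pair appears in the inferred proxy state."""
    return all(proxy_state.get(k) == v
               for k, v in scenario.expected_final_state.items())

scenario = Scenario(
    user_goal="Cancel order 123 and refund the original card",
    facts={"order_id": "123"},
    expected_final_state={"order_123_status": "cancelled", "refund_issued": True},
    expected_behavior=["confirm identity before cancelling"],
)
trace = [
    {"type": "tool_call", "name": "cancel_order", "status": "ok",
     "effects": {"order_123_status": "cancelled"}},
    {"type": "tool_call", "name": "issue_refund", "status": "ok",
     "effects": {"refund_issued": True}},
]
print(judge_goal_completion(track_proxy_state(trace), scenario))  # True
```

In the actual framework both the tracker and the judge are LLM calls, which is what lets the approach skip building a deterministic backend: only the scenario specification and the final proxy state need to be compared.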
Related papers
- Claim Automation using Large Language Model [0.0]
Large Language Models (LLMs) have achieved strong performance on general-purpose language tasks, but their deployment in regulated and data-sensitive domains, including insurance, remains limited. We propose a locally deployed, governance-aware language modeling component that generates structured corrective-action recommendations from unstructured claim narratives. We fine-tune pretrained LLMs using Low-Rank Adaptation (LoRA), scoping the model to an initial decision module within the claim-processing pipeline to speed up claim adjusters' decisions.
arXiv Detail & Related papers (2026-02-18T20:01:12Z) - MirrorBench: An Extensible Framework to Evaluate User-Proxy Agents for Human-Likeness [0.4893345190925178]
Large language models (LLMs) are increasingly used as human simulators, but naive "act-as-a-user" prompting often yields verbose, unrealistic utterances. We present MIRRORBENCH, a benchmarking framework that evaluates user proxies solely on their ability to produce human-like user utterances.
arXiv Detail & Related papers (2026-01-13T01:16:13Z) - How can we assess human-agent interactions? Case studies in software agent design [52.953425368394306]
We make two major steps towards the rigorous assessment of human-agent interactions. We propose PULSE, a framework for more efficient human-centric evaluation of agent designs. We deploy the framework on a large-scale web platform built around the open-source software agent OpenHands.
arXiv Detail & Related papers (2025-10-10T19:04:28Z) - CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We also introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z) - MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models [76.72220653705679]
We introduce MCPEval, an open-source framework that automates end-to-end task generation and deep evaluation of intelligent agents. MCPEval standardizes metrics, seamlessly integrates with native agent tools, and eliminates manual effort in building evaluation pipelines. Empirical results across five real-world domains show its effectiveness in revealing nuanced, domain-specific performance.
arXiv Detail & Related papers (2025-07-17T05:46:27Z) - Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification [17.67273082468732]
Verifiers -- functions assigning rewards to agent behavior -- have been key for AI progress in domains like math and board games. We evaluate Multimodal Large Language Models (MLLMs) as verifiers of agent trajectories across web navigation, computer use, and robotic manipulation. We propose Self-Grounded Verification (SGV), a lightweight method that enables more effective use of MLLMs' knowledge and reasoning.
arXiv Detail & Related papers (2025-07-15T18:50:29Z) - ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks [64.86209459039313]
ThinkGeo is an agentic benchmark designed to evaluate tool-augmented agents on remote sensing tasks via structured tool use and multi-step planning. We implement a ReAct-style interaction loop and evaluate both open- and closed-source LLMs on 486 structured agentic tasks with 1,773 expert-verified reasoning steps. Our analysis reveals notable disparities in tool accuracy and planning consistency across models.
arXiv Detail & Related papers (2025-05-29T17:59:38Z) - Integrating Counterfactual Simulations with Language Models for Explaining Multi-Agent Behaviour [35.19786322586909]
We propose Agentic eXplanations via Interrogative Simulation (AXIS). AXIS generates human-centred action explanations for multi-agent policies. We evaluate AXIS on autonomous driving across ten scenarios for five LLMs.
arXiv Detail & Related papers (2025-05-23T12:19:18Z) - IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios. Agent performance is judged by comparing its final numerical output to the human-derived baseline. Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on only 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z) - TALE: A Tool-Augmented Framework for Reference-Free Evaluation of Large Language Models [16.857263524133284]
Large Language Models (LLMs) are increasingly integrated into real-world, autonomous applications, but relying on static, pre-annotated references for evaluation poses significant challenges in cost, scalability, and completeness. We propose Tool-Augmented LLM Evaluation (TALE), a framework to assess LLM outputs without predetermined ground-truth answers.
arXiv Detail & Related papers (2025-04-10T02:08:41Z) - TestAgent: Automatic Benchmarking and Exploratory Interaction for Evaluating LLMs in Vertical Domains [19.492393243160244]
Large Language Models (LLMs) are increasingly deployed in highly specialized vertical domains. Existing evaluations for vertical domains typically rely on the labor-intensive construction of static, single-turn datasets. We propose TestAgent, a framework for automatic benchmarking and exploratory dynamic evaluation in vertical domains.
arXiv Detail & Related papers (2024-10-15T11:20:42Z) - ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation [48.54271457765236]
Large Language Models (LLMs) can elicit unintended and even harmful content when misaligned with human values.
Current evaluation benchmarks predominantly employ expert-designed contextual scenarios to assess how well LLMs align with human values.
We propose ALI-Agent, an evaluation framework that leverages the autonomous abilities of LLM-powered agents to conduct in-depth and adaptive alignment assessments.
arXiv Detail & Related papers (2024-05-23T02:57:42Z) - Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.