MirrorBench: An Extensible Framework to Evaluate User-Proxy Agents for Human-Likeness
- URL: http://arxiv.org/abs/2601.08118v1
- Date: Tue, 13 Jan 2026 01:16:13 GMT
- Title: MirrorBench: An Extensible Framework to Evaluate User-Proxy Agents for Human-Likeness
- Authors: Ashutosh Hathidara, Julien Yu, Vaishali Senthil, Sebastian Schreiber, Anil Babu Ankisettipalli,
- Abstract summary: Large language models (LLMs) are increasingly used as human simulators. Naive "act-as-a-user" prompting often yields verbose, unrealistic utterances. We present MIRRORBENCH, a benchmarking framework that evaluates user proxies solely on their ability to produce human-like user utterances.
- Score: 0.4893345190925178
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models (LLMs) are increasingly used as human simulators, both for evaluating conversational systems and for generating fine-tuning data. However, naive "act-as-a-user" prompting often yields verbose, unrealistic utterances, underscoring the need for principled evaluation of so-called user proxy agents. We present MIRRORBENCH, a reproducible, extensible benchmarking framework that evaluates user proxies solely on their ability to produce human-like user utterances across diverse conversational tasks, explicitly decoupled from downstream task success. MIRRORBENCH features a modular execution engine with typed interfaces, metadata-driven registries, multi-backend support, caching, and robust observability. The system supports pluggable user proxies, datasets, tasks, and metrics, enabling researchers to evaluate arbitrary simulators under a uniform, variance-aware harness. We include three lexical-diversity metrics (MATTR, YULE'S K, and HD-D) and three LLM-judge-based metrics (GTEval, Pairwise Indistinguishability, and Rubric-and-Reason). Across four open datasets, MIRRORBENCH yields variance-aware results and reveals systematic gaps between user proxies and real human users. The framework is open source and includes a simple command-line interface for running experiments, managing configurations and caching, and generating reports. The framework can be accessed at https://github.com/SAP/mirrorbench.
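The three lexical-diversity metrics named in the abstract are standard measures from the corpus-linguistics literature. The sketch below shows their textbook formulations (MATTR as a moving-average type-token ratio, Yule's K from the word-frequency spectrum, and HD-D from hypergeometric sampling probabilities). The window size of 50 and sample size of 42 are conventional defaults from the literature; the tokenization and parameter choices inside MirrorBench may differ, so treat this as an illustrative reference rather than the framework's own implementation.

```python
# Minimal sketch of the three lexical-diversity metrics named in the abstract,
# using their standard definitions. Not MirrorBench's code; parameters are
# conventional defaults, not necessarily the framework's.
from collections import Counter
from math import comb


def mattr(tokens: list[str], window: int = 50) -> float:
    """Moving-Average Type-Token Ratio: mean TTR over a sliding window."""
    if not tokens:
        return 0.0
    if len(tokens) < window:
        return len(set(tokens)) / len(tokens)
    ttrs = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ttrs) / len(ttrs)


def yules_k(tokens: list[str]) -> float:
    """Yule's K: higher values indicate more repetition (lower diversity)."""
    n = len(tokens)
    if n == 0:
        return 0.0
    freqs = Counter(tokens)               # type -> frequency
    spectrum = Counter(freqs.values())    # frequency -> number of types with it
    s2 = sum(m * m * v for m, v in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)


def hdd(tokens: list[str], sample_size: int = 42) -> float:
    """HD-D: expected TTR of a random sample drawn without replacement."""
    n = len(tokens)
    if not tokens:
        return 0.0
    if n < sample_size:
        return len(set(tokens)) / n
    contribution = 0.0
    for freq in Counter(tokens).values():
        # P(type appears at least once in a sample of `sample_size` tokens)
        p_absent = comb(n - freq, sample_size) / comb(n, sample_size)
        contribution += (1.0 - p_absent) / sample_size
    return contribution


if __name__ == "__main__":
    utterance = "could you book me a table for two tonight please".split()
    print(mattr(utterance, window=5), yules_k(utterance), hdd(utterance, 5))
```

Higher MATTR and HD-D indicate more varied vocabulary, while higher Yule's K indicates more repetition; this is one way verbose or templated proxy utterances can be distinguished from genuinely human ones.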
Related papers
- Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents [8.760287445955045]
Large language model (LLM) agents operating via multi-turn dialogue and multi-step tool calling are increasingly used in production. Prior agentic benchmarks rely on fully deterministic backends, which are costly to build and iterate. We propose Proxy State-Based Evaluation, an LLM-driven simulation framework that preserves final state-based evaluation without a deterministic database.
arXiv Detail & Related papers (2026-02-18T07:49:47Z)
- FABRIC: Framework for Agent-Based Realistic Intelligence Creation [3.940391073007047]
Large language models (LLMs) are increasingly deployed as agents, expected to decompose goals, invoke tools, and verify results in dynamic environments. We present a unified framework for synthesizing agentic data using only LLMs, without any human-in-the-loop supervision.
arXiv Detail & Related papers (2025-10-20T18:20:22Z)
- Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting [92.57796055887995]
We introduce ECHO, a prompting framework that adapts hindsight experience replay from reinforcement learning for language model agents. ECHO generates optimized trajectories for alternative goals that could have been achieved during failed attempts. We evaluate ECHO on stateful versions of XMiniGrid, a text-based navigation and planning benchmark, and PeopleJoinQA, a collaborative information-gathering enterprise simulation.
arXiv Detail & Related papers (2025-10-11T18:11:09Z)
- Saving SWE-Bench: A Benchmark Mutation Approach for Realistic Agent Evaluation [5.332969177132911]
Current benchmarks for evaluating software engineering agents, such as SWE-Bench Verified, are predominantly derived from GitHub issues. We introduce a novel benchmarking framework that transforms existing formal benchmarks into realistic user queries.
arXiv Detail & Related papers (2025-10-10T04:42:02Z)
- Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling [83.78874399606379]
We propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling. It comprises four distinct small-scale agents with clearly defined roles and effective collaboration. It shows superior performance at a smaller parameter scale without sacrificing ability on general and mathematical tasks.
arXiv Detail & Related papers (2025-08-05T12:52:09Z)
- ProxAnn: Use-Oriented Evaluations of Topic Models and Document Clustering [52.19512723549318]
We design a scalable human evaluation protocol that reflects practitioners' real-world usage of models. We use this protocol to collect extensive crowdworker annotations of outputs from a diverse set of topic models. We then use these annotations to validate automated proxies, finding that the best LLM proxies are statistically indistinguishable from a human annotator.
arXiv Detail & Related papers (2025-07-01T15:00:55Z)
- Multimodal Information Retrieval for Open World with Edit Distance Weak Supervision [0.0]
"FemmIR" is a framework to retrieve results relevant to information needs expressed with multimodal queries by example without any similarity label.<n>We empirically evaluate FemmIR on a missing person use case with MuQNOL.
arXiv Detail & Related papers (2025-06-25T00:25:08Z) - What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities [56.646832992178105]
We introduce OmniBench, a cross-platform, graph-based benchmark with an automated pipeline for synthesizing tasks of controllable complexity. We present OmniEval, a multidimensional evaluation framework that includes subtask-level evaluation, graph-based metrics, and comprehensive tests across 10 capabilities. Our dataset contains 36k graph-structured tasks across 20 scenarios, achieving a 91% human acceptance rate.
arXiv Detail & Related papers (2025-06-10T15:59:38Z)
- Towards Realistic Evaluation of Commit Message Generation by Matching Online and Offline Settings [77.20838441870151]
We use an online metric (the number of edits users introduce before committing the generated messages to the VCS) to select metrics for offline experiments. We collect a dataset with 57 pairs consisting of commit messages generated by GPT-4 and their counterparts edited by human experts. Our results indicate that edit distance exhibits the highest correlation with the online metric, whereas commonly used similarity metrics such as BLEU and METEOR demonstrate low correlation. (A minimal sketch of this kind of correlation check appears after this list.)
arXiv Detail & Related papers (2024-10-15T20:32:07Z)
- On Generative Agents in Recommendation [58.42840923200071]
Agent4Rec is a user simulator for recommendation based on Large Language Models.
Each agent interacts with personalized recommender models in a page-by-page manner.
arXiv Detail & Related papers (2023-10-16T06:41:16Z)
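The commit-message paper above selects offline metrics by how strongly they correlate with an online signal (the number of user edits). The following minimal sketch illustrates that kind of check using hypothetical message pairs and a simple character-level similarity as a stand-in for edit distance; it does not reproduce the paper's data, metrics, or exact correlation procedure.

```python
# Hedged sketch: score (generated, human-edited) pairs with an offline metric
# and check how well it tracks the online signal (number of user edits).
# All pairs and edit counts below are invented for illustration.
from difflib import SequenceMatcher


def edit_similarity(generated: str, edited: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for edit distance."""
    return SequenceMatcher(None, generated, edited).ratio()


def spearman_rho(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation (ties ranked naively in this sketch)."""
    def ranks(values: list[float]) -> list[float]:
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        for rank, idx in enumerate(order):
            r[idx] = float(rank)
        return r

    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    denom = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return cov / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical (generated message, human-edited message, online edit count).
    pairs = [
        ("fix bug in parser", "fix off-by-one bug in the CSV parser", 3),
        ("update docs", "update README installation section", 2),
        ("add tests for login flow", "add tests for login flow", 0),
    ]
    offline = [1.0 - edit_similarity(g, e) for g, e, _ in pairs]
    online = [float(n_edits) for _, _, n_edits in pairs]
    print(f"Spearman rho (edit-distance proxy vs. online edits): "
          f"{spearman_rho(offline, online):.2f}")
```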
This list is automatically generated from the titles and abstracts of the papers on this site.