MirrorBench: An Extensible Framework to Evaluate User-Proxy Agents for Human-Likeness
- URL: http://arxiv.org/abs/2601.08118v1
- Date: Tue, 13 Jan 2026 01:16:13 GMT
- Title: MirrorBench: An Extensible Framework to Evaluate User-Proxy Agents for Human-Likeness
- Authors: Ashutosh Hathidara, Julien Yu, Vaishali Senthil, Sebastian Schreiber, Anil Babu Ankisettipalli,
- Abstract summary: Large language models (LLMs) are increasingly used as human simulators. Naive "act-as-a-user" prompting often yields verbose, unrealistic utterances. We present MIRRORBENCH, a benchmarking framework that evaluates user proxies solely on their ability to produce human-like user utterances.
- Score: 0.4893345190925178
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models (LLMs) are increasingly used as human simulators, both for evaluating conversational systems and for generating fine-tuning data. However, naive "act-as-a-user" prompting often yields verbose, unrealistic utterances, underscoring the need for principled evaluation of so-called user proxy agents. We present MIRRORBENCH, a reproducible, extensible benchmarking framework that evaluates user proxies solely on their ability to produce human-like user utterances across diverse conversational tasks, explicitly decoupled from downstream task success. MIRRORBENCH features a modular execution engine with typed interfaces, metadata-driven registries, multi-backend support, caching, and robust observability. The system supports pluggable user proxies, datasets, tasks, and metrics, enabling researchers to evaluate arbitrary simulators under a uniform, variance-aware harness. We include three lexical-diversity metrics (MATTR, YULE'S K, and HD-D) and three LLM-judge-based metrics (GTEval, Pairwise Indistinguishability, and Rubric-and-Reason). Across four open datasets, MIRRORBENCH yields variance-aware results and reveals systematic gaps between user proxies and real human users. The framework is open source and includes a simple command-line interface for running experiments, managing configurations and caching, and generating reports. The framework can be accessed at https://github.com/SAP/mirrorbench.
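The three lexical-diversity metrics named in the abstract are standard measures from the corpus-linguistics literature. The sketch below shows their textbook formulations (MATTR as a moving-average type-token ratio, Yule's K from the word-frequency spectrum, and HD-D from hypergeometric sampling probabilities). The window size of 50 and sample size of 42 are conventional defaults from the literature; the tokenization and parameter choices inside MirrorBench may differ, so treat this as an illustrative reference rather than the framework's own implementation.

```python
# Minimal sketch of the three lexical-diversity metrics named in the abstract,
# using their standard definitions. Not MirrorBench's code; parameters are
# conventional defaults, not necessarily the framework's.
from collections import Counter
from math import comb


def mattr(tokens: list[str], window: int = 50) -> float:
    """Moving-Average Type-Token Ratio: mean TTR over a sliding window."""
    if not tokens:
        return 0.0
    if len(tokens) < window:
        return len(set(tokens)) / len(tokens)
    ttrs = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ttrs) / len(ttrs)


def yules_k(tokens: list[str]) -> float:
    """Yule's K: higher values indicate more repetition (lower diversity)."""
    n = len(tokens)
    if n == 0:
        return 0.0
    freqs = Counter(tokens)               # type -> frequency
    spectrum = Counter(freqs.values())    # frequency -> number of types with it
    s2 = sum(m * m * v for m, v in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)


def hdd(tokens: list[str], sample_size: int = 42) -> float:
    """HD-D: expected TTR of a random sample drawn without replacement."""
    n = len(tokens)
    if not tokens:
        return 0.0
    if n < sample_size:
        return len(set(tokens)) / n
    contribution = 0.0
    for freq in Counter(tokens).values():
        # P(type appears at least once in a sample of `sample_size` tokens)
        p_absent = comb(n - freq, sample_size) / comb(n, sample_size)
        contribution += (1.0 - p_absent) / sample_size
    return contribution


if __name__ == "__main__":
    utterance = "could you book me a table for two tonight please".split()
    print(mattr(utterance, window=5), yules_k(utterance), hdd(utterance, 5))
```

Higher MATTR and HD-D indicate more varied vocabulary, while higher Yule's K indicates more repetition; this is one way verbose or templated proxy utterances can be distinguished from genuinely human ones.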
Related papers
- Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents [8.760287445955045]
Large language model (LLM) agents operating via multi-turn dialogue and multi-step tool calling are increasingly used in production. Prior agentic benchmarks rely on fully deterministic backends, which are costly to build and iterate. We propose Proxy State-Based Evaluation, an LLM-driven simulation framework that preserves final state-based evaluation without a deterministic database.
arXiv Detail & Related papers (2026-02-18T07:49:47Z)
- FABRIC: Framework for Agent-Based Realistic Intelligence Creation [3.940391073007047]
Large language models (LLMs) are increasingly deployed as agents, expected to decompose goals, invoke tools, and verify results in dynamic environments. We present a unified framework for synthesizing agentic data using only LLMs, without any human-in-the-loop supervision.
arXiv Detail & Related papers (2025-10-20T18:20:22Z)
- Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting [92.57796055887995]
We introduce ECHO, a prompting framework that adapts hindsight experience replay from reinforcement learning for language model agents. ECHO generates optimized trajectories for alternative goals that could have been achieved during failed attempts. We evaluate ECHO on stateful versions of XMiniGrid, a text-based navigation and planning benchmark, and PeopleJoinQA, a collaborative information-gathering enterprise simulation.
arXiv Detail & Related papers (2025-10-11T18:11:09Z)
- Saving SWE-Bench: A Benchmark Mutation Approach for Realistic Agent Evaluation [5.332969177132911]
Current benchmarks for evaluating software engineering agents, such as SWE-Bench Verified, are predominantly derived from GitHub issues. We introduce a novel benchmarking framework that transforms existing formal benchmarks into realistic user queries.
arXiv Detail & Related papers (2025-10-10T04:42:02Z)
- Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling [83.78874399606379]
We propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling. It comprises four distinct small-scale agents with clearly defined roles and effective collaboration. It shows superior performance at a smaller parameter scale without sacrificing ability on general and mathematical tasks.
arXiv Detail & Related papers (2025-08-05T12:52:09Z)
- ProxAnn: Use-Oriented Evaluations of Topic Models and Document Clustering [52.19512723549318]
We design a scalable human evaluation protocol that reflects practitioners' real-world usage of models. We use this protocol to collect extensive crowdworker annotations of outputs from a diverse set of topic models. We then use these annotations to validate automated proxies, finding that the best LLM proxies are statistically indistinguishable from a human annotator.
arXiv Detail & Related papers (2025-07-01T15:00:55Z)
- Multimodal Information Retrieval for Open World with Edit Distance Weak Supervision [0.0]
"FemmIR" is a framework to retrieve results relevant to information needs expressed with multimodal queries by example without any similarity label.<n>We empirically evaluate FemmIR on a missing person use case with MuQNOL.
arXiv Detail & Related papers (2025-06-25T00:25:08Z) - What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities [56.646832992178105]
We introduce OmniBench, a cross-platform, graph-based benchmark with an automated pipeline for synthesizing tasks of controllable complexity. We present OmniEval, a multidimensional evaluation framework that includes subtask-level evaluation, graph-based metrics, and comprehensive tests across 10 capabilities. Our dataset contains 36k graph-structured tasks across 20 scenarios, achieving a 91% human acceptance rate.
arXiv Detail & Related papers (2025-06-10T15:59:38Z)
- Towards Realistic Evaluation of Commit Message Generation by Matching Online and Offline Settings [77.20838441870151]
We use an online metric (the number of edits users introduce before committing the generated messages to the VCS) to select metrics for offline experiments. We collect a dataset with 57 pairs consisting of commit messages generated by GPT-4 and their counterparts edited by human experts. Our results indicate that edit distance exhibits the highest correlation with the online metric, whereas commonly used similarity metrics such as BLEU and METEOR demonstrate low correlation. (A minimal sketch of this kind of correlation check appears after this list.)
arXiv Detail & Related papers (2024-10-15T20:32:07Z)
- On Generative Agents in Recommendation [58.42840923200071]
Agent4Rec is a user simulator for recommendation based on Large Language Models.
Each agent interacts with personalized recommender models in a page-by-page manner.
arXiv Detail & Related papers (2023-10-16T06:41:16Z)
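The commit-message paper above selects offline metrics by how strongly they correlate with an online signal (the number of user edits). The following minimal sketch illustrates that kind of check using hypothetical message pairs and a simple character-level similarity as a stand-in for edit distance; it does not reproduce the paper's data, metrics, or exact correlation procedure.

```python
# Hedged sketch: score (generated, human-edited) pairs with an offline metric
# and check how well it tracks the online signal (number of user edits).
# All pairs and edit counts below are invented for illustration.
from difflib import SequenceMatcher


def edit_similarity(generated: str, edited: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for edit distance."""
    return SequenceMatcher(None, generated, edited).ratio()


def spearman_rho(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation (ties ranked naively in this sketch)."""
    def ranks(values: list[float]) -> list[float]:
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        for rank, idx in enumerate(order):
            r[idx] = float(rank)
        return r

    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    denom = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return cov / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical (generated message, human-edited message, online edit count).
    pairs = [
        ("fix bug in parser", "fix off-by-one bug in the CSV parser", 3),
        ("update docs", "update README installation section", 2),
        ("add tests for login flow", "add tests for login flow", 0),
    ]
    offline = [1.0 - edit_similarity(g, e) for g, e, _ in pairs]
    online = [float(n_edits) for _, _, n_edits in pairs]
    print(f"Spearman rho (edit-distance proxy vs. online edits): "
          f"{spearman_rho(offline, online):.2f}")
```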
This list is automatically generated from the titles and abstracts of the papers on this site.