SimulatorArena: Are User Simulators Reliable Proxies for Multi-Turn Evaluation of AI Assistants?
- URL: http://arxiv.org/abs/2510.05444v2
- Date: Wed, 08 Oct 2025 21:12:53 GMT
- Title: SimulatorArena: Are User Simulators Reliable Proxies for Multi-Turn Evaluation of AI Assistants?
- Authors: Yao Dou, Michel Galley, Baolin Peng, Chris Kedzie, Weixin Cai, Alan Ritter, Chris Quirk, Wei Xu, Jianfeng Gao
- Abstract summary: Large language models (LLMs) are increasingly used in interactive applications. Human evaluation remains the gold standard for assessing their performance in multi-turn conversations. We introduce SimulatorArena, a benchmark of 909 annotated human-LLM conversations on two interactive tasks.
- Score: 61.07963107032645
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are increasingly used in interactive applications, and human evaluation remains the gold standard for assessing their performance in multi-turn conversations. Since human studies are costly, time-consuming, and hard to reproduce, recent work explores using LLMs to simulate users for automatic assistant evaluation. However, there is no benchmark or systematic study to evaluate whether these simulated users are reliable stand-ins for real users. To address this, we introduce SimulatorArena, a benchmark of 909 annotated human-LLM conversations on two interactive tasks -- math tutoring and document creation. SimulatorArena evaluates simulators based on how closely their messages match human behavior and how well their assistant ratings align with human judgments. Experiments on various simulator methods show that simulators conditioned on user profiles, capturing traits like background and message styles, align closely with human judgments. They reach Spearman's $\rho$ of 0.7 on both tasks, providing a practical, scalable alternative to human evaluation. Using the best simulator for each task, we benchmark 18 assistants, including the latest LLMs such as GPT-5, Claude 4.1 Opus, and Gemini 2.5 Pro.
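As a rough illustration of the rating-alignment metric described in the abstract, the sketch below computes Spearman's ρ between per-assistant ratings from human studies and from simulator runs. Only the use of Spearman's ρ comes from the paper; the rating values, scale, and variable names are illustrative assumptions. A generic sketch of the profile-conditioned simulation loop itself appears after the related-papers list below.

```python
# Minimal sketch of the rating-alignment check: correlate per-assistant
# ratings collected from simulated users with ratings from real users.
# All numbers are made-up placeholders, not SimulatorArena data.
from scipy.stats import spearmanr

# Hypothetical mean rating per assistant (same assistant order in both lists).
human_ratings = [8.2, 6.9, 7.5, 5.1, 9.0]      # from real user studies
simulator_ratings = [7.8, 6.5, 7.9, 5.4, 8.7]  # from simulated user studies

rho, p_value = spearmanr(human_ratings, simulator_ratings)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")
# The abstract reports rho of about 0.7 for the best profile-conditioned simulators.
```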
Related papers
- ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders [48.83868690303791]
We introduce ConvApparel, a new dataset of human-AI conversations designed to address this gap. Its unique dual-agent data collection protocol -- using both "good" and "bad" recommenders -- enables counterfactual validation. We propose a comprehensive validation framework that combines statistical alignment, a human-likeness score, and counterfactual validation.
arXiv Detail & Related papers (2026-02-18T23:00:21Z) - Flipping the Dialogue: Training and Evaluating User Language Models [31.119620506835677]
We introduce purpose-built User Language Models (User LMs): models post-trained to simulate human users in multi-turn conversations. We show that User LMs align better with human behavior and achieve better simulation robustness than existing simulation methods.
arXiv Detail & Related papers (2025-10-08T01:04:36Z) - YuLan-OneSim: Towards the Next Generation of Social Simulator with Large Language Models [50.35333054932747]
We introduce a novel social simulator called YuLan-OneSim. Users can simply describe and refine their simulation scenarios through natural language interactions with our simulator. We implement 50 default simulation scenarios spanning 8 domains, including economics, sociology, politics, psychology, organization, demographics, law, and communication.
arXiv Detail & Related papers (2025-05-12T14:05:17Z) - Can LLMs Simulate Personas with Reversed Performance? A Benchmark for Counterfactual Instruction Following [12.145213376813155]
Large Language Models (LLMs) are increasingly used to simulate personas in virtual environments. We show that even state-of-the-art LLMs cannot simulate personas with reversed performance.
arXiv Detail & Related papers (2025-04-08T22:00:32Z) - A LLM-based Controllable, Scalable, Human-Involved User Simulator Framework for Conversational Recommender Systems [14.646529557978512]
Conversational Recommender Systems (CRSs) leverage real-time feedback from users to dynamically model their preferences.
The advent of Large Language Models (LLMs) has marked the onset of a new epoch in computational capabilities.
We introduce a Controllable, Scalable, and Human-Involved (CSHI) simulator framework that manages the behavior of user simulators.
arXiv Detail & Related papers (2024-05-13T03:02:56Z) - How Reliable is Your Simulator? Analysis on the Limitations of Current LLM-based User Simulators for Conversational Recommendation [14.646529557978512]
We analyze the limitations of using Large Language Models to construct user simulators for Conversational Recommender Systems.
Data leakage, which occurs in conversational history and the user simulator's replies, results in inflated evaluation results.
We propose SimpleUserSim, employing a straightforward strategy to guide the topic toward the target items.
arXiv Detail & Related papers (2024-03-25T04:21:06Z) - User Behavior Simulation with Large Language Model based Agents [116.74368915420065]
We propose an LLM-based agent framework and design a sandbox environment to simulate real user behaviors.
Based on extensive experiments, we find that the simulated behaviors of our method are very close to those of real humans.
arXiv Detail & Related papers (2023-06-05T02:58:35Z) - HandoverSim: A Simulation Framework and Benchmark for Human-to-Robot Object Handovers [60.45158007016316]
"HandoverSim" is a simulation benchmark for human-to-robot object handovers.
We leverage a recent motion capture dataset of hand grasping of objects.
We create training and evaluation environments for the receiver with standardized protocols and metrics.
arXiv Detail & Related papers (2022-05-19T17:59:00Z) - Metaphorical User Simulators for Evaluating Task-oriented Dialogue Systems [80.77917437785773]
Task-oriented dialogue systems (TDSs) are assessed mainly in an offline setting or through human evaluation.
We propose a metaphorical user simulator for end-to-end TDS evaluation, where we define a simulator to be metaphorical if it simulates a user's analogical thinking in interactions with systems.
We also propose a tester-based evaluation framework to generate variants, i.e., dialogue systems with different capabilities.
arXiv Detail & Related papers (2022-04-02T05:11:03Z) - Multi-Agent Task-Oriented Dialog Policy Learning with Role-Aware Reward Decomposition [64.06167416127386]
We propose Multi-Agent Dialog Policy Learning, which regards both the system and the user as dialog agents.
The two agents interact with each other and are jointly learned.
Results show that our method can successfully build a system policy and a user policy simultaneously.
arXiv Detail & Related papers (2020-04-08T04:51:40Z)
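To make the interaction pattern shared by SimulatorArena and the user-simulation papers above concrete, here is a minimal sketch of a profile-conditioned simulation loop. The function names, profile dictionary, and `<end>` stop signal are hypothetical placeholders; no specific API from any of these papers is assumed.

```python
# Generic sketch of a profile-conditioned user-simulator loop.
# `assistant_reply` and `simulate_user_turn` are hypothetical stand-ins
# for calls to an assistant LLM and a user-simulator LLM, respectively.

def assistant_reply(history: list[str]) -> str:
    """Placeholder: query the assistant LLM with the conversation so far."""
    raise NotImplementedError

def simulate_user_turn(profile: dict, history: list[str]) -> str:
    """Placeholder: query the user-simulator LLM, conditioned on a profile
    capturing traits such as background and message style."""
    raise NotImplementedError

def run_simulated_conversation(profile: dict, opening_message: str,
                               max_turns: int = 10) -> list[str]:
    """Alternate assistant and simulated-user turns until the simulator stops."""
    history = [opening_message]          # the simulated user speaks first
    for _ in range(max_turns):
        history.append(assistant_reply(history))
        user_message = simulate_user_turn(profile, history)
        if user_message == "<end>":      # hypothetical end-of-task signal
            break
        history.append(user_message)
    return history
```

Each simulated transcript can then be rated and compared against ratings from real human conversations, as in the Spearman sketch after the abstract above.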