SAGE: A Top-Down Bottom-Up Knowledge-Grounded User Simulator for Multi-turn AGent Evaluation
- URL: http://arxiv.org/abs/2510.11997v1
- Date: Mon, 13 Oct 2025 22:52:17 GMT
- Title: SAGE: A Top-Down Bottom-Up Knowledge-Grounded User Simulator for Multi-turn AGent Evaluation
- Authors: Ryan Shea, Yunan Lu, Liang Qiu, Zhou Yu
- Abstract summary: We propose SAGE, a novel user Simulation framework for multi-turn AGent Evaluation. SAGE incorporates top-down knowledge rooted in business logic, such as ideal customer profiles. We find that this approach produces interactions that are more realistic and diverse, while also identifying up to 33% more agent errors.
- Score: 17.11268616243772
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating multi-turn interactive agents is challenging due to the need for human assessment. Evaluation with simulated users has been introduced as an alternative; however, existing approaches typically model generic users and overlook the domain-specific principles required to capture realistic behavior. We propose SAGE, a novel user Simulation framework for multi-turn AGent Evaluation that integrates knowledge from business contexts. SAGE incorporates top-down knowledge rooted in business logic, such as ideal customer profiles, grounding user behavior in realistic customer personas. We further integrate bottom-up knowledge taken from business agent infrastructure (e.g., product catalogs, FAQs, and knowledge bases), allowing the simulator to generate interactions that reflect users' information needs and expectations in a company's target market. Through empirical evaluation, we find that this approach produces interactions that are more realistic and diverse, while also identifying up to 33% more agent errors, highlighting its effectiveness as an evaluation tool to support bug-finding and iterative agent improvement.
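The abstract describes two knowledge sources for the simulator: top-down persona knowledge (ideal customer profiles) and bottom-up knowledge drawn from agent infrastructure (product catalogs, FAQs, knowledge bases). The following minimal sketch illustrates how such grounding could condition an LLM-based simulated user; the names (Persona, KnowledgeBase, simulate_user_turn, call_llm) and the keyword-overlap retrieval are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a knowledge-grounded user simulator in the spirit of SAGE.
# All names and the retrieval strategy are illustrative assumptions, not SAGE's actual code.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Persona:
    """Top-down knowledge: an ideal-customer-profile-style persona."""
    role: str
    goals: List[str]
    constraints: List[str] = field(default_factory=list)

@dataclass
class KnowledgeBase:
    """Bottom-up knowledge: product catalogs, FAQs, KB articles."""
    documents: List[str]

    def retrieve(self, query: str, k: int = 3) -> List[str]:
        # Naive keyword-overlap retrieval; a real system would likely use embeddings.
        scored = sorted(
            self.documents,
            key=lambda d: -len(set(query.lower().split()) & set(d.lower().split())),
        )
        return scored[:k]

def simulate_user_turn(persona: Persona,
                       kb: KnowledgeBase,
                       dialogue: List[str],
                       call_llm: Callable[[str], str]) -> str:
    """Produce the next simulated-user utterance, grounded in persona and retrieved knowledge."""
    last_agent_msg = dialogue[-1] if dialogue else " ".join(persona.goals)
    grounding = "\n".join(kb.retrieve(last_agent_msg))
    prompt = (
        f"You are a customer: {persona.role}. Goals: {'; '.join(persona.goals)}.\n"
        f"Constraints: {'; '.join(persona.constraints)}.\n"
        f"Relevant company knowledge:\n{grounding}\n"
        f"Conversation so far:\n" + "\n".join(dialogue) +
        "\nReply with your next message as this customer."
    )
    return call_llm(prompt)

# Example usage with a stubbed LLM (replace call_llm with a real chat-model call):
persona = Persona(role="small-business owner comparing pricing tiers",
                  goals=["find a plan under $50/month", "confirm the refund policy"])
kb = KnowledgeBase(documents=[
    "FAQ: Refunds are issued within 14 days of purchase.",
    "Catalog: Starter plan costs $29/month and includes 3 seats.",
])
reply = simulate_user_turn(persona, kb,
                           ["Agent: Hi! How can I help you today?"],
                           call_llm=lambda p: "(simulated user reply)")
```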
Related papers
- Exploring Recommender System Evaluation: A Multi-Modal User Agent Framework for A/B Testing [54.456400601801704]
We introduce a multi-modal user agent for A/B testing (A/B Agent). Specifically, we construct a recommendation sandbox environment for A/B testing, enabling multimodal and multi-page interactions. We validated the potential of the agent as an alternative to traditional A/B testing from three perspectives: model, data, and features.
arXiv Detail & Related papers (2026-01-08T03:33:43Z) - Agentic Persona Control and Task State Tracking for Realistic User Simulation in Interactive Scenarios [0.0]
We present a novel multi-agent framework for realistic, explainable human user simulation in interactive scenarios. We employ persona control and task state tracking to mirror human cognitive processes during goal-oriented conversations.
arXiv Detail & Related papers (2025-11-30T20:25:56Z) - OutboundEval: A Dual-Dimensional Benchmark for Expert-Level Intelligent Outbound Evaluation of Xbench's Professional-Aligned Series [36.88936933010042]
OutboundEval is a comprehensive benchmark for evaluating large language models (LLMs) in intelligent outbound calling scenarios. We design a benchmark spanning six major business domains and 30 representative sub-scenarios, each with scenario-specific process decomposition, weighted scoring, and domain-adaptive metrics. We introduce a dynamic evaluation method that adapts to task variations, integrating automated and human-in-the-loop assessment to measure task execution accuracy, professional knowledge application, adaptability, and user experience quality.
arXiv Detail & Related papers (2025-10-24T08:27:58Z) - How can we assess human-agent interactions? Case studies in software agent design [52.953425368394306]
We make two major steps towards the rigorous assessment of human-agent interactions. We propose PULSE, a framework for more efficient human-centric evaluation of agent designs. We deploy the framework on a large-scale web platform built around the open-source software agent OpenHands.
arXiv Detail & Related papers (2025-10-10T19:04:28Z) - Beyond Static Evaluation: Rethinking the Assessment of Personalized Agent Adaptability in Information Retrieval [12.058221341033835]
We propose a conceptual lens for rethinking evaluation in adaptive personalization. We organize this lens around three core components: (1) persona-based user simulation with temporally evolving preference models; (2) structured elicitation protocols inspired by reference interviews to extract preferences in context; and (3) adaptation-aware evaluation mechanisms that measure how agent behavior improves across sessions and tasks.
arXiv Detail & Related papers (2025-10-05T00:35:37Z) - RecoWorld: Building Simulated Environments for Agentic Recommender Systems [55.979427290369216]
We present RecoWorld, a blueprint for building simulated environments tailored to agentic recommender systems. A user simulator reviews recommended items, updates its mindset, and when sensing potential user disengagement, generates reflective instructions. The agentic recommender adapts its recommendations by incorporating these user instructions and reasoning traces, creating a dynamic feedback loop.
arXiv Detail & Related papers (2025-09-12T16:44:34Z) - JudgeAgent: Knowledge-wise and Dynamic LLM Evaluation with Agent-as-Interviewer [19.09571232466437]
We propose Agent-as-Interviewer, a dynamic evaluation paradigm for large language models (LLMs). Unlike current benchmarking or dynamic interaction paradigms, Agent-as-Interviewer utilizes agents to invoke knowledge tools, drawing on wider and deeper knowledge during dynamic multi-turn question generation. We develop JudgeAgent, a knowledge-wise dynamic evaluation framework that employs knowledge-driven synthesis as the agent's tool and uses difficulty scoring as strategy guidance.
arXiv Detail & Related papers (2025-09-02T08:52:16Z) - CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions [85.88573535033406]
CRMArena-Pro is a novel benchmark for holistic, realistic assessment of LLM agents in diverse professional settings. It incorporates multi-turn interactions guided by diverse personas and robust confidentiality awareness assessments. Experiments reveal leading LLM agents achieve only around 58% single-turn success on CRMArena-Pro, with performance dropping significantly to approximately 35% in multi-turn settings.
arXiv Detail & Related papers (2025-05-24T21:33:22Z) - Dynamic Evaluation Framework for Personalized and Trustworthy Agents: A Multi-Session Approach to Preference Adaptability [10.443994990138973]
We argue for a paradigm shift in evaluating personalized and adaptive agents. We propose a comprehensive novel framework that models user personas with unique attributes and preferences. Our flexible framework is designed to support a variety of agents and applications, ensuring a comprehensive and versatile evaluation of recommendation strategies.
arXiv Detail & Related papers (2025-03-08T22:50:26Z) - Tell Me More! Towards Implicit User Intention Understanding of Language Model Driven Agents [110.25679611755962]
Current language model-driven agents often lack mechanisms for effective user participation, which is crucial given the vagueness commonly found in user instructions.
We introduce Intention-in-Interaction (IN3), a novel benchmark designed to inspect users' implicit intentions through explicit queries.
We empirically train Mistral-Interact, a powerful model that proactively assesses task vagueness, inquires user intentions, and refines them into actionable goals.
arXiv Detail & Related papers (2024-02-14T14:36:30Z) - A Meta-learning based Stacked Regression Approach for Customer Lifetime Value Prediction [3.6002910014361857]
Customer Lifetime Value (CLV) is the total monetary value of transactions/purchases made by a customer with the business over an intended period of time.
CLV finds application in a number of distinct business domains such as Banking, Insurance, Online-entertainment, Gaming, and E-Commerce.
We propose a system that is both effective and comprehensive, yet simple and interpretable.
arXiv Detail & Related papers (2023-08-07T14:22:02Z) - User Behavior Simulation with Large Language Model based Agents [116.74368915420065]
We propose an LLM-based agent framework and design a sandbox environment to simulate real user behaviors.
Based on extensive experiments, we find that the simulated behaviors of our method closely match those of real humans.
arXiv Detail & Related papers (2023-06-05T02:58:35Z) - Metaphorical User Simulators for Evaluating Task-oriented Dialogue Systems [80.77917437785773]
Task-oriented dialogue systems (TDSs) are assessed mainly in an offline setting or through human evaluation.
We propose a metaphorical user simulator for end-to-end TDS evaluation, where we define a simulator as metaphorical if it simulates a user's analogical thinking in interactions with systems.
We also propose a tester-based evaluation framework to generate variants, i.e., dialogue systems with different capabilities.
arXiv Detail & Related papers (2022-04-02T05:11:03Z) - Unsatisfied Today, Satisfied Tomorrow: a simulation framework for performance evaluation of crowdsourcing-based network monitoring [68.8204255655161]
We propose an empirical framework tailored to assess the quality of the detection of under-performing cells.
The framework simulates both the processes of satisfaction surveys delivery and users satisfaction prediction.
We use the simulation framework to empirically test the detection of under-performing sites in general scenarios.
arXiv Detail & Related papers (2020-10-30T10:03:48Z)