Human vs. Agent in Task-Oriented Conversations
- URL: http://arxiv.org/abs/2509.17619v2
- Date: Wed, 05 Nov 2025 03:47:48 GMT
- Title: Human vs. Agent in Task-Oriented Conversations
- Authors: Zhefan Wang, Ning Geng, Zhiqiang Guo, Weizhi Ma, Min Zhang
- Abstract summary: This work presents the first systematic comparison between large language model (LLM)-simulated users and human users in personalized task-oriented conversations. Our analysis reveals significant behavioral differences between the two user types in problem-solving approaches.
- Score: 22.743152820695588
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Task-oriented conversational systems are essential for efficiently addressing diverse user needs, yet their development requires substantial amounts of high-quality conversational data that is challenging and costly to obtain. While large language models (LLMs) have demonstrated potential in generating synthetic conversations, the extent to which these agent-generated interactions can effectively substitute real human conversations remains unclear. This work presents the first systematic comparison between LLM-simulated users and human users in personalized task-oriented conversations. We propose a comprehensive analytical framework encompassing three key aspects (conversation strategy, interaction style, and conversation evaluation) and ten distinct dimensions for evaluating user behaviors, and collect parallel conversational datasets from both human users and LLM agent users across four representative scenarios under identical conditions. Our analysis reveals significant behavioral differences between the two user types in problem-solving approaches, question broadness, user engagement, context dependency, feedback polarity and promise, language style, and hallucination awareness. We found consistency in the agent users and human users across the depth-first or breadth-first dimensions, as well as the usefulness dimensions. These findings provide critical insights for advancing LLM-based user simulation. Our multi-dimensional taxonomy constructed a generalizable framework for analyzing user behavior patterns, offering insights from LLM agent users and human users. By this work, we provide perspectives on rethinking how to use user simulation in conversational systems in the future.
Related papers
- Agentic Persona Control and Task State Tracking for Realistic User Simulation in Interactive Scenarios [0.0]
We present a novel multi-agent framework for realistic, explainable human user simulation in interactive scenarios. We employ persona control and task state tracking to mirror human cognitive processes during goal-oriented conversations.
arXiv Detail & Related papers (2025-11-30T20:25:56Z) - TwinVoice: A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation [55.55404595177229]
Large Language Models (LLMs) are exhibiting emergent human-like abilities. TwinVoice is a benchmark for assessing persona simulation across diverse real-world contexts.
arXiv Detail & Related papers (2025-10-29T14:00:42Z) - Investigating Thematic Patterns and User Preferences in LLM Interactions using BERTopic [4.087884819027264]
This study applies BERTopic to the lmsys-chat-1m dataset, a multilingual conversational corpus built from head-to-head evaluations of large language models (LLMs). The main objective is uncovering thematic patterns in these conversations and examining their relation to user preferences. We analysed relationships between topics and model preferences to identify trends in model-topic alignment.
arXiv Detail & Related papers (2025-10-08T21:13:44Z) - RMTBench: Benchmarking LLMs Through Multi-Turn User-Centric Role-Playing [133.0641538589466]
RMTBench is a comprehensive user-centric bilingual role-playing benchmark featuring 80 diverse characters and over 8,000 dialogue rounds. Our benchmark constructs dialogues based on explicit user motivations rather than character descriptions, ensuring alignment with practical user applications. By shifting focus from character background to user intention fulfillment, RMTBench bridges the gap between academic evaluation and practical deployment requirements.
arXiv Detail & Related papers (2025-07-27T16:49:47Z) - A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations [112.81207927088117]
PersonaConvBench is a benchmark for evaluating personalized reasoning and generation in multi-turn conversations with large language models (LLMs). We benchmark several commercial and open-source LLMs under a unified prompting setup and observe that incorporating personalized history yields substantial performance improvements.
arXiv Detail & Related papers (2025-05-20T09:13:22Z) - Exploring the Impact of Personality Traits on Conversational Recommender Systems: A Simulation with Large Language Models [70.180385882195]
This paper introduces a personality-aware user simulation for Conversational Recommender Systems (CRSs). The user agent induces customizable personality traits and preferences, while the system agent possesses the persuasion capability to simulate realistic interaction in CRSs. Experimental results demonstrate that state-of-the-art LLMs can effectively generate diverse user responses aligned with specified personality traits.
arXiv Detail & Related papers (2025-04-09T13:21:17Z) - REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation [51.97224538045096]
We introduce REALTALK, a 21-day corpus of authentic messaging app dialogues. We compare EI attributes and persona consistency to understand the challenges posed by real-world dialogues. Our findings reveal that models struggle to simulate a user solely from dialogue history, while fine-tuning on specific user chats improves persona emulation.
arXiv Detail & Related papers (2025-02-18T20:29:01Z) - Simulating User Diversity in Task-Oriented Dialogue Systems using Large Language Models [11.708400514900053]
We employ two proprietary Large Language Models (LLMs) to generate a heterogeneous base of user profiles. We perform a detailed analysis of the user profiles generated by LLMs to assess the diversity, consistency, and potential biases inherent in these simulations. We find that GPT-o1 generates a more heterogeneous user distribution across most user attributes, while GPT-4o generates more skewed user attributes.
arXiv Detail & Related papers (2025-02-18T12:20:16Z) - Aligning LLMs with Individual Preferences via Interaction [51.72200436159636]
We train large language models (LLMs) that can "interact to align". We develop a multi-turn preference dataset containing 3K+ multi-turn conversations in tree structures. For evaluation, we establish the ALOE benchmark, consisting of 100 carefully selected examples and well-designed metrics to measure the customized alignment performance during conversations.
arXiv Detail & Related papers (2024-10-04T17:48:29Z) - Approximating Online Human Evaluation of Social Chatbots with Prompting [11.657633779338724]
Existing evaluation metrics aim to automate offline user evaluation and approximate human judgment of pre-curated dialogs.
We propose an approach to approximate online human evaluation leveraging large language models (LLMs) from the GPT family.
We introduce a new Dialog system Evaluation framework based on Prompting (DEP), which enables a fully automatic evaluation pipeline.
arXiv Detail & Related papers (2023-04-11T14:45:01Z) - Dialogue History Matters! Personalized Response Selection in Multi-turn Retrieval-based Chatbots [62.295373408415365]
We propose a personalized hybrid matching network (PHMN) for context-response matching.
Our contributions are two-fold: 1) our model extracts personalized wording behaviors from user-specific dialogue history as extra matching information.
We evaluate our model on two large datasets with user identification, i.e., the personalized dialogue Corpus Ubuntu (P-Ubuntu) and the personalized Weibo dataset (P-Weibo).
arXiv Detail & Related papers (2021-03-17T09:42:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.