Simulating Before Planning: Constructing Intrinsic User World Model for User-Tailored Dialogue Policy Planning
- URL: http://arxiv.org/abs/2504.13643v1
- Date: Fri, 18 Apr 2025 11:48:55 GMT
- Title: Simulating Before Planning: Constructing Intrinsic User World Model for User-Tailored Dialogue Policy Planning
- Authors: Tao He, Lizi Liao, Ming Liu, Bing Qin
- Abstract summary: We present the User-Tailored Dialogue Policy Planning (UDP) framework, which incorporates an Intrinsic User World Model to model user traits and feedback. UDP operates in three stages: (1) User Persona Portraying, using a diffusion model to dynamically infer user profiles; (2) User Feedback Anticipating, leveraging a Brownian Bridge-inspired anticipator to predict user reactions; and (3) User-Tailored Policy Planning, integrating these insights to optimize response strategies.
- Score: 31.785493263807684
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in dialogue policy planning have emphasized optimizing system agent policies to achieve predefined goals, focusing on strategy design, trajectory acquisition, and efficient training paradigms. However, these approaches often overlook the critical role of user characteristics, which are essential in real-world scenarios like conversational search and recommendation, where interactions must adapt to individual user traits such as personality, preferences, and goals. To address this gap, we first conduct a comprehensive study utilizing task-specific user personas to systematically assess dialogue policy planning under diverse user behaviors. By leveraging realistic user profiles for different tasks, our study reveals significant limitations in existing approaches, highlighting the need for user-tailored dialogue policy planning. Building on this foundation, we present the User-Tailored Dialogue Policy Planning (UDP) framework, which incorporates an Intrinsic User World Model to model user traits and feedback. UDP operates in three stages: (1) User Persona Portraying, using a diffusion model to dynamically infer user profiles; (2) User Feedback Anticipating, leveraging a Brownian Bridge-inspired anticipator to predict user reactions; and (3) User-Tailored Policy Planning, integrating these insights to optimize response strategies. To ensure robust performance, we further propose an active learning approach that prioritizes challenging user personas during training. Comprehensive experiments on benchmarks, including collaborative and non-collaborative settings, demonstrate the effectiveness of UDP in learning user-specific dialogue strategies. Results validate the protocol's utility and highlight UDP's robustness, adaptability, and potential to advance user-centric dialogue systems.
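The three-stage pipeline described in the abstract can be sketched in outline. Everything below is a hypothetical illustration: the class, function names, and the scalar "engagement" state are placeholders, not the authors' implementation, and the Brownian-bridge term only conveys the pinned-endpoint idea behind the feedback anticipator.

```python
import random

class IntrinsicUserWorldModel:
    """Illustrative stand-in for UDP's user model components.

    `infer_persona` stands in for the diffusion-based persona portrayer and
    `anticipate_feedback` for the Brownian-Bridge-inspired anticipator; the
    names and signatures are assumptions, not the paper's API.
    """

    def infer_persona(self, history):
        # Stage 1: User Persona Portraying - refine a persona estimate from
        # the dialogue so far (here: a trivial counting heuristic).
        user_turns = [turn for turn in history if turn.startswith("user:")]
        return {"observed_user_turns": len(user_turns)}

    def anticipate_feedback(self, start_state, goal_state, t, horizon, noise=0.1):
        # Stage 2: User Feedback Anticipating - a Brownian bridge pins the
        # trajectory at both endpoints, so uncertainty peaks mid-dialogue.
        alpha = t / horizon
        mean = (1 - alpha) * start_state + alpha * goal_state
        return mean + random.gauss(0.0, noise) * (alpha * (1 - alpha)) ** 0.5

def plan_policy(world_model, history, strategies, t, horizon):
    # Stage 3: User-Tailored Policy Planning - score candidate strategies
    # against the anticipated user state and pick the closest match.
    persona = world_model.infer_persona(history)
    predicted = world_model.anticipate_feedback(0.0, 1.0, t, horizon)
    scored = [(abs(predicted - s["expected_engagement"]), s["name"])
              for s in strategies]
    return min(scored)[1], persona
```

The sketch deliberately reduces the user state to one scalar; the point is only the control flow: portray the persona, anticipate feedback, then condition strategy selection on both.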
Related papers
- UserLM-R1: Modeling Human Reasoning in User Language Models with Multi-Reward Reinforcement Learning [32.51053667574764]
We propose UserLM-R1, a novel user language model with reasoning capability. We first construct comprehensive user profiles with both static roles and dynamic scenario-specific goals for adaptation to diverse scenarios. Then, we propose a goal-driven decision-making policy to generate high-quality rationales before producing responses.
arXiv Detail & Related papers (2026-01-14T06:42:01Z) - A General Highly Accurate Online Planning Method Integrating Large Language Models into Nested Rollout Policy Adaptation for Dialogue Tasks [16.400192943577743]
In goal-oriented dialogue tasks, the main challenge is to steer the interaction towards a given goal within a limited number of turns. Existing approaches either rely on elaborate prompt engineering, or integrate policy networks and pre-trained policy models. We present Nested Rollout Policy Adaptation for Goal-oriented Dialogue (NRPA-GD), a novel dialogue policy planning method.
arXiv Detail & Related papers (2025-11-17T02:48:37Z) - Training Proactive and Personalized LLM Agents [107.57805582180315]
We introduce PPP, a multi-objective reinforcement learning approach that jointly optimizes all three dimensions: Productivity, Proactivity, and Personalization. Experiments show that agents trained with PPP achieve substantial improvements over strong baselines such as GPT-5 (+21.6 on average). This work demonstrates that explicitly optimizing for user-centered interaction is critical for building practical and effective AI agents.
arXiv Detail & Related papers (2025-11-04T02:59:36Z) - PRINCIPLES: Synthetic Strategy Memory for Proactive Dialogue Agents [16.819463022406627]
We propose PRINCIPLES: a synthetic strategy memory for proactive dialogue agents. PRINCIPLES is derived through offline self-play simulations and serves as reusable knowledge that guides strategy planning. We evaluate PRINCIPLES in both emotional support and persuasion domains, demonstrating consistent improvements over strong baselines.
arXiv Detail & Related papers (2025-09-22T07:53:59Z) - Thought-Augmented Planning for LLM-Powered Interactive Recommender Agent [56.61028117645315]
We propose a novel thought-augmented interactive recommender agent system (TAIRA) that addresses complex user intents through distilled thought patterns. Specifically, TAIRA is designed as an LLM-powered multi-agent system featuring a manager agent that orchestrates recommendation tasks by decomposing user needs and planning subtasks. Through comprehensive experiments conducted across multiple datasets, TAIRA exhibits significantly enhanced performance compared to existing methods.
arXiv Detail & Related papers (2025-06-30T03:15:50Z) - Enhancing Personalized Multi-Turn Dialogue with Curiosity Reward [11.495697919066341]
Policy agents must be able to personalize their behavior to suit a user's preferences, personality, and attributes.
Current training methods like Reinforcement Learning from Human Feedback (RLHF) prioritize helpfulness and safety but fall short in fostering truly empathetic, adaptive, and personalized interactions.
We propose to incorporate an intrinsic motivation to improve the conversational agent's model of the user as an additional reward alongside multi-turn RLHF.
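The snippet's idea of an intrinsic motivation alongside multi-turn RLHF can be illustrated as simple reward shaping: the extrinsic reward is augmented by a bonus for reducing the agent's error in predicting the user. The shaping form and coefficient below are assumptions for illustration, not the paper's formulation.

```python
def curiosity_augmented_reward(task_reward, user_pred_error_before,
                               user_pred_error_after, beta=0.5):
    """Hypothetical shaping: extrinsic (RLHF-style) reward plus an intrinsic
    bonus proportional to how much this turn improved the agent's model of
    the user (i.e., reduced its user-prediction error)."""
    intrinsic = max(0.0, user_pred_error_before - user_pred_error_after)
    return task_reward + beta * intrinsic
```

The `max(0.0, ...)` clamp means the agent is rewarded only for learning about the user, never penalized for a temporarily worse prediction; that design choice is also an assumption.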
arXiv Detail & Related papers (2025-04-04T06:35:02Z) - Towards Personalized Conversational Sales Agents : Contextual User Profiling for Strategic Action [12.637812936971049]
We introduce Conversational Sales (CSales), a novel task that unifies preference elicitation, recommendation, and persuasion. For a realistic evaluation of CSales, we present CSUser, an LLM-based user simulator constructed from real-world data. We also propose CSI, a conversational sales agent that proactively infers contextual profiles through dialogue for personalized action planning.
arXiv Detail & Related papers (2025-03-28T15:49:52Z) - Strength Lies in Differences! Improving Strategy Planning for Non-collaborative Dialogues via Diversified User Simulation [69.5677514160986]
We investigate non-collaborative dialogue agents, which are expected to engage in strategic conversations with diverse users.
This poses two main challenges for existing dialogue agents.
We propose Trip to enhance the capability for tailored strategic planning, incorporating a user-aware strategic planning module and a population-based training paradigm.
arXiv Detail & Related papers (2024-03-11T14:38:16Z) - Plug-and-Play Policy Planner for Large Language Model Powered Dialogue Agents [121.46051697742608]
We introduce a new dialogue policy planning paradigm to strategize dialogue problems with a tunable language model plug-in named PPDPP.
Specifically, we develop a novel training framework to facilitate supervised fine-tuning over available human-annotated data.
PPDPP consistently and substantially outperforms existing approaches on three different proactive dialogue applications.
arXiv Detail & Related papers (2023-11-01T03:20:16Z) - "Think Before You Speak": Improving Multi-Action Dialog Policy by Planning Single-Action Dialogs [33.78889030078026]
Multi-action dialog policy (MADP) generates multiple atomic dialog actions per turn.
We propose Planning Enhanced Dialog Policy (PEDP), a novel multi-task learning framework that learns single-action dialog dynamics.
Our fully supervised learning-based method achieves a solid task success rate of 90.6%, a 3% improvement over state-of-the-art methods.
arXiv Detail & Related papers (2022-04-25T07:55:53Z) - Interacting with Non-Cooperative User: A New Paradigm for Proactive Dialogue Policy [83.61404191470126]
We propose a new solution named I-Pro that can learn a Proactive policy in the Interactive setting.
Specifically, we learn the trade-off via a learned goal weight, which consists of four factors.
The experimental results demonstrate I-Pro significantly outperforms baselines in terms of effectiveness and interpretability.
arXiv Detail & Related papers (2022-04-07T14:11:31Z) - User Satisfaction Estimation with Sequential Dialogue Act Modeling in Goal-oriented Conversational Systems [65.88679683468143]
We propose a novel framework, namely USDA, to incorporate the sequential dynamics of dialogue acts for predicting user satisfaction.
USDA incorporates the sequential transitions of both content and act features in the dialogue to predict user satisfaction.
Experimental results on four benchmark goal-oriented dialogue datasets show that the proposed method substantially and consistently outperforms existing methods on USE.
arXiv Detail & Related papers (2022-02-07T02:50:07Z) - What Does The User Want? Information Gain for Hierarchical Dialogue Policy Optimisation [3.1433893853959605]
Dialogue policy optimisation via reinforcement learning (RL) is susceptible to sample inefficiency and instability.
We propose the usage of an intrinsic reward based on information gain to address this issue.
Our algorithm, which we call FeudalGain, achieves state-of-the-art results in most environments of the PyDial framework.
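An information-gain intrinsic reward of the kind this snippet describes can be illustrated as entropy reduction over a discrete belief about what the user wants. This toy formulation is an assumption for illustration, not FeudalGain's actual reward.

```python
import math

def entropy(belief):
    # Shannon entropy (in nats) of a discrete belief over user goals.
    return -sum(p * math.log(p) for p in belief if p > 0)

def information_gain_reward(belief_before, belief_after):
    """Hypothetical intrinsic reward: how much a system turn reduced the
    agent's uncertainty about the user's goal. Positive when the posterior
    belief is sharper than the prior."""
    return entropy(belief_before) - entropy(belief_after)
```

A turn that asks a discriminating question collapses a uniform belief toward one goal and earns a positive reward; a turn that leaves the belief unchanged earns zero, which is exactly the behavior an information-gain signal is meant to encourage.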
arXiv Detail & Related papers (2021-09-15T07:21:26Z) - Optimizing Interactive Systems via Data-Driven Objectives [70.3578528542663]
We propose an approach that infers the objective directly from observed user interactions.
These inferences can be made regardless of prior knowledge and across different types of user behavior.
We introduce the Interactive System Optimizer (ISO), a novel algorithm that uses these inferred objectives for optimization.
arXiv Detail & Related papers (2020-06-19T20:49:14Z) - Learning Goal-oriented Dialogue Policy with Opposite Agent Awareness [116.804536884437]
We propose an opposite behavior aware framework for policy learning in goal-oriented dialogues.
We estimate the opposite agent's policy from its behavior and use this estimation to improve the target agent by regarding it as part of the target policy.
arXiv Detail & Related papers (2020-04-21T03:13:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.