Teaching Language Models to Evolve with Users: Dynamic Profile Modeling for Personalized Alignment
- URL: http://arxiv.org/abs/2505.15456v1
- Date: Wed, 21 May 2025 12:38:36 GMT
- Title: Teaching Language Models to Evolve with Users: Dynamic Profile Modeling for Personalized Alignment
- Authors: Weixiang Zhao, Xingyu Sui, Yulin Hu, Jiahe Guo, Haixiao Liu, Biye Li, Yanyan Zhao, Bing Qin, Ting Liu,
- Abstract summary: We introduce the Reinforcement Learning for Personalized Alignment (RLPA) framework to iteratively infer and refine user profiles through dialogue.<n>We instantiate RLPA by fine-tuning Qwen-2.5-3B-Instruct, resulting in Qwen-RLPA, which achieves state-of-the-art performance in personalized dialogue.
- Score: 35.68913976348608
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Personalized alignment is essential for enabling large language models (LLMs) to engage effectively in user-centric dialogue. While recent prompt-based and offline optimization methods offer preliminary solutions, they fall short in cold-start scenarios and long-term personalization due to their inherently static and shallow designs. In this work, we introduce the Reinforcement Learning for Personalized Alignment (RLPA) framework, in which an LLM interacts with a simulated user model to iteratively infer and refine user profiles through dialogue. The training process is guided by a dual-level reward structure: the Profile Reward encourages accurate construction of user representations, while the Response Reward incentivizes generation of responses consistent with the inferred profile. We instantiate RLPA by fine-tuning Qwen-2.5-3B-Instruct, resulting in Qwen-RLPA, which achieves state-of-the-art performance in personalized dialogue. Empirical evaluations demonstrate that Qwen-RLPA consistently outperforms prompting and offline fine-tuning baselines, and even surpasses advanced commercial models such as Claude-3.5 and GPT-4o. Further analysis highlights Qwen-RLPA's robustness in reconciling conflicting user preferences, sustaining long-term personalization and delivering more efficient inference compared to recent reasoning-focused LLMs. These results emphasize the potential of dynamic profile inference as a more effective paradigm for building personalized dialogue systems.
Related papers
- Synthetic Interaction Data for Scalable Personalization in Large Language Models [67.31884245564086]
We introduce a high-fidelity synthetic data generation framework called PersonaGym.<n>Unlike prior work that treats personalization as static persona-preference pairs, PersonaGym models a dynamic preference process.<n>We release PersonaAtlas, a large-scale, high-quality, and diverse synthetic dataset of high-fidelity multi-turn personalized interaction trajectories.
arXiv Detail & Related papers (2026-02-12T20:41:22Z) - PersoDPO: Scalable Preference Optimization for Instruction-Adherent, Persona-Grounded Dialogue via Multi-LLM Evaluation [20.228114552545772]
PersoDPO is a scalable preference optimisation framework.<n>It integrates evaluation metrics targeting coherence and personalization, along with a length-format compliance feature.<n>Experiments on the FoCus dataset show that an open-source language model fine-tuned with the PersoDPO framework consistently outperforms strong open-source baselines.
arXiv Detail & Related papers (2026-02-04T12:34:55Z) - One Adapts to Any: Meta Reward Modeling for Personalized LLM Alignment [55.86333374784959]
We argue that addressing these constraints requires a paradigm shift from fitting data to learn user preferences to learn the process of preference adaptation.<n>We propose Meta Reward Modeling (MRM), which reformulates personalized reward modeling as a meta-learning problem.<n>We show that MRM enhances few-shot personalization, improves user robustness, and consistently outperforms baselines.
arXiv Detail & Related papers (2026-01-26T17:55:52Z) - Profile-LLM: Dynamic Profile Optimization for Realistic Personality Expression in LLMs [11.672385046863655]
PersonaPulse is a framework that iteratively enhances role-play prompts while integrating a situational response benchmark as a scoring tool.<n> Quantitative evaluations demonstrate that the prompts generated by PersonaPulse outperform those of prior work.<n>For certain personality traits, the extent of personality evocation can be partially controlled by pausing the optimization process.
arXiv Detail & Related papers (2025-11-25T02:31:40Z) - Reflective Personalization Optimization: A Post-hoc Rewriting Framework for Black-Box Large Language Models [16.152962349146275]
We propose Reflective Personalization Optimization (RPO), a framework that redefines the personalization paradigm by decoupling content generation from alignment.<n>RPO operates in two distinct stages: first, a base model generates a high-quality, generic response; then, an external reflection module explicitly rewrites this output to align with the user's preferences.<n> Comprehensive experiments on the LaMP benchmark demonstrate that RPO, by decoupling content generation from personalization, significantly outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2025-11-07T14:48:49Z) - Learning from Natural Language Feedback for Personalized Question Answering [21.115495457454365]
Personalization is crucial for enhancing the effectiveness and user satisfaction of language technologies.<n>Current approaches for personalizing large language models (LLMs) often rely on retrieval-augmented generation (RAG)<n>We introduce Vac, a novel framework for personalized response generation that replaces scalar rewards with natural language feedback (NLF)
arXiv Detail & Related papers (2025-08-14T14:36:53Z) - Debate, Reflect, and Distill: Multi-Agent Feedback with Tree-Structured Preference Optimization for Efficient Language Model Enhancement [43.532921045069365]
Large Language Models (LLMs) continue to set new standards in knowledge-intensive and complex reasoning tasks.<n>Current techniques, such as static knowledge distillation, resource-intensive reinforcement learning from human feedback, or limited self-reflection to yield substantial and lasting performance gains.<n>We present a novel Reflect and Debate (D&R) framework that orchestrates multi-turn debates between smaller models and stronger teacher models, eliciting actionable feedback.
arXiv Detail & Related papers (2025-06-04T03:52:20Z) - NextQuill: Causal Preference Modeling for Enhancing LLM Personalization [82.15961484963256]
We introduce NextQuill, a novel personalization framework grounded in causal preference modeling.<n>Building on this insight, NextQuill introduces two complementary alignment strategies.<n> Experiments across multiple personalization benchmarks demonstrate that NextQuill significantly improves personalization quality.
arXiv Detail & Related papers (2025-06-03T02:08:55Z) - What Makes LLMs Effective Sequential Recommenders? A Study on Preference Intensity and Temporal Context [56.590259941275434]
RecPO is a preference optimization framework for sequential recommendation.<n>It exploits adaptive reward margins based on inferred preference hierarchies and temporal signals.<n>It mirrors key characteristics of human decision-making: favoring timely satisfaction, maintaining coherent preferences, and exercising discernment under shifting contexts.
arXiv Detail & Related papers (2025-06-02T21:09:29Z) - A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations [112.81207927088117]
PersonaConvBench is a benchmark for evaluating personalized reasoning and generation in multi-turn conversations with large language models (LLMs)<n>We benchmark several commercial and open-source LLMs under a unified prompting setup and observe that incorporating personalized history yields substantial performance improvements.
arXiv Detail & Related papers (2025-05-20T09:13:22Z) - Know You First and Be You Better: Modeling Human-Like User Simulators via Implicit Profiles [37.43150003866563]
User simulators are crucial for replicating human interactions with dialogue systems.<n>We propose User Simulator with implicit Profiles (USP), a framework that infers implicit user profiles from human-machine conversations.<n>USP outperforms strong baselines in terms of authenticity and diversity while achieving comparable performance in consistency.
arXiv Detail & Related papers (2025-02-26T09:26:54Z) - Disentangling Length Bias In Preference Learning Via Response-Conditioned Modeling [87.17041933863041]
Reinforcement Learning from Human Feedback (RLHF) has achieved considerable success in aligning large language models (LLMs)<n>We introduce a $textbfR$esponse-$textbfc$onditioned $textbfB$radley-$textbfT$erry (Rc-BT) model that enhances the model's capability in length bias mitigating and length instruction following.<n>We also propose the Rc-RM and Rc-DPO algorithm to leverage the Rc-BT model for reward modeling and direct policy optimization
arXiv Detail & Related papers (2025-02-02T14:50:25Z) - Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language Models [54.381650481255235]
We introduce a new tuning-free approach for self-alignment, Dynamic Rewarding with Prompt Optimization (O)
Our approach leverages a search-based optimization framework that allows LLMs to iteratively self-improve and craft the optimal alignment instructions.
Empirical evaluations on eight recent LLMs, both open and closed-sourced, demonstrate that DRPO significantly enhances alignment performance.
arXiv Detail & Related papers (2024-11-13T16:15:38Z) - PAD: Personalized Alignment of LLMs at Decoding-Time [10.347782385286582]
This paper presents a novel framework designed to align LLM outputs with diverse personalized preferences during the inference phase.<n>The Personalized Alignment at Decoding-time (PAD) framework decouples the text generation process from personalized preferences.<n>PAD not only outperforms existing training-based alignment methods in terms of aligning with diverse preferences but also shows significant generalizability to preferences unseen during training.
arXiv Detail & Related papers (2024-10-05T08:00:55Z) - TSO: Self-Training with Scaled Preference Optimization [14.3799656174528]
We propose TSO, a framework for preference optimization that conducts self-training preference learning without training an additional reward model.
TSO enhances the diversity of responses by constructing a model matrix and incorporating human preference responses.
Experimental results demonstrate that TSO outperforms existing mainstream methods on various alignment evaluation benchmarks.
arXiv Detail & Related papers (2024-08-31T05:37:01Z) - Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z) - Aligning Large Language Models with Counterfactual DPO [1.8130068086063336]
This paper explores the utilization of counterfactual prompting to align the model's style without relying on human intervention.
We demonstrate that this method effectively instils desirable behaviour, mitigates undesirable ones, and encourages the model to disregard inappropriate instructions.
arXiv Detail & Related papers (2024-01-17T19:43:43Z) - Unlocking the Potential of User Feedback: Leveraging Large Language
Model as User Simulator to Enhance Dialogue System [65.93577256431125]
We propose an alternative approach called User-Guided Response Optimization (UGRO) to combine it with a smaller task-oriented dialogue model.
This approach uses LLM as annotation-free user simulator to assess dialogue responses, combining them with smaller fine-tuned end-to-end TOD models.
Our approach outperforms previous state-of-the-art (SOTA) results.
arXiv Detail & Related papers (2023-06-16T13:04:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.