Learning to summarize user information for personalized reinforcement learning from human feedback
- URL: http://arxiv.org/abs/2507.13579v2
- Date: Fri, 26 Sep 2025 20:32:18 GMT
- Title: Learning to summarize user information for personalized reinforcement learning from human feedback
- Authors: Hyunji Nam, Yanming Wan, Mickel Liu, Jianxun Lian, Peter Ahnn, Natasha Jaques
- Abstract summary: Preference Learning Using Summarization (PLUS) uses reinforcement learning to produce text-based summaries of each user's preferences. Both the user-summarization model and the reward model are trained simultaneously, creating an online co-adaptation loop. We show that PLUS captures diverse aspects of user preferences, achieving an 11-77% improvement in reward model accuracy.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: As everyday use cases of large language model (LLM) AI assistants have expanded, it is becoming increasingly important to personalize responses to align with different users' preferences and goals. While reinforcement learning from human feedback (RLHF) is effective at improving LLMs to be generally more helpful and fluent, it does not account for variability across users, as it models the entire user population with a single reward model, meaning it assumes that everyone's preferences are the same. We present a novel framework, Preference Learning Using Summarization (PLUS), that uses reinforcement learning (RL) to learn to produce text-based summaries of each user's preferences, characteristics, and past conversations. These summaries condition the reward model, enabling it to make personalized predictions about the types of responses valued by each user. Both the user-summarization model and reward model are trained simultaneously, creating an online co-adaptation loop. We show that in contrast to the standard Bradley-Terry model, summaries produced by PLUS capture diverse aspects of user preferences, achieving an 11-77% improvement in reward model accuracy. Key strengths of PLUS are: (1) robust performance with new users and conversation topics, achieving a 25% improvement over the best personalized RLHF technique; (2) zero-shot personalization with state-of-the-art proprietary models like GPT-4 (e.g., PLUS-summary-conditioned responses achieved a 72% win rate compared to 28% for default GPT-4o); (3) learning from flexible user contexts beyond preference labels; and (4) interpretable representation of users, enabling greater transparency and user control in pluralistic LLM alignment.
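The abstract above describes a summary-conditioned reward model trained with a Bradley-Terry preference objective. A minimal sketch of that objective follows; the featurizer, function names, and training update are illustrative assumptions for exposition, not the paper's actual implementation:

```python
import numpy as np

# Toy sketch: a reward model conditioned on a text summary of the user,
# trained with the Bradley-Terry preference objective. The featurizer and
# all names are illustrative assumptions, not the paper's implementation.

def featurize(summary: str, response: str, dim: int = 8) -> np.ndarray:
    """Stand-in featurizer: bucket tokens of summary + response by char-code sum."""
    vec = np.zeros(dim)
    for tok in (summary + " " + response).split():
        vec[sum(ord(c) for c in tok) % dim] += 1.0
    return vec

def bt_loss(w: np.ndarray, summary: str, chosen: str, rejected: str) -> float:
    """Negative log-likelihood of P(chosen > rejected) = sigmoid(r_chosen - r_rejected)."""
    diff = w @ (featurize(summary, chosen) - featurize(summary, rejected))
    return float(np.log1p(np.exp(-diff)))

def bt_step(w: np.ndarray, summary: str, chosen: str, rejected: str,
            lr: float = 0.1) -> np.ndarray:
    """One gradient-ascent step on the BT log-likelihood of the reward weights."""
    d = featurize(summary, chosen) - featurize(summary, rejected)
    p = 1.0 / (1.0 + np.exp(-(w @ d)))  # current P(chosen preferred)
    return w + lr * (1.0 - p) * d
```

In PLUS the summary itself is also produced by a learned policy and updated in the same loop as the reward model; here the summary is held as fixed text for brevity.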
Related papers
- One Adapts to Any: Meta Reward Modeling for Personalized LLM Alignment [55.86333374784959]
We argue that addressing these constraints requires a paradigm shift from fitting data to learn user preferences to learning the process of preference adaptation. We propose Meta Reward Modeling (MRM), which reformulates personalized reward modeling as a meta-learning problem. We show that MRM enhances few-shot personalization, improves user robustness, and consistently outperforms baselines.
arXiv Detail & Related papers (2026-01-26T17:55:52Z) - Aligning Large Language Models via Fully Self-Synthetic Data [20.05693955243206]
Traditional reinforcement learning from human feedback (RLHF) for large language models (LLMs) relies on expensive human-annotated datasets. In this work, we introduce Self-Alignment Optimization (SAO), a fully self-synthetic framework for LLM alignment. Experiments demonstrate that SAO effectively enhances the model's chat capabilities on standard benchmarks like AlpacaEval2.0.
arXiv Detail & Related papers (2025-10-08T05:07:45Z) - Learning from Natural Language Feedback for Personalized Question Answering [21.115495457454365]
Personalization is crucial for enhancing the effectiveness and user satisfaction of language technologies. Current approaches for personalizing large language models (LLMs) often rely on retrieval-augmented generation (RAG). We introduce Vac, a novel framework for personalized response generation that replaces scalar rewards with natural language feedback (NLF).
arXiv Detail & Related papers (2025-08-14T14:36:53Z) - User Feedback in Human-LLM Dialogues: A Lens to Understand Users But Noisy as a Learning Signal [58.43749783815486]
We study implicit user feedback in two user-LM interaction datasets. We find that the contents of user feedback can improve model performance in short human-designed questions. We also find that the usefulness of user feedback is largely tied to the quality of the user's initial prompt.
arXiv Detail & Related papers (2025-07-30T23:33:29Z) - LoRe: Personalizing LLMs via Low-Rank Reward Modeling [47.12507639759984]
We introduce a novel framework that leverages low-rank preference modeling to efficiently learn and generalize user-specific reward functions. We validate our method on multiple preference datasets, demonstrating superior generalization to unseen users and improved accuracy in preference prediction tasks.
arXiv Detail & Related papers (2025-04-20T01:16:24Z) - Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale [51.9706400130481]
Large Language Models (LLMs) have emerged as personalized assistants for users across a wide range of tasks. PERSONAMEM features curated user profiles with over 180 simulated user-LLM interaction histories. We evaluate LLM chatbots' ability to identify the most suitable response according to the current state of the user's profile.
arXiv Detail & Related papers (2025-04-19T08:16:10Z) - Enhancing Personalized Multi-Turn Dialogue with Curiosity Reward [11.495697919066341]
We propose leveraging a user model to incorporate a curiosity-based intrinsic reward into multi-turn RLHF. This novel reward mechanism encourages the LLM agent to actively infer user traits by optimizing conversations to improve its user model's accuracy. We demonstrate our method's effectiveness in two distinct domains: significantly improving personalization performance in a conversational recommendation task, and personalizing conversations for different learning styles in an educational setting.
arXiv Detail & Related papers (2025-04-04T06:35:02Z) - Language Model Personalization via Reward Factorization [38.30745045315918]
We introduce a framework that extends RLHF to enable user personalization. We represent user-specific rewards as a linear combination of base reward functions. In human evaluations, our method achieves a 67% win rate over default GPT-4o responses.
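The reward factorization this abstract describes can be sketched in a few lines: a user's reward is a weighted combination of fixed base reward functions, and only the per-user weights are fit from that user's preference pairs. The base rewards and names below are toy assumptions, not the paper's actual components:

```python
import numpy as np

# Toy sketch of reward factorization: a user's reward is a linear combination
# of a few fixed base reward functions, and only the per-user weights alpha
# are fit from that user's preference pairs. Base rewards are illustrative.

def base_rewards(response: str) -> np.ndarray:
    """Three toy base rewards: length, a politeness marker, an explanation marker."""
    return np.array([
        float(len(response.split())),
        float("thanks" in response or "please" in response),
        float("because" in response),
    ])

def user_reward(alpha: np.ndarray, response: str) -> float:
    """User-specific reward: weighted sum of the shared base rewards."""
    return float(alpha @ base_rewards(response))

def fit_user(prefs, lr: float = 0.1, steps: int = 200) -> np.ndarray:
    """Fit user weights from (chosen, rejected) pairs by Bradley-Terry gradient ascent."""
    alpha = np.zeros(3)
    for _ in range(steps):
        for chosen, rejected in prefs:
            d = base_rewards(chosen) - base_rewards(rejected)
            p = 1.0 / (1.0 + np.exp(-(alpha @ d)))  # P(chosen preferred)
            alpha += lr * (1.0 - p) * d
    return alpha
```

Because the base functions are shared across users, a new user needs only a handful of labeled pairs to obtain a personalized reward.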
arXiv Detail & Related papers (2025-03-08T23:41:20Z) - Rehearse With User: Personalized Opinion Summarization via Role-Playing based on Large Language Models [29.870187698924852]
Large language models face difficulties in personalized tasks involving long texts. By having the model act as the user, it can better understand the user's personalized needs. Our method can effectively improve the level of personalization in summaries generated by large models.
arXiv Detail & Related papers (2025-03-01T11:05:01Z) - FSPO: Few-Shot Preference Optimization of Synthetic Preference Data in LLMs Elicits Effective Personalization to Real Users [111.56469697145519]
We propose Few-Shot Preference Optimization, which reframes reward modeling as a meta-learning problem. Under this framework, an LLM learns to quickly adapt to a user via a few labeled preferences from that user, constructing a personalized reward function for them. We generate over 1M synthetic personalized preferences using publicly available LLMs. We evaluate FSPO on personalized open-ended generation for up to 1,500 synthetic users across three domains: movie reviews, pedagogical adaptation based on educational background, and general question answering, along with a controlled human study.
arXiv Detail & Related papers (2025-02-26T17:08:46Z) - UQABench: Evaluating User Embedding for Prompting LLMs in Personalized Question Answering [39.79275025010785]
UQABench is a benchmark designed to evaluate the effectiveness of user embeddings in prompting large language models for personalization. We conduct extensive experiments on various state-of-the-art methods for modeling user embeddings.
arXiv Detail & Related papers (2025-02-26T14:34:00Z) - Lifelong Personalized Low-Rank Adaptation of Large Language Models for Recommendation [50.837277466987345]
We focus on the field of large language models (LLMs) for recommendation.
We propose RecLoRA, which incorporates a Personalized LoRA module that maintains independent LoRAs for different users.
We also design a Few2Many Learning Strategy, using a conventional recommendation model as a lens to magnify small training spaces to full spaces.
arXiv Detail & Related papers (2024-08-07T04:20:28Z) - Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback [110.16220825629749]
Learning from preference feedback has emerged as an essential step for improving the generation quality and performance of modern language models.
In this work, we identify four core aspects of preference-based learning: preference data, learning algorithm, reward model, and policy training prompts.
Our findings indicate that all aspects are important for performance, with better preference data leading to the largest improvements.
arXiv Detail & Related papers (2024-06-13T16:17:21Z) - Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts [95.09994361995389]
Relative Preference Optimization (RPO) is designed to discern between more and less preferred responses derived from both identical and related prompts.
RPO has demonstrated a superior ability to align large language models with user preferences and to improve their adaptability during the training process.
arXiv Detail & Related papers (2024-02-12T22:47:57Z) - Personalized Language Modeling from Personalized Human Feedback [45.16986573937782]
Personalized large language models (LLMs) are designed to tailor responses to individual user preferences. We propose Personalized-RLHF (P-RLHF), an efficient framework that utilizes a lightweight user model to capture individual user preferences. We show that personalized LLMs trained using P-RLHF generate responses that are more closely aligned with individual user preferences.
arXiv Detail & Related papers (2024-02-06T04:18:58Z) - User Embedding Model for Personalized Language Prompting [9.472634942498859]
We introduce a new User Embedding Module (UEM) that efficiently processes user history in free-form text by compressing and representing them as embeddings.
Our experiments demonstrate the superior capability of this approach in handling significantly longer histories.
The main contribution of this research is to demonstrate the ability to bias language models with user signals represented as embeddings.
arXiv Detail & Related papers (2024-01-10T00:35:52Z) - RecExplainer: Aligning Large Language Models for Explaining Recommendation Models [50.74181089742969]
Large language models (LLMs) have demonstrated remarkable intelligence in understanding, reasoning, and instruction following.
This paper presents the initial exploration of using LLMs as surrogate models to explain black-box recommender models.
To facilitate an effective alignment, we introduce three methods: behavior alignment, intention alignment, and hybrid alignment.
arXiv Detail & Related papers (2023-11-18T03:05:43Z) - COLA: Improving Conversational Recommender Systems by Collaborative Augmentation [9.99763097964222]
We propose a collaborative augmentation (COLA) method to improve both item representation learning and user preference modeling.
We construct an interactive user-item graph from all conversations, which augments item representations with user-aware information.
To improve user preference modeling, we retrieve similar conversations from the training corpus, where the involved items and attributes that reflect the user's potential interests are used to augment the user representation.
arXiv Detail & Related papers (2022-12-15T12:37:28Z) - Personalized Reward Learning with Interaction-Grounded Learning (IGL) [7.898208662809734]
Modern recommender systems typically optimize for the same fixed combination of implicit feedback signals across all users.
We propose applying the recent Interaction Grounded Learning paradigm to address the challenge of learning representations of diverse user communication modalities.
arXiv Detail & Related papers (2022-11-28T23:18:10Z) - Unsupervised Neural Stylistic Text Generation using Transfer Learning and Adapters [66.17039929803933]
We propose a novel transfer learning framework which updates only 0.3% of model parameters to learn style-specific attributes for response generation.
We learn style-specific attributes from the PERSONALITY-CAPTIONS dataset.
arXiv Detail & Related papers (2022-10-07T00:09:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.