Learning from Natural Language Feedback for Personalized Question Answering
- URL: http://arxiv.org/abs/2508.10695v1
- Date: Thu, 14 Aug 2025 14:36:53 GMT
- Title: Learning from Natural Language Feedback for Personalized Question Answering
- Authors: Alireza Salemi, Hamed Zamani
- Abstract summary: Personalization is crucial for enhancing the effectiveness and user satisfaction of language technologies. Current approaches for personalizing large language models (LLMs) often rely on retrieval-augmented generation (RAG). We introduce VAC, a novel framework for personalized response generation that replaces scalar rewards with natural language feedback (NLF).
- Score: 21.115495457454365
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Personalization is crucial for enhancing both the effectiveness and user satisfaction of language technologies, particularly in information-seeking tasks like question answering. Current approaches for personalizing large language models (LLMs) often rely on retrieval-augmented generation (RAG), followed by reinforcement learning with scalar reward signals to teach models how to use retrieved personal context. We believe that these scalar rewards sometimes provide weak, non-instructive feedback, limiting learning efficiency and personalization quality. We introduce VAC, a novel framework for personalized response generation that replaces scalar rewards with natural language feedback (NLF) that is generated conditioned on the user profiles and the question narratives. NLF serves as a rich and actionable supervision signal, allowing the policy model to iteratively refine its outputs and internalize effective personalization strategies. Training alternates between optimizing the feedback model and fine-tuning the policy model on the improved responses, resulting in a policy model that no longer requires feedback at inference. Evaluation on the LaMP-QA benchmark, which consists of three diverse domains, demonstrates consistent and significant improvements over the state-of-the-art results. Human evaluations further confirm the superior quality of the generated responses. These results demonstrate that NLF provides more effective signals for optimizing personalized question answering.
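The alternating training described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `ToyPolicy`, `ToyFeedbackModel`, `train`, and the example data are all hypothetical stand-ins, and the real models are LLMs whose optimization details the abstract does not specify. The sketch only shows the control flow: optimize the feedback model, let the policy draft and refine using natural language feedback, then fine-tune the policy on the refined responses so that no feedback is needed at inference.

```python
from typing import Dict, List, Tuple


class ToyPolicy:
    """Hypothetical stand-in for the policy LLM (assumption, not from the paper)."""

    def __init__(self) -> None:
        self.memory: Dict[str, str] = {}

    def generate(self, question: str, profile: List[str]) -> str:
        # A real policy would condition on the retrieved profile; this toy
        # version only recalls what it was fine-tuned on.
        return self.memory.get(question, "generic answer")

    def refine(self, draft: str, feedback: str) -> str:
        # Toy refinement: fold the feedback text into the draft.
        return f"{draft} [{feedback}]" if feedback else draft

    def finetune(self, pairs: List[Tuple[str, str]]) -> None:
        # Toy "fine-tuning": memorize the improved responses.
        self.memory.update(pairs)


class ToyFeedbackModel:
    """Hypothetical stand-in for the feedback model (assumption)."""

    def __init__(self) -> None:
        self.rounds_optimized = 0

    def optimize(self) -> None:
        # Placeholder: how the feedback model is optimized is not given
        # in the abstract, so this only counts the rounds.
        self.rounds_optimized += 1

    def feedback(self, question: str, profile: List[str], draft: str) -> str:
        # Toy natural language feedback: name profile facts the draft ignores.
        missing = [fact for fact in profile if fact not in draft]
        return "address: " + ", ".join(missing) if missing else ""


def train(policy: ToyPolicy, feedback_model: ToyFeedbackModel,
          data: List[Tuple[str, List[str]]], rounds: int = 2) -> ToyPolicy:
    for _ in range(rounds):
        feedback_model.optimize()              # step 1: update feedback model
        pairs = []
        for question, profile in data:
            draft = policy.generate(question, profile)
            nlf = feedback_model.feedback(question, profile, draft)
            pairs.append((question, policy.refine(draft, nlf)))
        policy.finetune(pairs)                 # step 2: train on improved responses
    return policy


data = [("What should I cook tonight?", ["vegetarian", "short on time"])]
policy = train(ToyPolicy(), ToyFeedbackModel(), data)
# At inference the policy answers alone; the feedback model is not called.
answer = policy.generate(*data[0])
```

After training, `answer` reflects the profile facts even though no feedback is requested at inference time, which mirrors the property the abstract claims for the trained policy.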
Related papers
- Post-Training Large Language Models via Reinforcement Learning from Self-Feedback [3.73824942136665]
Large Language Models (LLMs) often produce plausible but poorly-calibrated answers. We present Reinforcement Learning from Self-Feedback (RLSF), a post-training stage that uses the model's own confidence as an intrinsic reward.
arXiv Detail & Related papers (2025-07-29T15:46:26Z) - Teaching Language Models to Evolve with Users: Dynamic Profile Modeling for Personalized Alignment [35.68913976348608]
We introduce the Reinforcement Learning for Personalized Alignment (RLPA) framework to iteratively infer and refine user profiles through dialogue. We instantiate RLPA by fine-tuning Qwen-2.5-3B-Instruct, resulting in Qwen-RLPA, which achieves state-of-the-art performance in personalized dialogue.
arXiv Detail & Related papers (2025-05-21T12:38:36Z) - Self-Improvement in Language Models: The Sharpening Mechanism [70.9248553790022]
We offer a new perspective on the capabilities of self-improvement through a lens we refer to as sharpening. Motivated by the observation that language models are often better at verifying response quality than they are at generating correct responses, we formalize self-improvement as using the model itself as a verifier during post-training. We analyze two natural families of self-improvement algorithms based on SFT and RLHF.
arXiv Detail & Related papers (2024-12-02T20:24:17Z) - Aligning Language Models Using Follow-up Likelihood as Reward Signal [40.388526412214276]
We propose using the likelihood of follow-up utterances as rewards to differentiate preferred responses from less favored ones. Our proposed reward mechanism, "Follow-up Likelihood as Reward" (FLR), matches the performance of strong reward models trained on large-scale human or GPT-4 annotated data.
arXiv Detail & Related papers (2024-09-20T23:47:25Z) - Aligning Large Language Models from Self-Reference AI Feedback with one General Principle [61.105703857868775]
We propose a self-reference-based AI feedback framework that enables a 13B Llama2-Chat to provide high-quality feedback.
Specifically, we allow the AI to first respond to the user's instructions, then generate criticism of other answers based on its own response as a reference.
Finally, we determine which answer better fits human preferences according to the criticism.
arXiv Detail & Related papers (2024-06-17T03:51:46Z) - Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs [25.011675414622392]
This study introduces a novel approach to enhance the reward model's generalization ability against distribution shifts.
We retain the base model's language model head and incorporate a suite of text-generation losses to preserve the hidden states' text-generation capabilities.
Our experimental results demonstrate that the introduced regularization technique markedly improves the accuracy of learned reward models.
arXiv Detail & Related papers (2024-06-14T17:49:59Z) - Direct Alignment of Language Models via Quality-Aware Self-Refinement [31.845241241178982]
We investigate the use of intrinsic knowledge within the on-the-fly fine-tuning LLM to obtain relative qualities and help to refine the loss function.
We show that the constructed refinement function can help self-refine the loss function under mild assumptions.
Experiments indicate that they can improve the performance of the fine-tuned models over DPO and IPO.
arXiv Detail & Related papers (2024-05-31T17:31:18Z) - Improving the Validity of Automatically Generated Feedback via Reinforcement Learning [46.667783153759636]
We propose a framework for feedback generation that optimizes both correctness and alignment using reinforcement learning (RL). Specifically, we use GPT-4's annotations to create preferences over feedback pairs in an augmented dataset for training via direct preference optimization (DPO).
arXiv Detail & Related papers (2024-03-02T20:25:50Z) - Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z) - Enabling Language Models to Implicitly Learn Self-Improvement [49.16868302881804]
Large Language Models (LLMs) have demonstrated remarkable capabilities in open-ended text generation tasks.
We propose an ImPlicit Self-ImprovemenT (PIT) framework that implicitly learns the improvement goal from human preference data.
arXiv Detail & Related papers (2023-10-02T04:29:40Z) - Exploring Large Language Model for Graph Data Understanding in Online Job Recommendations [63.19448893196642]
We present a novel framework that harnesses the rich contextual information and semantic representations provided by large language models to analyze behavior graphs.
By leveraging this capability, our framework enables personalized and accurate job recommendations for individual users.
arXiv Detail & Related papers (2023-07-10T11:29:41Z) - Generative Adversarial Reward Learning for Generalized Behavior Tendency Inference [71.11416263370823]
We propose a generative inverse reinforcement learning approach for user behavioral preference modelling.
Our model can automatically learn the rewards from users' actions based on a discriminative actor-critic network and a Wasserstein GAN.
arXiv Detail & Related papers (2021-05-03T13:14:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.