Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering
- URL: http://arxiv.org/abs/2602.19317v1
- Date: Sun, 22 Feb 2026 19:43:43 GMT
- Title: Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering
- Authors: Maryam Amirizaniani, Alireza Salemi, Hamed Zamani
- Abstract summary: Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users' background, preferences, and historical context. We propose PR2, a reinforcement learning framework that integrates reasoning and retrieval from personal context for personalization.
- Score: 39.08300602619814
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users' background, preferences, and historical context. Existing state-of-the-art methods primarily rely on retrieval-augmented generation (RAG) solutions that construct personal context by retrieving relevant items from the user's profile. These methods typically use the user's query directly to retrieve personal documents, a strategy that often leads to surface-level personalization. We propose PR2 (Personalized Retrieval-Augmented Reasoning), a reinforcement learning framework that integrates reasoning and retrieval from personal context for personalization. PR2 learns adaptive retrieval-reasoning policies, determining when to retrieve, what evidence to retrieve from user profiles, and how to incorporate it into intermediate reasoning steps. By optimizing multi-turn reasoning trajectories under a personalized reward function, the framework reinforces reasoning paths that better align with user-specific preferences and the contextual signals reflected by the reward model. Extensive experiments on the LaMP-QA benchmark using three LLMs show that PR2 consistently outperforms strong baselines, achieving average relative improvements of 8.8%-12% in personalized QA.
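Based only on the abstract, here is a minimal Python sketch of the retrieve-then-reason loop PR2 describes. Every name below is a hypothetical stand-in: the policy, retriever, and reward model would all be learned components in the actual system.

```python
# Toy sketch of a PR2-style retrieve-and-reason rollout (assumptions, not the
# authors' code): a policy alternates retrieval and reasoning, and the finished
# trajectory is scored by a personalized reward for RL training.
from dataclasses import dataclass, field

@dataclass
class Turn:
    thought: str
    action: str          # "retrieve" or "answer"
    payload: str         # a retrieval query, or the final answer

@dataclass
class Trajectory:
    turns: list = field(default_factory=list)
    reward: float = 0.0

def policy_step(question, snippets, history):
    """Stand-in for the learned policy deciding when to retrieve vs. answer;
    in PR2 this is an LLM trained with RL, here a toy heuristic."""
    if len(history) < 2:
        return Turn("need more user context", "retrieve", question)
    return Turn("evidence looks sufficient", "answer",
                f"answer grounded in {len(snippets)} profile snippets")

def retrieve(query, profile):
    """Stand-in retriever over the user's profile (e.g., BM25 or dense)."""
    return [doc for doc in profile if any(w in doc for w in query.lower().split())]

def personalized_reward(answer, user_prefs):
    """Stand-in for the personalized reward model scoring preference alignment."""
    return sum(p in answer for p in user_prefs) / max(len(user_prefs), 1)

def rollout(question, profile, user_prefs, max_turns=4):
    traj, snippets = Trajectory(), []
    for _ in range(max_turns):
        turn = policy_step(question, snippets, traj.turns)
        traj.turns.append(turn)
        if turn.action == "retrieve":
            snippets += retrieve(turn.payload, profile)
        else:
            traj.reward = personalized_reward(turn.payload, user_prefs)
            break
    # Trajectories like this would be optimized with a policy-gradient method;
    # the specific RL algorithm is not stated in the abstract.
    return traj

profile = ["user enjoys hiking in rainy weather", "user prefers concise answers"]
print(rollout("what hiking gear should I buy", profile, ["profile"]).reward)
```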
Related papers
- Towards Realistic Personalization: Evaluating Long-Horizon Preference Following in Personalized User-LLM Interactions [50.70965714314064]
Large Language Models (LLMs) are increasingly serving as personal assistants, where users share complex and diverse preferences over extended interactions. This work proposes RealPref, a benchmark for evaluating realistic preference-following in personalized user-LLM interactions.
arXiv Detail & Related papers (2026-03-04T15:42:43Z)
- Optimizing User Profiles via Contextual Bandits for Retrieval-Augmented LLM Personalization [27.490675380289318]
We argue that relevance serves as an unreliable proxy for utility. We propose PURPLE, a contextual bandit framework that oPtimizes UseR Profiles for Llm pErsonalization. In contrast to a greedy selection of the most relevant records, PURPLE treats profile construction as a set generation process.
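As a rough illustration of treating profile construction as bandit-driven set selection rather than greedy relevance ranking, here is a toy epsilon-greedy sketch; the records, utility signal, and update rule are all assumptions, not PURPLE's actual algorithm.

```python
# Hypothetical bandit-style profile construction: pick a set of records by
# learned utility estimates, then credit observed end-task utility back.
import random

def build_profile(records, context, k, q_values, epsilon=0.1):
    """Select k profile records epsilon-greedily on utility, not relevance."""
    chosen, pool = [], list(records)
    for _ in range(k):
        if random.random() < epsilon:
            pick = random.choice(pool)                                   # explore
        else:
            pick = max(pool, key=lambda r: q_values.get((context, r), 0.0))  # exploit
        chosen.append(pick)
        pool.remove(pick)
    return chosen

def update(q_values, context, chosen, utility, lr=0.1):
    """Credit the observed answer utility back to each chosen record."""
    for r in chosen:
        q = q_values.get((context, r), 0.0)
        q_values[(context, r)] = q + lr * (utility - q)

q, records = {}, ["likes jazz", "vegan", "lives in Oslo", "reads sci-fi"]
for _ in range(100):                                  # simulated interaction loop
    sel = build_profile(records, "music question", k=2, q_values=q)
    utility = 1.0 if "likes jazz" in sel else 0.0     # stand-in for answer quality
    update(q, "music question", sel, utility)
print(build_profile(records, "music question", k=2, q_values=q, epsilon=0.0))
```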
arXiv Detail & Related papers (2026-01-17T15:05:36Z)
- Personalize Before Retrieve: LLM-based Personalized Query Expansion for User-Centric Retrieval [34.298743064665395]
Personalize Before Retrieve (PBR) is a framework that incorporates user-specific signals into query expansion prior to retrieval. PBR consistently outperforms strong baselines, with up to 10% gains on PersonaBench across retrievers.
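A minimal sketch of the personalize-before-retrieve idea: expand the query with profile-derived terms before it reaches any retriever. The helper names and the word-overlap scoring below are illustrative stand-ins for the paper's LLM-based expansion.

```python
# Hypothetical query expansion from a user profile (not PBR's actual prompts).
def extract_signals(profile_docs, query, top_k=3):
    """Stand-in for an LLM that mines query-relevant personal terms."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(d.lower().split())), d) for d in profile_docs]
    return [d for s, d in sorted(scored, reverse=True)[:top_k] if s > 0]

def expand_query(query, profile_docs):
    """Personalize the query *before* retrieval, not after."""
    signals = extract_signals(profile_docs, query)
    return query + " " + " ".join(signals) if signals else query

profile = ["I train for trail marathons and eat vegan", "I am allergic to peanuts"]
print(expand_query("what should I eat after a run", profile))
```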
arXiv Detail & Related papers (2025-10-10T02:24:09Z)
- Personalized Reasoning: Just-In-Time Personalization and Why LLMs Fail At It [81.50711040539566]
Current large language model (LLM) development treats task-solving and preference alignment as separate challenges. We introduce PREFDISCO, an evaluation methodology that transforms static benchmarks into interactive personalization tasks. Our framework creates scenarios where identical questions require different reasoning chains depending on user context.
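A toy illustration, with structure assumed from the abstract alone, of how one static benchmark item can fan out into persona-conditioned variants that demand different reasoning chains.

```python
# Hypothetical persona-conditioning of a static QA item (not PREFDISCO's schema).
personas = [
    {"expertise": "novice", "format": "analogy-first"},
    {"expertise": "expert", "format": "derivation-first"},
]

def personalize(item, persona):
    """Same question, different target reasoning chain per persona."""
    return {"question": item["question"],
            "persona": persona,
            "expected_style": persona["format"]}   # grading hook, not ground truth

static_item = {"question": "Why does the sky look blue?"}
for task in (personalize(static_item, p) for p in personas):
    print(task["persona"]["expertise"], "->", task["expected_style"])
```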
arXiv Detail & Related papers (2025-09-30T18:55:28Z)
- Pathways of Thoughts: Multi-Directional Thinking for Long-form Personalized Question Answering [57.12316804290369]
Personalization is essential for adapting question answering systems to user-specific information needs. We propose Pathways of Thoughts (PoT), an inference-stage method that applies to any large language model (LLM) without requiring task-specific fine-tuning. PoT consistently outperforms competitive baselines, achieving up to a 13.1% relative improvement.
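A hedged sketch of the inference-stage idea: sample several reasoning pathways, score each against the user's context, and keep the best. The generator, judge, and selection rule here are placeholders, not PoT's actual procedure.

```python
# Hypothetical multi-pathway inference (no fine-tuning of the base model).
import random

def generate_pathway(question, seed):
    """Stand-in for one LLM reasoning trajectory (a 'pathway')."""
    random.seed(seed)
    angle = random.choice(["preferences", "history", "constraints"])
    return f"answer to '{question}' emphasizing the user's {angle}"

def score(pathway, user_context):
    """Stand-in judge: overlap between the pathway and the user's context."""
    return sum(tok in pathway for tok in user_context)

def pathways_of_thoughts(question, user_context, n=5):
    candidates = [generate_pathway(question, s) for s in range(n)]
    return max(candidates, key=lambda c: score(c, user_context))

print(pathways_of_thoughts("plan my weekend", ["history", "constraints"]))
```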
arXiv Detail & Related papers (2025-09-23T14:44:46Z)
- PrLM: Learning Explicit Reasoning for Personalized RAG via Contrastive Reward Optimization [4.624026598342624]
We propose PrLM, a reinforcement learning framework that trains LLMs to explicitly reason over retrieved user profiles. PrLM effectively learns from user responses without requiring annotated reasoning paths. Experiments on three personalized text generation datasets show that PrLM outperforms existing methods.
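One way to read "learning from user responses without annotated reasoning paths" is a pairwise contrastive reward over sampled generations; the sketch below is an assumption-laden toy, not PrLM's published objective.

```python
# Hypothetical contrastive reward from user-response signals.
import math

def user_response_score(output, user_feedback):
    """Stand-in: how well the output matches signals in the user's response."""
    return sum(tok in output for tok in user_feedback)

def contrastive_reward(chosen, rejected, user_feedback, tau=1.0):
    """Softmax margin over a pair of generations: rewards the reasoning path
    whose output the user responds to better, with no reasoning annotations."""
    s_c = user_response_score(chosen, user_feedback)
    s_r = user_response_score(rejected, user_feedback)
    return math.exp(s_c / tau) / (math.exp(s_c / tau) + math.exp(s_r / tau))

fb = ["casual", "short"]
print(contrastive_reward("short casual reply", "formal essay", fb))  # > 0.5
```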
arXiv Detail & Related papers (2025-08-10T13:37:26Z)
- A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations [112.81207927088117]
PersonaConvBench is a benchmark for evaluating personalized reasoning and generation in multi-turn conversations with large language models (LLMs). We benchmark several commercial and open-source LLMs under a unified prompting setup and observe that incorporating personalized history yields substantial performance improvements.
arXiv Detail & Related papers (2025-05-20T09:13:22Z)
- Reinforcing Compositional Retrieval: Retrieving Step-by-Step for Composing Informative Contexts [67.67746334493302]
Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous tasks, yet they often rely on external context to handle complex tasks. We propose a tri-encoder sequential retriever that models this process as a Markov Decision Process (MDP). We show that our method consistently and significantly outperforms baselines, underscoring the importance of explicitly modeling inter-example dependencies.
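A small sketch of sequential retrieval as a greedy policy over an MDP whose state includes the documents already chosen, so redundancy is penalized. The bag-of-words encoder and scoring weights stand in for the paper's three learned encoders.

```python
# Hypothetical dependency-aware sequential retrieval (toy encoders and weights).
def encode(text):
    return set(text.lower().split())   # stand-in for a learned encoder

def step_score(query_vec, doc_vec, selected_vecs):
    """Reward query relevance, penalize overlap with already-selected docs."""
    rel = len(query_vec & doc_vec)
    red = max((len(doc_vec & s) for s in selected_vecs), default=0)
    return rel - 0.5 * red

def sequential_retrieve(query, corpus, k=2):
    """Greedy policy over the MDP: state = (query, docs chosen so far)."""
    q, chosen, chosen_vecs = encode(query), [], []
    pool = list(corpus)
    for _ in range(k):
        best = max(pool, key=lambda d: step_score(q, encode(d), chosen_vecs))
        chosen.append(best)
        chosen_vecs.append(encode(best))
        pool.remove(best)
    return chosen

corpus = ["python sorting tutorial",
          "python sorting guide",
          "how fast sorting runs big-O analysis"]
print(sequential_retrieve("how fast is python sorting", corpus))
```

Because each step conditions on what was already picked, the near-duplicate "guide" loses to a complementary document, which is the inter-example dependency the abstract emphasizes.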
arXiv Detail & Related papers (2025-04-15T17:35:56Z)
- From Guessing to Asking: An Approach to Resolving the Persona Knowledge Gap in LLMs during Multi-Turn Conversations [11.958380211411386]
This study introduces the persona knowledge gap, the discrepancy between a model's internal understanding and the knowledge required for coherent, personalized conversations. We propose Conversation Preference Elicitation and Recommendation (CPER), a novel framework that dynamically detects and resolves persona knowledge gaps. CPER consists of three key modules: a Contextual Understanding Module for preference extraction, a Dynamic Feedback Module for measuring uncertainty and refining persona alignment, and a Persona-Driven Response Generation module for adapting responses based on accumulated user context.
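The three modules map naturally onto a pipeline; the sketch below wires up toy stand-ins for each, with all interfaces assumed from the abstract rather than taken from the paper.

```python
# Hypothetical CPER-style pipeline: extract preferences, measure uncertainty,
# then either ask a clarifying question or answer from the persona.
def contextual_understanding(history):
    """Module 1: extract stated preferences from the conversation so far."""
    return {w for turn in history for w in turn.lower().split()
            if w.endswith("food")}

def dynamic_feedback(prefs):
    """Module 2: if the persona is too uncertain, ask instead of guessing."""
    if not prefs:
        return "What kind of food do you prefer?", True
    return None, False

def persona_driven_response(prefs):
    """Module 3: generate the answer conditioned on the accumulated persona."""
    return f"Recommending places for {', '.join(sorted(prefs))} near you."

history = ["hi there!", "I love seafood and live jazz"]
prefs = contextual_understanding(history)
question, must_ask = dynamic_feedback(prefs)
print(question if must_ask else persona_driven_response(prefs))
```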
arXiv Detail & Related papers (2025-03-16T15:55:29Z)
- Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts [95.09994361995389]
Relative Preference Optimization (RPO) is designed to discern between more and less preferred responses derived from both identical and related prompts.
RPO has demonstrated a superior ability to align large language models with user preferences and to improve their adaptability during the training process.
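A worked toy version of a relative-preference loss: a DPO-style pairwise term applied both within a prompt and across related prompts, with a weight for prompt relatedness. The exact weighting in RPO differs and is defined in the paper.

```python
# Hypothetical relative-preference loss over same-prompt and cross-prompt pairs.
import math

def pairwise_loss(logp_w, logp_l, beta=0.1):
    """DPO-style term: the preferred response should out-score the dispreferred."""
    return -math.log(1.0 / (1.0 + math.exp(-beta * (logp_w - logp_l))))

def rpo_loss(pairs, beta=0.1):
    """pairs: (logp_preferred, logp_dispreferred, weight), where weight encodes
    how related the two responses' prompts are (1.0 for the identical prompt)."""
    total = sum(w * pairwise_loss(lw, ll, beta) for lw, ll, w in pairs)
    return total / sum(w for _, _, w in pairs)

pairs = [(-1.0, -3.0, 1.0),   # contrast within the same prompt: full weight
         (-1.2, -2.5, 0.6)]   # contrast across related prompts: reduced weight
print(rpo_loss(pairs))
```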
arXiv Detail & Related papers (2024-02-12T22:47:57Z)