LaMP-QA: A Benchmark for Personalized Long-form Question Answering
- URL: http://arxiv.org/abs/2506.00137v2
- Date: Sat, 20 Sep 2025 14:37:31 GMT
- Title: LaMP-QA: A Benchmark for Personalized Long-form Question Answering
- Authors: Alireza Salemi, Hamed Zamani,
- Abstract summary: We introduce LaMP-QA -- a benchmark designed for evaluating personalized long-form answer generation.<n>The benchmark covers questions from three major categories: (1) Arts & Entertainment, (2) Lifestyle & Personal Development, and (3) Society & Culture, encompassing over 45 subcategories in total.<n>Our results show that incorporating the personalized context provided leads to up to 39% performance improvements.
- Score: 37.909483957959715
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Personalization is essential for question answering systems that are user-centric. Despite its importance, personalization in answer generation has been relatively underexplored. This is mainly due to lack of resources for training and evaluating personalized question answering systems. We address this gap by introducing LaMP-QA -- a benchmark designed for evaluating personalized long-form answer generation. The benchmark covers questions from three major categories: (1) Arts & Entertainment, (2) Lifestyle & Personal Development, and (3) Society & Culture, encompassing over 45 subcategories in total. To assess the quality and potential impact of the LaMP-QA benchmark for personalized question answering, we conduct comprehensive human and automatic evaluations, to compare multiple evaluation strategies for evaluating generated personalized responses and measure their alignment with human preferences. Furthermore, we benchmark a number of non-personalized and personalized approaches based on open-source and proprietary large language models. Our results show that incorporating the personalized context provided leads to up to 39% performance improvements. The benchmark is publicly released to support future research in this area.
Related papers
- Towards Personalized Deep Research: Benchmarks and Evaluations [56.581105664044436]
We introduce Personalized Deep Research Bench, the first benchmark for evaluating personalization in Deep Research Agents (DRAs)<n>It pairs 50 diverse research tasks with 25 authentic user profiles that combine structured persona attributes with dynamic real-world contexts, yielding 250 realistic user-task queries.<n>Our experiments on a range of systems highlight current capabilities and limitations in handling personalized deep research.
arXiv Detail & Related papers (2025-09-29T17:39:17Z) - Pathways of Thoughts: Multi-Directional Thinking for Long-form Personalized Question Answering [57.12316804290369]
Personalization is essential for adapting question answering systems to user-specific information needs.<n>We propose Pathways of Thoughts (PoT), an inference-stage method that applies to any large language model (LLM) without requiring task-specific fine-tuning.<n>PoT consistently outperforms competitive baselines, achieving up to a 13.1% relative improvement.
arXiv Detail & Related papers (2025-09-23T14:44:46Z) - PersonaFeedback: A Large-scale Human-annotated Benchmark For Personalization [25.45861816665351]
We introduce PersonaFeedback, a new benchmark that directly evaluates LLMs' ability to provide personalized responses.<n>Unlike existing benchmarks that require models to infer implicit user personas from historical interactions, PersonaFeedback decouples persona inference from personalization.<n> PersonaFeedback consists of 8298 human-annotated test cases, which are categorized into easy, medium, and hard tiers.
arXiv Detail & Related papers (2025-06-15T17:19:19Z) - Social Bias in Popular Question-Answering Benchmarks [0.0]
Question-answering (QA) and reading comprehension (RC) benchmarks are essential for assessing the capabilities of large language models (LLMs) in retrieving and reproducing knowledge.<n>We demonstrate that popular QA and RC benchmarks are biased and do not cover questions about different demographics or regions in a representative way.
arXiv Detail & Related papers (2025-05-21T14:14:47Z) - A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations [112.81207927088117]
PersonaConvBench is a benchmark for evaluating personalized reasoning and generation in multi-turn conversations with large language models (LLMs)<n>We benchmark several commercial and open-source LLMs under a unified prompting setup and observe that incorporating personalized history yields substantial performance improvements.
arXiv Detail & Related papers (2025-05-20T09:13:22Z) - From Guessing to Asking: An Approach to Resolving the Persona Knowledge Gap in LLMs during Multi-Turn Conversations [11.958380211411386]
This study introduces the persona knowledge gap, the discrepancy between a model's internal understanding and the knowledge required for coherent, personalized conversations.<n>We propose Conversation Preference Elicitation and Recommendation (CPER), a novel framework that dynamically detects and resolves persona knowledge gaps.<n>CPER consists of three key modules: a Contextual Understanding Module for preference extraction, a Dynamic Feedback Module for measuring uncertainty and refining persona alignment, and a Persona-Driven Response Generation module for adapting responses based on accumulated user context.
arXiv Detail & Related papers (2025-03-16T15:55:29Z) - HREF: Human Response-Guided Evaluation of Instruction Following in Language Models [61.273153125847166]
We develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF)<n>In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination.<n>We study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template.
arXiv Detail & Related papers (2024-12-20T03:26:47Z) - Do RAG Systems Cover What Matters? Evaluating and Optimizing Responses with Sub-Question Coverage [74.70255719194819]
We introduce a novel framework based on sub-question coverage, which measures how well a RAG system addresses different facets of a question.
We use this framework to evaluate three commercial generative answer engines: You.com, Perplexity AI, and Bing Chat.
We find that while all answer engines cover core sub-questions more often than background or follow-up ones, they still miss around 50% of core sub-questions.
arXiv Detail & Related papers (2024-10-20T22:59:34Z) - PersoBench: Benchmarking Personalized Response Generation in Large Language Models [6.8046587254152735]
We present a new benchmark, PersoBench, to evaluate the personalization ability of large language models (LLMs) in persona-aware dialogue generation.
Our analysis, conducted on three well-known persona-aware datasets, evaluates multiple dimensions of response quality, including fluency, diversity, coherence, and personalization.
arXiv Detail & Related papers (2024-10-04T07:29:41Z) - Ranking Generated Answers: On the Agreement of Retrieval Models with Humans on Consumer Health Questions [25.158868133182025]
We present a method for evaluating the output of generative large language models (LLMs)<n>We use ranking models trained on annotated document collections as a substitute for explicit relevance.<n>In a user study, our method correlates with the preferences of a human expert.
arXiv Detail & Related papers (2024-08-19T09:27:45Z) - Towards Personalized Evaluation of Large Language Models with An
Anonymous Crowd-Sourcing Platform [64.76104135495576]
We propose a novel anonymous crowd-sourcing evaluation platform, BingJian, for large language models.
Through this platform, users have the opportunity to submit their questions, testing the models on a personalized and potentially broader range of capabilities.
arXiv Detail & Related papers (2024-03-13T07:31:20Z) - PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is an innovative framework dedicated to assessing longtext generation.
It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers.
It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions.
arXiv Detail & Related papers (2024-01-26T18:12:25Z) - A Critical Evaluation of Evaluations for Long-form Question Answering [48.51361567469683]
Long-form question answering (LFQA) enables answering a wide range of questions, but its flexibility poses enormous challenges for evaluation.
We perform the first targeted study of the evaluation of long-form answers, covering both human and automatic evaluation practices.
arXiv Detail & Related papers (2023-05-29T16:54:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.