Related papers: BESPOKE: Benchmark for Search-Augmented Large Language Model Personalization via Diagnostic Feedback

URL: http://arxiv.org/abs/2509.21106v1
Date: Thu, 25 Sep 2025 12:53:07 GMT
Title: BESPOKE: Benchmark for Search-Augmented Large Language Model Personalization via Diagnostic Feedback
Authors: Hyunseo Kim, Sangam Lee, Kwangwook Seo, Dongha Lee
Abstract summary: We propose BESPOKE, the realistic benchmark for evaluating personalization in search-augmented large language models. BESPOKE is designed to be both realistic, by collecting authentic chat and search histories directly from humans, and diagnostic, by pairing responses with fine-grained preference scores and feedback. We conduct systematic analyses that reveal key requirements for effective personalization in information-seeking tasks.
Score: 9.980170820190093
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Search-augmented large language models (LLMs) have advanced information-seeking tasks by integrating retrieval into generation, reducing users' cognitive burden compared to traditional search systems. Yet they remain insufficient for fully addressing diverse user needs, which requires recognizing how the same query can reflect different intents across users and delivering information in preferred forms. While recent systems such as ChatGPT and Gemini attempt personalization by leveraging user histories, systematic evaluation of such personalization is under-explored. To address this gap, we propose BESPOKE, the realistic benchmark for evaluating personalization in search-augmented LLMs. BESPOKE is designed to be both realistic, by collecting authentic chat and search histories directly from humans, and diagnostic, by pairing responses with fine-grained preference scores and feedback. The benchmark is constructed through long-term, deeply engaged human annotation, where human annotators contributed their own histories, authored queries with detailed information needs, and evaluated responses with scores and diagnostic feedback. Leveraging BESPOKE, we conduct systematic analyses that reveal key requirements for effective personalization in information-seeking tasks, providing a foundation for fine-grained evaluation of personalized search-augmented LLMs. Our code and data are available at https://augustinlib.github.io/BESPOKE/.

Related papers

GISA: A Benchmark for General Information-Seeking Assistant [102.30831921333755]
GISA is a benchmark for General Information-Seeking Assistants comprising 373 human-crafted queries. It integrates both deep reasoning and broad information aggregation within unified tasks, and includes a live subset with periodically updated answers to resist memorization. Experiments on mainstream LLMs and commercial search products reveal that even the best-performing model achieves only 19.30% exact match score.
arXiv Detail & Related papers (2026-02-09T11:44:15Z)
Mem-PAL: Towards Memory-based Personalized Dialogue Assistants for Long-term User-Agent Interaction [55.24448139349266]
We present PAL-Bench, a new benchmark designed to evaluate the personalization capabilities of service-oriented assistants in long-term user-agent interactions. To improve personalized service-oriented interactions, we propose H$^2$Memory, a hierarchical and heterogeneous memory framework.
arXiv Detail & Related papers (2025-11-17T14:22:32Z)
Personalized Reasoning: Just-In-Time Personalization and Why LLMs Fail At It [81.50711040539566]
Current large language model (LLM) development treats task-solving and preference alignment as separate challenges. We introduce PREFDISCO, an evaluation methodology that transforms static benchmarks into interactive personalization tasks. Our framework creates scenarios where identical questions require different reasoning chains depending on user context.
arXiv Detail & Related papers (2025-09-30T18:55:28Z)
A Generative Framework for Personalized Sticker Retrieval [73.57899194210141]
We propose PEARL, a novel generative framework for personalized sticker retrieval. We make two key contributions: (i) to encode user-specific sticker preferences, we design a representation learning model to learn discriminative user representations, and (ii) to generate stickers aligned with a user's query intent, we propose a novel intent-aware learning objective. Empirical results from both offline evaluations and online tests demonstrate that PEARL significantly outperforms state-of-the-art methods.
arXiv Detail & Related papers (2025-09-22T13:11:44Z)
PersonaFeedback: A Large-scale Human-annotated Benchmark For Personalization [25.45861816665351]
We introduce PersonaFeedback, a new benchmark that directly evaluates LLMs' ability to provide personalized responses. Unlike existing benchmarks that require models to infer implicit user personas from historical interactions, PersonaFeedback decouples persona inference from personalization. PersonaFeedback consists of 8298 human-annotated test cases, which are categorized into easy, medium, and hard tiers.
arXiv Detail & Related papers (2025-06-15T17:19:19Z)
LLM-Driven Usefulness Judgment for Web Search Evaluation [12.10711284043516]
Evaluation is fundamental in optimizing search experiences and supporting diverse user intents in Information Retrieval (IR). Traditional search evaluation methods primarily rely on relevance labels, which assess how well retrieved documents match a user's query. In this paper, we explore an alternative approach: LLM-generated usefulness labels, which incorporate both implicit and explicit user behavior signals to evaluate document usefulness.
arXiv Detail & Related papers (2025-04-19T20:38:09Z)
A Survey of Personalized Large Language Models: Progress and Future Directions [86.45576419251302]
Large Language Models (LLMs) excel in handling general knowledge tasks, yet struggle with user-specific personalization. Personalized Large Language Models (PLLMs) tackle these challenges by leveraging individual user data. PLLMs can significantly enhance user satisfaction and have broad applications in conversational agents, systems, emotion recognition, medical assistants, and more.
arXiv Detail & Related papers (2025-02-17T07:58:31Z)
STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases [93.96463520716759]
We develop STaRK, a large-scale semi-structured retrieval benchmark on Textual and Relational Knowledge Bases. Our benchmark covers three domains: product search, academic paper search, and queries in precision medicine. We design a novel pipeline to synthesize realistic user queries that integrate diverse relational information and complex textual properties.
arXiv Detail & Related papers (2024-04-19T22:54:54Z)
Persona-DB: Efficient Large Language Model Personalization for Response Prediction with Collaborative Data Refinement [79.2400720115588]
We introduce Persona-DB, a simple yet effective framework consisting of a hierarchical construction process to improve generalization across task contexts. In the evaluation of response prediction, Persona-DB demonstrates superior context efficiency in maintaining accuracy with a significantly reduced retrieval size. Our experiments also indicate a marked improvement of over 10% under cold-start scenarios, when users have extremely sparse data.
arXiv Detail & Related papers (2024-02-16T20:20:43Z)
Knowledge-Augmented Large Language Models for Personalized Contextual Query Suggestion [16.563311988191636]
We construct an entity-centric knowledge store for each user based on their search and browsing activities on the web. This knowledge store is light-weight, since it only produces user-specific aggregate projections of interests and knowledge onto public knowledge graphs.
arXiv Detail & Related papers (2023-11-10T01:18:47Z)

This list is automatically generated from the titles and abstracts of the papers on this site.