Beyond Static Evaluation: Rethinking the Assessment of Personalized Agent Adaptability in Information Retrieval
- URL: http://arxiv.org/abs/2510.03984v1
- Date: Sun, 05 Oct 2025 00:35:37 GMT
- Title: Beyond Static Evaluation: Rethinking the Assessment of Personalized Agent Adaptability in Information Retrieval
- Authors: Kirandeep Kaur, Preetam Prabhu Srikar Dammu, Hideo Joho, Chirag Shah
- Abstract summary: We propose a conceptual lens for rethinking evaluation in adaptive personalization. We organize this lens around three core components: (1) persona-based user simulation with temporally evolving preference models; (2) structured elicitation protocols inspired by reference interviews to extract preferences in context; and (3) adaptation-aware evaluation mechanisms that measure how agent behavior improves across sessions and tasks.
- Score: 12.058221341033835
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Personalized AI agents are becoming central to modern information retrieval, yet most evaluation methodologies remain static, relying on fixed benchmarks and one-off metrics that fail to reflect how users' needs evolve over time. These limitations hinder our ability to assess whether agents can meaningfully adapt to individuals across dynamic, longitudinal interactions. In this perspective paper, we propose a conceptual lens for rethinking evaluation in adaptive personalization, shifting the focus from static performance snapshots to interaction-aware, evolving assessments. We organize this lens around three core components: (1) persona-based user simulation with temporally evolving preference models; (2) structured elicitation protocols inspired by reference interviews to extract preferences in context; and (3) adaptation-aware evaluation mechanisms that measure how agent behavior improves across sessions and tasks. While recent works have embraced LLM-driven user simulation, we situate this practice within a broader paradigm for evaluating agents over time. To illustrate our ideas, we conduct a case study in e-commerce search using the PersonalWAB dataset. Beyond presenting a framework, our work lays a conceptual foundation for understanding and evaluating personalization as a continuous, user-centric endeavor.
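To make the three components concrete, here is a minimal, self-contained sketch of the kind of evaluation loop the lens implies. Everything in it (the Persona class, the drift model, the adaptation_gain metric) is an illustrative assumption, not code or a metric from the paper:

```python
"""Minimal sketch, under assumed interfaces: a persona whose preferences drift
over sessions, a toy elicitation step, and an adaptation-aware metric that
compares agent quality across sessions instead of reporting one snapshot."""
import random

class Persona:
    """A simulated user whose category preferences drift over time."""
    def __init__(self, prefs, drift=0.1, seed=0):
        self.prefs = dict(prefs)          # category -> weight in [0, 1]
        self.drift = drift
        self.rng = random.Random(seed)

    def evolve(self):
        # Temporally evolving preference model: small random walk per session.
        for k in self.prefs:
            step = self.rng.uniform(-self.drift, self.drift)
            self.prefs[k] = min(1.0, max(0.0, self.prefs[k] + step))

    def elicit(self, category):
        # Structured elicitation: the persona reveals its current weight.
        return self.prefs.get(category, 0.0)

    def satisfaction(self, ranked_cats):
        # Utility of a ranking: preference weight discounted by rank position.
        return sum(self.prefs.get(c, 0.0) / (i + 1) for i, c in enumerate(ranked_cats))

def agent_rank(categories, learned):
    # A trivial "agent": rank categories by its current preference estimates.
    return sorted(categories, key=lambda c: learned.get(c, 0.0), reverse=True)

def run_sessions(n_sessions=5):
    cats = ["electronics", "books", "fashion"]
    persona = Persona({c: random.random() for c in cats}, seed=42)
    learned = {c: 0.5 for c in cats}      # agent's belief about the user
    scores = []
    for s in range(n_sessions):
        persona.evolve()
        # Elicit one category per session, then update the agent's belief.
        asked = cats[s % len(cats)]
        learned[asked] = persona.elicit(asked)
        scores.append(persona.satisfaction(agent_rank(cats, learned)))
    return scores

scores = run_sessions()
# Adaptation-aware view: did quality improve across sessions, not just on average?
adaptation_gain = scores[-1] - scores[0]
print(f"per-session satisfaction: {[round(x, 3) for x in scores]}")
print(f"adaptation gain (last - first): {adaptation_gain:.3f}")
```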
Related papers
- Learning Personalized Agents from Human Feedback [36.47803872623135]
We introduce Personalized Agents from Human Feedback (PAHF), a framework for continual personalization. PAHF learns online from live interaction using explicit per-user memory. Benchmarks quantify an agent's ability to learn initial preferences from scratch and subsequently adapt to persona shifts.
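As a thumbnail of what "explicit per-user memory" might look like, the following hypothetical sketch keeps a per-user running estimate and updates it online from feedback; PAHF's real interfaces are not given in this digest:

```python
"""Assumed sketch of per-user memory updated online from feedback signals."""
from collections import defaultdict

class PerUserMemory:
    """Stores, per user, a running preference estimate per item category."""
    def __init__(self, lr=0.3):
        self.lr = lr
        self.store = defaultdict(dict)    # user_id -> {category: estimate}

    def update(self, user_id, category, feedback):
        """Online update from a feedback signal in [0, 1] (e.g., click / rating)."""
        prev = self.store[user_id].get(category, 0.5)
        # Exponential moving average: recent feedback dominates, which is what
        # lets the estimate track a persona shift.
        self.store[user_id][category] = (1 - self.lr) * prev + self.lr * feedback

    def preference(self, user_id, category):
        return self.store[user_id].get(category, 0.5)

mem = PerUserMemory()
mem.update("u1", "books", 1.0)    # positive feedback
mem.update("u1", "books", 0.0)    # persona shift: same user now dislikes books
print(round(mem.preference("u1", "books"), 3))  # estimate drifts toward the new signal
```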
arXiv Detail & Related papers (2026-02-18T04:18:47Z) - Enabling Personalized Long-term Interactions in LLM-based Agents through Persistent Memory and User Profiles [0.4885400580268118]
Large language models (LLMs) increasingly serve as the central control unit of AI agents. We present a framework that integrates persistent memory, dynamic coordination, self-validation, and evolving user profiles to enable personalized long-term interactions.
arXiv Detail & Related papers (2025-10-09T08:22:16Z) - STARec: An Efficient Agent Framework for Recommender Systems via Autonomous Deliberate Reasoning [54.28691219536054]
We introduce STARec, a slow-thinking augmented agent framework that endows recommender systems with autonomous deliberative reasoning capabilities. We develop anchored reinforcement training, a two-stage paradigm combining structured knowledge distillation from advanced reasoning models with preference-aligned reward shaping. Experiments on the MovieLens 1M and Amazon CDs benchmarks demonstrate that STARec achieves substantial performance gains compared with state-of-the-art baselines.
arXiv Detail & Related papers (2025-08-26T08:47:58Z) - PersonaAgent: When Large Language Model Agents Meet Personalization at Test Time [87.99027488664282]
PersonaAgent is a framework designed to address versatile personalization tasks. It integrates a personalized memory module and a personalized action module. A test-time alignment strategy keeps the agent's behavior consistent with the user's current preferences in real time.
arXiv Detail & Related papers (2025-06-06T17:29:49Z) - EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an Egocentric World? [52.99661576320663]
Multimodal large language models (MLLMs) have driven breakthroughs in egocentric vision applications. EOC-Bench is an innovative benchmark designed to systematically evaluate object-centric embodied cognition in dynamic egocentric scenarios. We conduct comprehensive evaluations of various proprietary, open-source, and object-level MLLMs based on EOC-Bench.
arXiv Detail & Related papers (2025-06-05T17:44:12Z) - What Makes LLMs Effective Sequential Recommenders? A Study on Preference Intensity and Temporal Context [56.590259941275434]
RecPO is a preference optimization framework for sequential recommendation. It exploits adaptive reward margins based on inferred preference hierarchies and temporal signals. It mirrors key characteristics of human decision-making: favoring timely satisfaction, maintaining coherent preferences, and exercising discernment under shifting contexts.
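The "adaptive reward margins" can be pictured as a DPO-style preference loss whose margin varies per example; the sketch below illustrates that generic idea and is not RecPO's published objective:

```python
"""Illustrative margin-augmented preference loss. A larger margin (e.g., for a
strongly held or recent preference) demands a larger reward gap between the
chosen and rejected items before the loss saturates near zero."""
import math

def margin_preference_loss(logp_chosen, logp_rejected,
                           ref_logp_chosen, ref_logp_rejected,
                           margin, beta=0.1):
    """Negative log-sigmoid of the implicit reward gap minus an adaptive margin."""
    reward_gap = beta * ((logp_chosen - ref_logp_chosen)
                         - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-(reward_gap - margin))))

# Same preference pair, two margins: the larger margin yields a larger loss,
# pushing the model to separate the pair more decisively.
print(round(margin_preference_loss(-1.0, -3.0, -2.0, -2.0, margin=0.5), 4))
print(round(margin_preference_loss(-1.0, -3.0, -2.0, -2.0, margin=0.0), 4))
```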
arXiv Detail & Related papers (2025-06-02T21:09:29Z) - From Rankings to Insights: Evaluation Should Shift Focus from Leaderboard to Feedback [36.68929551237421]
We introduce Feedbacker, an evaluation framework that provides comprehensive and fine-grained results. Our project homepage and dataset are available at https://liudan193.io/Feedbacker.
arXiv Detail & Related papers (2025-05-10T16:52:40Z) - On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows [71.92083784393418]
Agentic AI systems, which autonomously plan and act, are becoming widespread, yet their success rate on complex tasks remains low. Inference-time alignment relies on three components: sampling, evaluation, and feedback. We introduce Iterative Agent Decoding (IAD), a procedure that repeatedly inserts feedback extracted from different forms of critiques.
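A compact way to picture the sampling/evaluation/feedback decomposition is a loop that re-inserts critiques into the next attempt. The sketch below assumes placeholder generate() and critique() functions standing in for an LLM and a verifier; it is illustrative, not the paper's IAD implementation:

```python
"""Assumed decode-evaluate-feedback loop in the spirit of IAD."""

def generate(task, feedback_history):
    # Placeholder for an LLM call conditioned on the task and prior critiques.
    return f"plan for '{task}' (revision {len(feedback_history)})"

def critique(attempt):
    # Placeholder verifier whose score improves as revisions accumulate;
    # returns (score in [0, 1], textual feedback).
    rev = int(attempt.rsplit("revision ", 1)[1].rstrip(")"))
    score = min(1.0, 0.5 + 0.2 * rev)
    return score, "address the unmet constraint" if score < 1.0 else "ok"

def iterative_agent_decoding(task, max_iters=4, threshold=0.9):
    feedback_history = []
    best, best_score = None, -1.0
    for _ in range(max_iters):
        attempt = generate(task, feedback_history)   # sampling
        score, fb = critique(attempt)                # evaluation
        if score > best_score:
            best, best_score = attempt, score
        if score >= threshold:
            break
        feedback_history.append(fb)                  # feedback re-inserted
    return best, best_score

print(iterative_agent_decoding("book a multi-leg trip"))
```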
arXiv Detail & Related papers (2025-04-02T17:40:47Z) - Survey on Evaluation of LLM-based Agents [28.91672694491855]
The emergence of LLM-based agents represents a paradigm shift in AI. This paper provides the first comprehensive survey of evaluation methodologies for these increasingly capable agents.
arXiv Detail & Related papers (2025-03-20T17:59:23Z) - Dynamic Evaluation Framework for Personalized and Trustworthy Agents: A Multi-Session Approach to Preference Adaptability [10.443994990138973]
We argue for a paradigm shift in evaluating personalized and adaptive agents. We propose a novel framework that models user personas with unique attributes and preferences. The framework is designed to support a variety of agents and applications, enabling a comprehensive and versatile evaluation of recommendation strategies.
arXiv Detail & Related papers (2025-03-08T22:50:26Z) - Dynamic benchmarking framework for LLM-based conversational data capture [0.0]
This paper introduces a benchmarking framework to assess large language models (LLMs) in conversational data capture. It integrates generative agent simulation to evaluate performance on key dimensions: information extraction, context awareness, and adaptive engagement. Results show that adaptive strategies improve data extraction accuracy, especially when handling ambiguous responses.
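To illustrate what scoring on those dimensions could look like, here is a toy rubric; the scoring rules are invented for this sketch and are not the paper's actual metrics:

```python
"""Toy scoring of one conversational data-capture exchange on the three
dimensions named in the abstract."""

def score_exchange(required_fields, extracted,
                   asked_clarification, was_ambiguous, reasked_known_field):
    # Information extraction: fraction of required fields captured.
    extraction = sum(f in extracted for f in required_fields) / len(required_fields)
    # Context awareness: penalize re-asking for a field already captured.
    context = 0.0 if reasked_known_field else 1.0
    # Adaptive engagement: credit for clarifying exactly when the reply was ambiguous.
    engagement = 1.0 if asked_clarification == was_ambiguous else 0.0
    return {"extraction": round(extraction, 2),
            "context": context, "engagement": engagement}

# Example: the user gave an ambiguous date; the agent clarified, did not repeat
# itself, and captured two of three required fields.
print(score_exchange(
    required_fields=["name", "date", "destination"],
    extracted={"name": "Ada", "destination": "Kyoto"},
    asked_clarification=True, was_ambiguous=True, reasked_known_field=False,
))
```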
arXiv Detail & Related papers (2025-02-04T15:47:47Z) - TestAgent: Automatic Benchmarking and Exploratory Interaction for Evaluating LLMs in Vertical Domains [19.492393243160244]
Large Language Models (LLMs) are increasingly deployed in highly specialized vertical domains. Existing evaluations for vertical domains typically rely on the labor-intensive construction of static single-turn datasets. We propose TestAgent, a framework for automatic benchmarking and exploratory dynamic evaluation in vertical domains.
arXiv Detail & Related papers (2024-10-15T11:20:42Z) - Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z)