FuguReport

VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

Authors: Yuxin Chen, Yi Zhang, Zhengzhou Cai, Yaorui Shi, Zhiyuan Yao, Chenhang Cui, Jingnan Zheng, Yaqi Huo, Xi Su, Qi Gu, Xunliang Cai, Xiang Wang, An Zhang, Tat-Seng Chua
Affiliations: University of Science and Technology of China / Meituan / National University of Singapore / Beijing University of Posts and Telecommunications / Zhejiang University
Categories: Evaluation / Agent Evaluation / Evaluation of proactive personalized agents, Task / User Interaction / Long-term interaction tasks, Application / Personalization / Real-world personalization challenges
License: CC BY 4.0

Abstract Overview

VitaBench 2.0 is a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions. It organizes tasks as temporally ordered sequences for individual users across multiple domains, with user preferences embedded in fragmented dialogue and behavior histories rather than stated explicitly. The benchmark also includes an extensible memory interface so different memory architectures can be compared under controlled conditions. Across a broad set of proprietary and open models, the study finds that current agents still struggle to infer, maintain, and apply evolving user preferences in realistic settings.

Novelty

The paper’s main novelty is a benchmark that jointly evaluates personalization, proactive information seeking, and long-term agent behavior in executable task environments rather than passive text-only settings. It also introduces a controlled memory interface to compare agentic memory and RAG-style memory for personalized decision-making over time.
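
To make the idea of a controlled memory interface concrete, here is a minimal Python sketch. It is not the benchmark's actual API: the names MemoryBackend, FullContextMemory, TopKOverlapMemory, and build_context are hypothetical, and the word-overlap retriever is only a simple stand-in for RAG-style retrieval. The point it illustrates is that the same history and query can be passed through interchangeable memory backends, so only the memory design varies.

```python
from abc import ABC, abstractmethod


class MemoryBackend(ABC):
    """Hypothetical interface; an assumption, not VitaBench 2.0's real API."""

    @abstractmethod
    def write(self, event: str) -> None:
        """Store one interaction event."""

    @abstractmethod
    def read(self, query: str) -> list[str]:
        """Return the events the agent would see for this query."""


class FullContextMemory(MemoryBackend):
    """Baseline: the agent sees the entire history, in order."""

    def __init__(self) -> None:
        self.events: list[str] = []

    def write(self, event: str) -> None:
        self.events.append(event)

    def read(self, query: str) -> list[str]:
        return list(self.events)


class TopKOverlapMemory(MemoryBackend):
    """Toy retrieval memory: ranks events by shared words with the query."""

    def __init__(self, k: int = 2) -> None:
        self.events: list[str] = []
        self.k = k

    def write(self, event: str) -> None:
        self.events.append(event)

    def read(self, query: str) -> list[str]:
        query_words = set(query.lower().split())
        scored = [
            (len(query_words & set(event.lower().split())), index, event)
            for index, event in enumerate(self.events)
        ]
        # Keep only events that share a word; best score first, ties by age.
        hits = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
        return [event for _, _, event in hits[: self.k]]


def build_context(memory: MemoryBackend, history: list[str], query: str) -> list[str]:
    """Feed the same history to any backend, then read back for the query."""
    for event in history:
        memory.write(event)
    return memory.read(query)
```

Because both backends receive identical inputs through build_context, any difference in what the agent sees comes from the memory design alone, which is the kind of controlled comparison the summary describes.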

Results

Experiments show that real-world personalization remains difficult even for frontier models: performance stays limited even when the full interaction history is available. Memory mechanisms matter, but memory-based settings often underperform full-context access, and enabling reasoning or thinking modes does not consistently improve personalization. The analysis further shows that proactive tasks are harder than standard personalization tasks and that preference-related failures are the dominant bottleneck.

Key Points

  1. VitaBench 2.0 evaluates agents on long-term, user-centric task sequences where preferences must be extracted, updated, and used from fragmented interactions.
  2. The benchmark covers 56 users, over 2,000 manually curated preferences, three domains, and 66 tools, with support for both agentic and RAG-based memory settings.
  3. Empirical results indicate that personalization, not tool use alone, is the main limiting factor for current agents, especially under long interaction histories and proactive decision requirements.
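
The first key point, that preferences must be extracted, updated, and used over a temporally ordered sequence, can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the Observation record, its fields, and the "latest observation wins" update rule are hypothetical and are not taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Hypothetical record of a preference signal seen at a given step."""
    step: int
    key: str
    value: str


def current_preferences(observations: list[Observation], up_to_step: int) -> dict[str, str]:
    """Replay observations in time order; later values overwrite earlier ones.

    Assumption: the most recent signal for a key is the one that applies.
    """
    preferences: dict[str, str] = {}
    for obs in sorted(observations, key=lambda o: o.step):
        if obs.step <= up_to_step:
            preferences[obs.key] = obs.value
    return preferences
```

For example, a user who is observed preferring "hot" at step 1 and "mild" at step 3 should be treated as preferring "hot" for a task at step 2 and "mild" for a task at step 3; an agent that ignores ordering would apply a stale preference.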
