VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions
Abstract Overview
VitaBench 2.0 is a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions. It organizes tasks as temporally ordered sequences for individual users across multiple domains, with user preferences embedded in fragmented dialogue and behavior histories rather than stated explicitly. The benchmark also includes an extensible memory interface so different memory architectures can be compared under controlled conditions. Across a broad set of proprietary and open models, the study finds that current agents still struggle to infer, maintain, and apply evolving user preferences in realistic settings.
Novelty
The paper’s main novelty is a benchmark that jointly evaluates personalization, proactive information seeking, and long-term agent behavior in executable task environments rather than passive text-only settings. It also introduces a controlled memory interface to compare agentic memory and RAG-style memory for personalized decision-making over time.
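To make the idea of a controlled memory interface concrete, here is a minimal Python sketch of how two memory backends could sit behind one shared interface, so that only the memory strategy changes between runs. This is not the benchmark's actual API: the names `Interaction`, `Memory`, `FullContextMemory`, `KeywordRetrievalMemory`, and `replay` are hypothetical, the sample history is made up, and the word-overlap ranking is only a toy stand-in for RAG-style retrieval.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Interaction:
    """One dialogue turn or behavior record (hypothetical schema)."""
    step: int  # position in the user's temporal sequence
    text: str


class Memory(ABC):
    """Hypothetical shared interface so memory backends can be swapped."""

    @abstractmethod
    def write(self, item: Interaction) -> None: ...

    @abstractmethod
    def read(self, query: str, k: int = 3) -> list[Interaction]: ...


class FullContextMemory(Memory):
    """Baseline: keep everything and return the whole history in order."""

    def __init__(self) -> None:
        self.items: list[Interaction] = []

    def write(self, item: Interaction) -> None:
        self.items.append(item)

    def read(self, query: str, k: int = 3) -> list[Interaction]:
        return list(self.items)  # query and k are ignored on purpose


class KeywordRetrievalMemory(Memory):
    """Toy stand-in for RAG-style memory: rank items by word overlap."""

    def __init__(self) -> None:
        self.items: list[Interaction] = []

    def write(self, item: Interaction) -> None:
        self.items.append(item)

    def read(self, query: str, k: int = 3) -> list[Interaction]:
        q = set(query.lower().split())
        scored = [(len(q & set(it.text.lower().split())), it) for it in self.items]
        hits = [(s, it) for s, it in scored if s > 0]
        # Highest overlap first; ties go to the most recent interaction.
        hits.sort(key=lambda pair: (-pair[0], -pair[1].step))
        return [it for _, it in hits[:k]]


def replay(memory: Memory, history: list[Interaction]) -> Memory:
    """Feed a user's history to a memory backend in temporal order."""
    for item in sorted(history, key=lambda i: i.step):
        memory.write(item)
    return memory


if __name__ == "__main__":
    # Made-up history in which a preference changes over time.
    history = [
        Interaction(1, "I prefer vegetarian food"),
        Interaction(2, "Booked a hotel near the station"),
        Interaction(3, "Actually I now prefer vegan food"),
    ]
    for backend in (FullContextMemory(), KeywordRetrievalMemory()):
        recalled = replay(backend, history).read("food preference", k=1)
        print(type(backend).__name__, [i.step for i in recalled])
```

Because both backends expose the same `write` and `read` methods, the agent and task environment can stay fixed while only the memory component varies, which is the kind of controlled comparison the summary describes.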
Results
Experiments show that real-world personalization remains difficult for frontier models: performance stays limited even when the full interaction history is available. Memory mechanisms matter, but they often underperform full-context access, and enabling reasoning or thinking modes does not consistently improve personalization. The analysis further shows that proactive tasks are harder than standard personalization tasks and that preference-related failures are the dominant bottleneck.
Key Points
- VitaBench 2.0 evaluates agents on long-term, user-centric task sequences where preferences must be extracted from fragmented interactions, updated as they change, and applied to later tasks.
- The benchmark covers 56 users, over 2,000 manually curated preferences, three domains, and 66 tools, with support for both agentic and RAG-based memory settings.
- Empirical results indicate that personalization, not tool use alone, is the main limiting factor for current agents, especially under long interaction histories and proactive decision requirements.
References
- arXiv: https://arxiv.org/abs/2605.27141v1
- Fugu-MT: https://fugumt.com/fugumt/paper_check/2605.27141v1
- Hugging Face Papers: https://huggingface.co/papers/2605.27141