Fugu-MT Paper Translation (Abstract): VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

Paper overview: VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

arxiv url: http://arxiv.org/abs/2605.27141v1
Date: Tue, 26 May 2026 15:07:38 GMT
Status: Translation complete
Last updated in system: 2026-05-27 17:51:42.36353
Title: VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions
Title (reference translation): VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions
Authors: Yuxin Chen, Yi Zhang, Zhengzhou Cai, Yaorui Shi, Zhiyuan Yao, Chenhang Cui, Jingnan Zheng, Yaqi Huo, Xi Su, Qi Gu, Xunliang Cai, Xiang Wang, An Zhang, Tat-Seng Chua
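Abstract summary: We introduce VitaBench 2.0, a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions. Results show that real-world personalization remains highly challenging even for state-of-the-art models.
Reference score (independently computed attention score): 63.13827503828231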
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have evolved into interactive agents that collaborate with users in real-world tasks. Effective collaboration in such settings increasingly depends on understanding the user beyond what is explicitly stated, as user intent is often reflected in fragmented daily interactions and requires both personalized modeling and proactive interaction. However, existing agent benchmarks primarily evaluate reasoning and tool use, largely overlooking the challenges of inferring and leveraging user preferences in realistic scenarios. To address this gap, we introduce VitaBench 2.0, a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions. In VitaBench 2.0, tasks are organized as temporally ordered sequences for individual users, where preferences are embedded in fragmented and heterogeneous interactions. Successful completion of tasks requires the agent to continuously extract, utilize, and update user preferences from these interactions. We further evaluate proactiveness through tasks that require agents to recognize missing information and actively acquire it from users or environments before making decisions. To support systematic analysis, we provide an extensible memory interface that enables controlled comparison across different memory architectures. We benchmark a diverse set of frontier proprietary and open-source LLMs. Results show that real-world personalization remains highly challenging even for state-of-the-art models, revealing a substantial gap between current capabilities and practical requirements. Extensive analysis further reveals the failure modes and capability bottlenecks of current agents in real-world personalized decision-making, providing insights for future model improvements.
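Abstract (reference translation): Large language models (LLMs) have evolved into interactive agents that collaborate with users on real-world tasks. User intent is often reflected in fragmented daily interactions and requires both personalized modeling and proactive interaction. However, existing agent benchmarks primarily evaluate reasoning and tool use, largely overlooking the challenge of inferring and leveraging user preferences in realistic scenarios. To address this gap, we introduce VitaBench 2.0, a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions. In VitaBench 2.0, tasks are organized as temporally ordered sequences for individual users. Successful task completion requires the agent to continuously extract, utilize, and update user preferences from these interactions. We further evaluate proactiveness through tasks that require agents to recognize missing information and actively acquire it from users or environments before making decisions. To support systematic analysis, we provide an extensible memory interface that enables controlled comparison across different memory architectures. We benchmark a diverse set of frontier proprietary and open-source LLMs. The results show that real-world personalization remains highly challenging even for state-of-the-art models, with a substantial gap between current capabilities and practical requirements. Extensive analysis reveals the failure modes and capability bottlenecks of current agents in real-world personalized decision-making, providing insights for future model improvements.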
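
The abstract mentions an extensible memory interface for controlled comparison across memory architectures but does not show it. The Python sketch below is only an illustration of what such a pluggable interface might look like. Every name in it (`Interaction`, `MemoryBackend`, `FullHistoryMemory`, `KeywordMemory`, `replay`) is hypothetical and not taken from VitaBench 2.0, and the keyword-overlap retrieval is a stand-in chosen for brevity, not the paper's method.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Interaction:
    """One timestamped user interaction (hypothetical record type)."""
    timestamp: int
    text: str


class MemoryBackend(ABC):
    """Hypothetical pluggable memory; NOT the benchmark's actual API."""

    @abstractmethod
    def write(self, interaction: Interaction) -> None:
        """Store one interaction."""

    @abstractmethod
    def read(self, query: str, k: int = 3) -> list[Interaction]:
        """Return up to k interactions relevant to the query."""


class FullHistoryMemory(MemoryBackend):
    """Baseline: keeps everything and returns the k most recent items."""

    def __init__(self) -> None:
        self._log: list[Interaction] = []

    def write(self, interaction: Interaction) -> None:
        self._log.append(interaction)

    def read(self, query: str, k: int = 3) -> list[Interaction]:
        newest_first = sorted(self._log, key=lambda i: i.timestamp, reverse=True)
        return newest_first[:k]


class KeywordMemory(FullHistoryMemory):
    """Returns only interactions sharing a word with the query, newest first."""

    def read(self, query: str, k: int = 3) -> list[Interaction]:
        terms = set(query.lower().split())
        hits = [i for i in self._log if terms & set(i.text.lower().split())]
        hits.sort(key=lambda i: i.timestamp, reverse=True)
        return hits[:k]


def replay(memory: MemoryBackend, history: list[Interaction],
           query: str, k: int = 3) -> list[str]:
    """Feed the same temporally ordered history to any backend, then query it."""
    for interaction in sorted(history, key=lambda i: i.timestamp):
        memory.write(interaction)
    return [i.text for i in memory.read(query, k)]


history = [
    Interaction(1, "I am allergic to peanuts"),
    Interaction(2, "Book a table for Friday"),
    Interaction(3, "Actually I prefer window seats"),
]
recent = replay(FullHistoryMemory(), history, "dinner plans", k=2)
matched = replay(KeywordMemory(), history, "peanuts allergic", k=2)
```

Because both backends receive the identical history through the same `write`/`read` calls, any difference between `recent` and `matched` comes from the memory design alone, which is the kind of controlled comparison the abstract describes.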
