Fugu-MT Paper Translation (Abstract): HippoCamp: Benchmarking Contextual Agents on Personal Computers

Paper overview: HippoCamp: Benchmarking Contextual Agents on Personal Computers

arxiv url: http://arxiv.org/abs/2604.01221v1
Date: Wed, 01 Apr 2026 17:58:33 GMT
Status: Translation complete
Last updated in system: 2026-04-02 16:44:32.140057
Title: HippoCamp: Benchmarking Contextual Agents on Personal Computers
Title (reference translation): HippoCamp: A Benchmark of Contextual Agents on Personal Computers
Authors: Zhe Yang, Shulin Tian, Kairui Hu, Shuai Liu, Hoang-Nhat Nguyen, Yichi Zhang, Zujin Guo, Mengying Yu, Zinan Zhang, Jingkang Yang, Chen Change Loy, Ziwei Liu
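Abstract summary: HippoCamp is a new benchmark designed to evaluate agents' capabilities in multimodal file management. The benchmark instantiates device-scale file systems over real-world profiles spanning diverse modalities, comprising 42.4 GB of data across more than 2K real-world files.
Reference score (independently computed attention score): 71.97629614361549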
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present HippoCamp, a new benchmark designed to evaluate agents' capabilities on multimodal file management. Unlike existing agent benchmarks that focus on tasks like web interaction, tool use, or software automation in generic settings, HippoCamp evaluates agents in user-centric environments to model individual user profiles and search massive personal files for context-aware reasoning. Our benchmark instantiates device-scale file systems over real-world profiles spanning diverse modalities, comprising 42.4 GB of data across over 2K real-world files. Building upon the raw files, we construct 581 QA pairs to assess agents' capabilities in search, evidence perception, and multi-step reasoning. To facilitate fine-grained analysis, we provide 46.1K densely annotated structured trajectories for step-wise failure diagnosis. We evaluate a wide range of state-of-the-art multimodal large language models (MLLMs) and agentic methods on HippoCamp. Our comprehensive experiments reveal a significant performance gap: even the most advanced commercial models achieve only 48.3% accuracy in user profiling, struggling particularly with long-horizon retrieval and cross-modal reasoning within dense personal file systems. Furthermore, our step-wise failure diagnosis identifies multimodal perception and evidence grounding as the primary bottlenecks. Ultimately, HippoCamp exposes the critical limitations of current agents in realistic, user-centric environments and provides a robust foundation for developing next-generation personal AI assistants.
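Abstract (reference translation): We propose HippoCamp, a new benchmark for evaluating agents' capabilities in multimodal file management. Unlike existing agent benchmarks that focus on tasks such as web interaction, tool use, or software automation in generic settings, HippoCamp evaluates agents in user-centric environments, where they model individual user profiles and search massive personal files for context-aware reasoning. The benchmark instantiates device-scale file systems over real-world profiles spanning diverse modalities, comprising 42.4 GB of data across more than 2K real-world files. Building on the raw files, we construct 581 QA pairs to assess agents' capabilities in search, evidence perception, and multi-step reasoning. To facilitate fine-grained analysis, we provide 46.1K densely annotated structured trajectories for step-wise failure diagnosis. We evaluate a wide range of state-of-the-art multimodal large language models (MLLMs) and agentic methods on HippoCamp. Even the most advanced commercial models achieve only 48.3% accuracy in user profiling. Furthermore, step-wise failure diagnosis identifies multimodal perception and evidence grounding as the primary bottlenecks. Ultimately, HippoCamp exposes the limitations of current agents in realistic, user-centric environments and provides a robust foundation for developing next-generation personal AI assistants.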

Related papers