Fugu-MT Paper Translation (Abstract): KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation

Paper abstract: KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation

arxiv url: http://arxiv.org/abs/2604.08455v1
Date: Thu, 09 Apr 2026 16:50:50 GMT
Status: Translation complete
Last updated in system: 2026-04-10 18:34:06.031679
Title: KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation
Title (reference translation): KnowU-Bench: Toward Interactive, Proactive, and Personalized Mobile Agent Evaluation
Authors: Tongbo Chen, Zhengxi Lu, Zhan Xu, Guocheng Shao, Shaohan Zhao, Fei Tang, Yong Du, Kaitao Song, Yizhou Liu, Yuchen Yan, Wenqi Zhang, Xu Tan, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
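Abstract summary: KnowU-Bench is an online benchmark for personalized mobile agents. It covers 42 general GUI tasks, 86 personalized tasks, and 64 proactive tasks. Agents that excel at explicit task execution fall below 50% under vague instructions.
Reference score (independently computed attention score): 72.01173512175531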
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Personalized mobile agents that infer user preferences and calibrate proactive assistance hold great promise as everyday digital assistants, yet existing benchmarks fail to capture what this requires. Prior work evaluates preference recovery from static histories or intent prediction from fixed contexts. Neither tests whether an agent can elicit missing preferences through interaction, nor whether it can decide when to intervene, seek consent, or remain silent in a live GUI environment. We introduce KnowU-Bench, an online benchmark for personalized mobile agents built on a reproducible Android emulation environment, covering 42 general GUI tasks, 86 personalized tasks, and 64 proactive tasks. Unlike prior work that treats user preferences as static context, KnowU-Bench hides the user profile from the agent and exposes only behavioral logs, forcing genuine preference inference rather than context lookup. To support multi-turn preference elicitation, it instantiates an LLM-driven user simulator grounded in structured profiles, enabling realistic clarification dialogues and proactive consent handling. Beyond personalization, KnowU-Bench provides comprehensive evaluation of the complete proactive decision chain, including grounded GUI execution, consent negotiation, and post-rejection restraint, evaluated through a hybrid protocol combining rule-based verification with LLM-as-a-Judge scoring. Our experiments reveal a striking degradation: agents that excel at explicit task execution fall below 50% under vague instructions requiring user preference inference or intervention calibration, even for frontier models like Claude Sonnet 4.6. The core bottlenecks are not GUI navigation but preference acquisition and intervention calibration, exposing a fundamental gap between competent interface operation and trustworthy personal assistance.
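Abstract (reference translation): Personalized mobile agents that infer user preferences and calibrate proactive assistance hold great promise as everyday digital assistants. Prior work has evaluated preference recovery from static histories or intent prediction from fixed contexts. It does not test whether an agent can elicit missing preferences through interaction, or whether it can decide when to intervene, seek consent, or remain silent in a live GUI environment. We introduce KnowU-Bench, an online benchmark for personalized mobile agents built on a reproducible Android emulation environment, covering 42 general GUI tasks, 86 personalized tasks, and 64 proactive tasks. Unlike prior work that treats user preferences as static context, KnowU-Bench hides the user profile from the agent and exposes only behavioral logs. To support multi-turn preference elicitation, it instantiates an LLM-driven user simulator grounded in structured profiles, enabling realistic clarification dialogues and proactive consent handling. Beyond personalization, it provides comprehensive evaluation of the complete proactive decision chain, including GUI execution, consent negotiation, and post-rejection restraint, assessed through a hybrid protocol that combines rule-based verification with LLM-as-a-Judge scoring. Even for frontier models such as Claude Sonnet 4.6, agents that excel at explicit task execution fall below 50% under vague instructions that require user preference inference or intervention calibration. The main bottlenecks are not GUI navigation but preference acquisition and intervention calibration, exposing a fundamental gap between competent interface operation and trustworthy personal assistance.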
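The abstract describes the hybrid evaluation protocol only at a high level. As a minimal sketch of how rule-based verification and LLM-as-a-Judge scoring might be combined, the Python below assumes a hard gate on the rule check followed by a thresholded judge score. The names `EpisodeResult`, `hybrid_pass`, `alarm_is_set`, and `stub_judge`, the 0.5 threshold, and the gating rule itself are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EpisodeResult:
    """Hypothetical record of one agent episode (not the paper's schema)."""
    final_state: dict  # observable device/app state after the episode
    transcript: str    # agent-user dialogue, e.g. clarification or consent turns


def hybrid_pass(
    result: EpisodeResult,
    rule_check: Callable[[dict], bool],
    judge: Callable[[str], float],
    judge_threshold: float = 0.5,  # assumed cutoff, not from the paper
) -> bool:
    """Pass only if the rule check holds AND the judge score clears the threshold."""
    if not rule_check(result.final_state):
        return False  # objective failure: the judge is never consulted
    return judge(result.transcript) >= judge_threshold


# Toy stand-ins for illustration; a real judge would query an LLM.
def alarm_is_set(state: dict) -> bool:
    return state.get("alarm") == "07:00"


def stub_judge(transcript: str) -> float:
    return 1.0 if "asked for consent" in transcript else 0.0


ok = EpisodeResult({"alarm": "07:00"}, "agent asked for consent; user agreed")
bad = EpisodeResult({"alarm": "07:00"}, "agent acted silently")
print(hybrid_pass(ok, alarm_is_set, stub_judge))   # True
print(hybrid_pass(bad, alarm_is_set, stub_judge))  # False
```

In this sketch the second episode reaches the correct device state but still fails, because the stub judge penalizes acting without consent. This mirrors the abstract's point that GUI execution alone does not establish trustworthy proactive behavior.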

Related papers