Fugu-MT Paper Translation (Summary): ProAct: A Benchmark and Multimodal Framework for Structure-Aware Proactive Response

Paper summary: ProAct: A Benchmark and Multimodal Framework for Structure-Aware Proactive Response

arxiv url: http://arxiv.org/abs/2602.03430v2
Date: Wed, 04 Feb 2026 03:41:04 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:41.136248
Title: ProAct: A Benchmark and Multimodal Framework for Structure-Aware Proactive Response
Authors: Xiaomeng Zhu, Fengming Zhu, Weijie Zhou, Ye Tian, Zhenlin Hu, Yufei Huang, Yuchun Guo, Xinyu Wu, Zhengyou Zhang, Fangzhen Lin, Xuantang Xiong
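Abstract summary: ProAct-75 is a benchmark designed for training and evaluating proactive agents across diverse domains. Our dataset features 91,581 step-level annotations enriched with explicit task graphs. We propose ProAct-Helper, a reference baseline powered by a Multimodal Large Language Model (MLLM).
Reference score (independently computed attention score): 20.913342340957904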
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While passive agents merely follow instructions, proactive agents align with higher-level objectives, such as assistance and safety, by continuously monitoring the environment to determine when and how to act. However, developing proactive agents is hindered by the lack of specialized resources. To address this, we introduce ProAct-75, a benchmark designed to train and evaluate proactive agents across diverse domains, including assistance, maintenance, and safety monitoring. Spanning 75 tasks, our dataset features 91,581 step-level annotations enriched with explicit task graphs. These graphs encode step dependencies and parallel execution possibilities, providing the structural grounding necessary for complex decision-making. Building on this benchmark, we propose ProAct-Helper, a reference baseline powered by a Multimodal Large Language Model (MLLM) that grounds decision-making in state detection and leverages task graphs to enable entropy-driven heuristic search for action selection, allowing agents to execute parallel threads independently rather than mirroring the human's next step. Extensive experiments demonstrate that ProAct-Helper outperforms strong closed-source models, improving trigger detection mF1 by 6.21%, saving 0.25 more steps in online one-step decision-making, and increasing the rate of parallel actions by 15.58%.
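
The abstract says the task graphs encode step dependencies and parallel execution possibilities. As a rough illustration only, the sketch below shows one way such a graph could be represented and queried in Python. The step names, the `TASK_GRAPH` layout, and the helper functions `ready_steps` and `parallel_candidates` are all hypothetical and are not taken from ProAct-75 or ProAct-Helper; the entropy-driven heuristic search mentioned in the abstract is not reproduced here.

```python
from typing import Dict, Set

# Hypothetical task graph (not from the ProAct-75 dataset):
# each step maps to the set of steps that must be finished first.
TASK_GRAPH: Dict[str, Set[str]] = {
    "boil_water": set(),
    "get_cup": set(),
    "add_teabag": {"get_cup"},
    "pour_water": {"boil_water", "add_teabag"},
}


def ready_steps(graph: Dict[str, Set[str]], done: Set[str]) -> Set[str]:
    """Return the unfinished steps whose prerequisites are all completed."""
    return {
        step
        for step, prereqs in graph.items()
        if step not in done and prereqs <= done
    }


def parallel_candidates(
    graph: Dict[str, Set[str]], done: Set[str], human_step: str
) -> Set[str]:
    """Return ready steps other than the one the human is working on.

    These are steps a helper could start on its own thread instead of
    mirroring the human's next step.
    """
    return ready_steps(graph, done) - {human_step}


if __name__ == "__main__":
    # The human starts boiling water; the helper can fetch the cup in parallel.
    print(sorted(parallel_candidates(TASK_GRAPH, set(), "boil_water")))
    # prints ['get_cup']
```

In this toy graph, `get_cup` does not depend on `boil_water`, so it is available as a parallel action. Once only `add_teabag` is ready and the human is doing it, the candidate set is empty and the helper would wait.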

Related papers