Fugu-MT Paper Translation (Abstract): HiCrew: Hierarchical Reasoning for Long-Form Video Understanding via Question-Aware Multi-Agent Collaboration

Paper overview: HiCrew: Hierarchical Reasoning for Long-Form Video Understanding via Question-Aware Multi-Agent Collaboration

arxiv url: http://arxiv.org/abs/2604.21444v1
Date: Thu, 23 Apr 2026 09:04:32 GMT
Status: Translation complete
Last updated in system: 2026-04-24 14:40:06.396556
Title: HiCrew: Hierarchical Reasoning for Long-Form Video Understanding via Question-Aware Multi-Agent Collaboration
Title (reference translation): HiCrew: Hierarchical Reasoning for Long-Form Video Understanding via Question-Aware Multi-Agent Collaboration
Authors: Yuehan Zhu, Jingqi Zhao, Jiawen Zhao, Xudong Mao, Baoquan Zhao
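Abstract summary: We introduce HiCrew, a hierarchical multi-agent framework that addresses these limitations through three core contributions. First, we propose a Hybrid Tree structure that preserves temporal topology while performing relevance-guided hierarchical clustering. Second, we develop a Question-Aware Captioning mechanism that synthesizes intent-driven visual prompts to generate semantic descriptions.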
Reference score (independently computed attention score): 9.907651803712803
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Long-form video understanding remains fundamentally challenged by pervasive spatiotemporal redundancy and intricate narrative dependencies that span extended temporal horizons. While recent structured representations compress visual information effectively, they frequently sacrifice temporal coherence, which is critical for causal reasoning. Meanwhile, existing multi-agent frameworks operate through rigid, pre-defined workflows that fail to adapt their reasoning strategies to question-specific demands. In this paper, we introduce HiCrew, a hierarchical multi-agent framework that addresses these limitations through three core contributions. First, we propose a Hybrid Tree structure that leverages shot boundary detection to preserve temporal topology while performing relevance-guided hierarchical clustering within semantically coherent segments. Second, we develop a Question-Aware Captioning mechanism that synthesizes intent-driven visual prompts to generate precision-oriented semantic descriptions. Third, we integrate a Planning Layer that dynamically orchestrates agent collaboration by adaptively selecting roles and execution paths based on question complexity. Extensive experiments on EgoSchema and NExT-QA validate the effectiveness of our approach, demonstrating strong performance across diverse question types with particularly pronounced gains in temporal and causal reasoning tasks that benefit from our hierarchical structure-preserving design.
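Abstract (reference translation): Long-form video understanding remains fundamentally challenging because of pervasive spatiotemporal redundancy and complex narrative dependencies that span long temporal horizons. Recent structured representations compress visual information effectively, but they often sacrifice the temporal coherence that is critical for causal reasoning. Meanwhile, existing multi-agent frameworks operate through rigid, predefined workflows and cannot adapt their reasoning strategies to question-specific demands. In this paper, we introduce HiCrew, a hierarchical multi-agent framework that addresses these limitations through three core contributions. First, we propose a Hybrid Tree structure that uses shot boundary detection to preserve temporal topology and performs relevance-guided hierarchical clustering within semantically coherent segments. Second, we develop a Question-Aware Captioning mechanism that synthesizes intent-driven visual prompts to generate precision-oriented semantic descriptions. Third, we integrate a Planning Layer that dynamically orchestrates agent collaboration by adaptively selecting roles and execution paths based on question complexity. Extensive experiments on EgoSchema and NExT-QA show strong performance across diverse question types, particularly on the temporal and causal reasoning tasks that benefit from the hierarchical, structure-preserving design.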
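
The Hybrid Tree idea described in the abstract (cut the video at shot boundaries first, then refine only inside each shot according to question relevance) can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives no algorithmic details, so every name here (`Node`, `split_into_shots`, `build_subtree`, `build_hybrid_tree`), the cosine-similarity cut rule, the thresholds, and the use of a simple binary temporal split in place of the paper's hierarchical clustering are assumptions made purely for illustration. Frame features and the question embedding are assumed to be plain lists of floats.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class Node:
    start: int  # index of the first frame covered (inclusive)
    end: int    # index of the last frame covered (inclusive)
    children: list = field(default_factory=list)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def split_into_shots(feats, cut_threshold=0.5):
    """Hypothetical shot boundary detector (an assumption, not the paper's):
    start a new shot wherever adjacent frames are less similar than cut_threshold."""
    if not feats:
        return []
    shots, start = [], 0
    for i in range(1, len(feats)):
        if cosine(feats[i - 1], feats[i]) < cut_threshold:
            shots.append((start, i - 1))
            start = i
    shots.append((start, len(feats) - 1))
    return shots


def build_subtree(start, end, relevance, min_rel=0.5, max_leaf=2):
    """Refine a segment only if it looks relevant to the question.
    A binary temporal split stands in for hierarchical clustering here."""
    node = Node(start, end)
    length = end - start + 1
    segment_relevance = max(relevance[start:end + 1])
    if length > max_leaf and segment_relevance >= min_rel:
        mid = (start + end) // 2
        node.children = [
            build_subtree(start, mid, relevance, min_rel, max_leaf),
            build_subtree(mid + 1, end, relevance, min_rel, max_leaf),
        ]
    return node


def build_hybrid_tree(feats, question, cut_threshold=0.5, min_rel=0.5, max_leaf=2):
    """Root -> one child per shot (temporal order kept) -> relevance-guided refinement."""
    relevance = [cosine(f, question) for f in feats]
    root = Node(0, len(feats) - 1)
    for start, end in split_into_shots(feats, cut_threshold):
        root.children.append(build_subtree(start, end, relevance, min_rel, max_leaf))
    return root
```

Because shots are appended in temporal order and never merged across a boundary, the children of the root preserve the video's timeline, while only question-relevant shots receive deeper subtrees. Under these assumptions, that is the property the abstract attributes to the structure-preserving design.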

Related papers