Fugu-MT Paper Translation (Summary): Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding

Paper summary: Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding

arxiv url: http://arxiv.org/abs/2505.18079v1
Date: Fri, 23 May 2025 16:37:36 GMT
Status: Translation complete
Last updated in system: 2025-05-26 18:08:34.223978
Title: Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding
Title (reference translation): Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding
Authors: Xiaoyi Zhang, Zhaoyang Jia, Zongyu Guo, Jiahao Li, Bin Li, Houqiang Li, Yan Lu
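Abstract summary: Long-form video understanding presents significant challenges due to its temporal-spatial complexity. We propose the Deep Video Discovery agent, which leverages an agentic search strategy over segmented video clips. Our DVD agent achieves SOTA performance, surpassing prior work by a large margin on the LVBench dataset.
Reference score (attention score computed independently by this site): 63.82450803014141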
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity and the difficulty of question answering under such extended contexts. While Large Language Models (LLMs) have demonstrated considerable advancements in video analysis capabilities and long context handling, they continue to exhibit limitations when processing information-dense hour-long videos. To overcome such limitations, we propose the Deep Video Discovery agent to leverage an agentic search strategy over segmented video clips. Different from previous video agents manually designing a rigid workflow, our approach emphasizes the autonomous nature of agents. By providing a set of search-centric tools on multi-granular video database, our DVD agent leverages the advanced reasoning capability of LLM to plan on its current observation state, strategically selects tools, formulates appropriate parameters for actions, and iteratively refines its internal reasoning in light of the gathered information. We perform comprehensive evaluation on multiple long video understanding benchmarks that demonstrates the advantage of the entire system design. Our DVD agent achieves SOTA performance, significantly surpassing prior works by a large margin on the challenging LVBench dataset. Comprehensive ablation studies and in-depth tool analyses are also provided, yielding insights to further advance intelligent agents tailored for long-form video understanding tasks. The code will be released later.
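Abstract (reference translation): Long-form video understanding poses major challenges owing to its temporal-spatial complexity and the difficulty of answering questions over such extended contexts. Large Language Models (LLMs) have made considerable progress in video analysis and long-context handling, but they remain limited when processing information-dense, hour-long videos. To overcome these limitations, we propose the Deep Video Discovery (DVD) agent, which applies an agentic search strategy over segmented video clips. Unlike previous video agents, which rely on a manually designed, rigid workflow, our approach emphasizes agent autonomy. Given a set of search-centric tools over a multi-granular video database, the DVD agent uses the LLM's advanced reasoning capability to plan from its current observation state, select tools strategically, formulate appropriate parameters for actions, and iteratively refine its internal reasoning based on the information gathered. We evaluate comprehensively on multiple long-video understanding benchmarks, demonstrating the advantage of the overall system design. The DVD agent achieves SOTA performance, surpassing prior work by a large margin on the challenging LVBench dataset. Comprehensive ablation studies and in-depth tool analyses are also provided, yielding insights for further advancing intelligent agents tailored to long-form video understanding tasks. The code will be released later.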
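
Illustrative sketch (not the authors' code, which the abstract says is not yet released): the loop the abstract describes, in which the agent plans from its current observation state, selects a tool, formulates its parameters, and iterates, might look roughly like the following. Everything here is a hypothetical stand-in: the `Clip` type, the `search_clips` and `inspect_clip` tools, and above all `toy_policy`, a fixed rule that takes the place of the LLM planner so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass
class Clip:
    """One segment of a long video (hypothetical structure)."""
    start: int      # start time in seconds
    end: int        # end time in seconds
    caption: str    # text description of the segment


def search_clips(clips: List[Clip], query: str) -> List[int]:
    """Hypothetical coarse tool: indices of clips whose caption shares a word with the query."""
    words = set(query.lower().split())
    return [i for i, c in enumerate(clips) if words & set(c.caption.lower().split())]


def inspect_clip(clips: List[Clip], index: int) -> str:
    """Hypothetical fine-grained tool: return one clip's time range and caption."""
    c = clips[index]
    return f"[{c.start}-{c.end}s] {c.caption}"


def toy_policy(question: str, observations: List[Tuple[str, Any]]) -> Tuple[str, Any]:
    """Stand-in for the LLM planner (assumption): choose the next action from the observations."""
    if not observations:
        return ("search", question)          # nothing known yet: search broadly
    last_tool, last_result = observations[-1]
    if last_tool == "search" and last_result:
        return ("inspect", last_result[0])   # look closer at the first hit
    if last_tool == "inspect":
        return ("answer", last_result)       # enough evidence gathered
    return ("answer", "not found")           # search returned nothing


def run_agent(question: str, clips: List[Clip], policy=toy_policy, max_steps: int = 5) -> str:
    """Observe, pick a tool, act, and repeat until the policy answers or the budget runs out."""
    observations: List[Tuple[str, Any]] = []
    for _ in range(max_steps):
        action, arg = policy(question, observations)
        if action == "answer":
            return arg
        if action == "search":
            result: Any = search_clips(clips, arg)
        elif action == "inspect":
            result = inspect_clip(clips, arg)
        else:
            raise ValueError(f"unknown action: {action}")
        observations.append((action, result))
    return "no answer within step budget"


clips = [
    Clip(0, 60, "a man opens the door"),
    Clip(60, 120, "a dog chases a red ball"),
    Clip(120, 180, "people eat dinner"),
]
print(run_agent("dog ball", clips))
```

In a real system the policy would be an LLM call that reads the question and the accumulated observations and returns a tool name with its arguments; the fixed rule above only shows where that decision sits in the loop.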

Related papers

Infinite Video Understanding [50.78256932424239]
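We argue that framing Infinite Video Understanding as a blue-sky research objective provides a vital north star for the multimedia community. We outline the key challenges and research directions toward achieving this transformative capability.
Paper reference translation (metadata) (2025-07-11T23:07:04Z)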
VideoAgent2: Enhancing the LLM-Based Agent System for Long-Form Video Understanding by Uncertainty-Aware CoT [31.413204839972984]
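This paper proposes a specialized chain-of-thought (CoT) process tailored for long video analysis. Our uncertainty-aware CoT effectively mitigates noise from external tools and produces more reliable outputs. We implement our approach in a system called VideoAgent2, which includes additional modules such as general context acquisition and specialized tool design.
Paper reference translation (metadata) (2025-04-06T13:03:34Z)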
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM [81.15525024145697]
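Video Large Language Models (Video LLMs) have recently shown remarkable capabilities in general video understanding. However, they focus mainly on holistic comprehension and struggle to capture fine-grained spatial and temporal details. We introduce the VideoRefer Suite to enable Video LLMs to perform finer-grained spatial-temporal video understanding.
Paper reference translation (metadata) (2024-12-31T18:56:46Z)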
SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis [52.050036778325094]
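This paper introduces SALOVA: Segment-Augmented Long Video Assistant. A high-quality collection of 87.8K videos is densely captioned at the segment level, capturing scene continuity and preserving rich context. The framework mitigates the limitations of current video LMMs by enabling precise identification and retrieval of relevant video segments in response to queries.
Paper reference translation (metadata) (2024-11-25T08:04:47Z)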
OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer [14.503628667535425]
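Processing extensive videos poses significant challenges because of the vast amount of data and the processing demands. We develop OmAgent, which efficiently stores and retrieves the video frames relevant to a specific query. It features a Divide-and-Conquer Loop capable of autonomous reasoning. With a higher degree of autonomy and a robust tool-call system, it can accomplish even more complex tasks.
Paper reference translation (metadata) (2024-06-24T13:05:39Z)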
How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs [98.37571997794072]
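We introduce the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES). CVRR-ES comprehensively evaluates the performance of Video-LMMs across 11 diverse real-world video dimensions. Our findings provide valuable insights for building the next generation of human-centric AI systems.
Paper reference translation (metadata) (2024-05-06T17:59:45Z)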
MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding [69.04413943858584]
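We introduce MoVQA, a question-answering dataset for long-form movies. We also benchmark the diverse cognitive capabilities of multimodal systems.
Paper reference translation (metadata) (2023-12-08T03:33:38Z)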
Query-aware Long Video Localization and Relation Discrimination for Deep Video Understanding [15.697251303126874]
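The Deep Video Understanding (DVU) Challenge aims to push the boundaries of multimodal extraction, fusion, and analysis. This paper proposes a query-aware method that uses image-language pretrained models for long video localization and relation discrimination. The method achieved first and fourth place in two groups of movie-level queries.
Paper reference translation (metadata) (2023-10-19T13:26:02Z)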

The related papers list is generated automatically from the titles and abstracts of papers on this site.

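This page shows information on the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any results arising from its use.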