Fugu-MT Paper Translation (Summary): CAViAR: Critic-Augmented Video Agentic Reasoning

Paper summary: CAViAR: Critic-Augmented Video Agentic Reasoning

arxiv url: http://arxiv.org/abs/2509.07680v1
Date: Tue, 09 Sep 2025 17:59:39 GMT
Status: translation complete
Last updated in system: 2025-09-10 14:38:27.313071
Title: CAViAR: Critic-Augmented Video Agentic Reasoning
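Authors: Sachit Menon, Ahmet Iscen, Arsha Nagrani, Tobias Weyand, Carl Vondrick, Cordelia Schmid
Abstract summary: Can existing perception capabilities be leveraged to perform more complex video reasoning? We develop a large language model agent that can use video modules as subagents or tools. We show that the combination of our agent and a critic achieves strong performance on LVBench, Neptune, and ActivityNet-RTL.
Reference score (independently computed attention metric): 90.48729440775223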
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video understanding has seen significant progress in recent years, with models' performance on perception from short clips continuing to rise. Yet, multiple recent benchmarks, such as LVBench, Neptune, and ActivityNet-RTL, show that performance wanes for tasks requiring complex reasoning on videos as queries grow more complex and videos grow longer. In this work, we ask: can existing perception capabilities be leveraged to successfully perform more complex video reasoning? In particular, we develop a large language model agent given access to video modules as subagents or tools. Rather than following a fixed procedure to solve queries as in previous work such as Visual Programming, ViperGPT, and MoReVQA, the agent uses the results of each call to a module to determine subsequent steps. Inspired by work in the textual reasoning domain, we introduce a critic to distinguish between instances of successful and unsuccessful sequences from the agent. We show that the combination of our agent and critic achieves strong performance on the previously mentioned datasets.
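The abstract describes two ideas: an agent that chooses each next step from the results of earlier module calls, and a critic that separates successful from unsuccessful sequences. The toy sketch below illustrates that control flow only and is not the authors' implementation. Everything in it is a hypothetical stand-in: the "video" is a dictionary of captions, `find_segment` and `describe` replace real video modules, `toy_policy` replaces the large language model, and `toy_critic` is a hand-written heuristic used to pick among candidate sequences (how the paper actually applies its critic is not stated in the abstract).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Video = Dict[str, str]  # toy "video": segment name -> caption text (assumption)


def find_segment(video: Video, keyword: str) -> str:
    """Hypothetical module: name of the first segment whose caption has the keyword."""
    for name, caption in video.items():
        if keyword in caption:
            return name
    return ""


def describe(video: Video, segment: str) -> str:
    """Hypothetical module: caption of a segment, or an empty string if unknown."""
    return video.get(segment, "")


TOOLS: Dict[str, Callable[[Video, str], str]] = {
    "find_segment": find_segment,
    "describe": describe,
}


@dataclass
class Trajectory:
    steps: List[Tuple[str, str, str]] = field(default_factory=list)  # (tool, arg, result)
    answer: str = ""


def toy_policy(keyword: str, traj: Trajectory) -> Tuple[str, str]:
    """Choose the next action from the results so far (an LLM would do this in practice)."""
    if not traj.steps:
        return ("find_segment", keyword)
    tool, _, result = traj.steps[-1]
    if tool == "find_segment" and result:
        return ("describe", result)  # the next step depends on the previous result
    return ("answer", result)


def run_agent(video: Video, keyword: str, max_steps: int = 5) -> Trajectory:
    """Run the policy until it answers or the step budget is exhausted."""
    traj = Trajectory()
    for _ in range(max_steps):
        action, arg = toy_policy(keyword, traj)
        if action == "answer":
            traj.answer = arg
            break
        traj.steps.append((action, arg, TOOLS[action](video, arg)))
    return traj


def toy_critic(traj: Trajectory) -> float:
    """Heuristic stand-in for a critic: higher score means more likely successful."""
    if not traj.answer:
        return 0.0
    return 1.0 + 0.1 * len(traj.steps)


def answer_with_critic(video: Video, keywords: List[str]) -> Trajectory:
    """Generate one candidate sequence per keyword and keep the critic's favorite."""
    candidates = [run_agent(video, k) for k in keywords]
    return max(candidates, key=toy_critic)
```

For example, with `video = {"seg0": "a man opens the door", "seg1": "a dog runs across the yard"}`, calling `answer_with_critic(video, ["cat", "dog"])` produces one sequence that finds nothing and one that locates and describes `seg1`; the toy critic scores the second higher and returns it.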


Related papers