Fugu-MT Paper Translation (Abstract): Towards Sparse Video Understanding and Reasoning

Paper summary: Towards Sparse Video Understanding and Reasoning

arxiv url: http://arxiv.org/abs/2602.13602v1
Date: Sat, 14 Feb 2026 04:52:11 GMT
Status: Translation complete
Last updated in system: 2026-02-17 14:17:28.231941
Title: Towards Sparse Video Understanding and Reasoning
Title (reference translation): Towards Sparse Video Understanding and Reasoning
Authors: Chenwei Xu, Zhen Ye, Shang Wu, Weijian Li, Zihan Wang, Zhuofan Xia, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Han Liu
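Abstract summary: REVISE (Reasoning with Video Sparsity) is a multi-round agent for video question answering. It supports proprietary vision-language models in a plug-and-play setting and enables reinforcement fine-tuning of open-source models. REVISE improves accuracy while reducing frames, rounds, and prompt tokens, demonstrating practical sparse video reasoning.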
Reference score (independently computed attention score): 40.34786499758545
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present REVISE (Reasoning with Video Sparsity), a multi-round agent for video question answering (VQA). Instead of uniformly sampling frames, REVISE selects a small set of informative frames, maintains a summary-as-state across rounds, and stops early when confident. It supports proprietary vision-language models (VLMs) in a "plug-and-play" setting and enables reinforcement fine-tuning for open-source models. For fine-tuning, we introduce EAGER (Evidence-Adjusted Gain for Efficient Reasoning), an annotation-free reward with three terms: (1) Confidence gain: after new frames are added, we reward the increase in the log-odds gap between the correct option and the strongest alternative; (2) Summary sufficiency: at answer time we re-ask using only the last committed summary and reward success; (3) Correct-and-early stop: answering correctly within a small turn budget is rewarded. Across multiple VQA benchmarks, REVISE improves accuracy while reducing frames, rounds, and prompt tokens, demonstrating practical sparse video reasoning.
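Abstract (reference translation): This paper proposes REVISE (Reasoning with Video Sparsity), a multi-round agent for video question answering. Rather than sampling frames uniformly, REVISE selects a small set of informative frames, maintains a summary-as-state across rounds, and stops early once it is confident. It supports proprietary vision-language models (VLMs) in a "plug-and-play" setting and enables reinforcement fine-tuning of open-source models. For fine-tuning, the paper introduces EAGER (Evidence-Adjusted Gain for Efficient Reasoning), an annotation-free reward with three terms: (1) confidence gain: when new frames are added, the increase in the log-odds gap between the correct option and the strongest alternative is rewarded; (2) summary sufficiency: at answer time, the question is asked again using only the last committed summary, and success is rewarded; (3) correct-and-early stop: answering correctly within a small turn budget is rewarded. Across multiple VQA benchmarks, REVISE improves accuracy while reducing frames, rounds, and prompt tokens, demonstrating practical sparse video reasoning.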
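
The three reward terms named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no formulas, so the sketch assumes the "log-odds gap" is log p(correct) minus log p(strongest alternative), that the three terms are combined as a weighted sum, and that the last two terms are 0/1 indicators. All function names, the `weights` parameter, and the `EPS` constant are hypothetical.

```python
import math
from typing import Sequence, Tuple

EPS = 1e-12  # guards against log(0); a choice made for this sketch only


def log_odds_gap(probs: Sequence[float], correct: int) -> float:
    """Gap between the correct option and the strongest alternative.

    Assumption: taken as log p(correct) - log p(strongest alternative).
    """
    strongest_alt = max(p for i, p in enumerate(probs) if i != correct)
    return math.log(probs[correct] + EPS) - math.log(strongest_alt + EPS)


def confidence_gain(before: Sequence[float], after: Sequence[float], correct: int) -> float:
    """Term 1: increase in the gap after new frames are added."""
    return log_odds_gap(after, correct) - log_odds_gap(before, correct)


def summary_sufficiency(summary_only_answer: int, correct: int) -> float:
    """Term 2: 1.0 if re-asking with only the last summary gives the right answer."""
    return 1.0 if summary_only_answer == correct else 0.0


def correct_and_early(final_answer: int, correct: int, turns_used: int, turn_budget: int) -> float:
    """Term 3: 1.0 if the answer is correct and within the turn budget."""
    return 1.0 if final_answer == correct and turns_used <= turn_budget else 0.0


def eager_reward(
    before: Sequence[float],
    after: Sequence[float],
    correct: int,
    summary_only_answer: int,
    final_answer: int,
    turns_used: int,
    turn_budget: int,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),  # hypothetical weighting
) -> float:
    """Weighted sum of the three terms (the combination rule is an assumption)."""
    w_gain, w_summary, w_early = weights
    return (
        w_gain * confidence_gain(before, after, correct)
        + w_summary * summary_sufficiency(summary_only_answer, correct)
        + w_early * correct_and_early(final_answer, correct, turns_used, turn_budget)
    )


# Four answer options; option 0 is correct. New frames move the model
# from a uniform belief to favouring the correct option.
before = [0.25, 0.25, 0.25, 0.25]
after = [0.6, 0.2, 0.1, 0.1]
gain = confidence_gain(before, after, correct=0)
total = eager_reward(before, after, 0, summary_only_answer=0,
                     final_answer=0, turns_used=2, turn_budget=3)
```

In this example the gap moves from 0 (uniform belief) to log(0.6 / 0.2) = log 3, so the confidence-gain term is about 1.10, and the two indicator terms each contribute 1.0 under the assumed unit weights.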

Related papers