Fugu-MT Paper Translation (Abstract): video-SALMONN-R$^3$: Learning to ReWatch, ReAsk, and ReAnswer for Efficient Video Understanding

Paper overview: video-SALMONN-R$^3$: Learning to ReWatch, ReAsk, and ReAnswer for Efficient Video Understanding

arxiv url: http://arxiv.org/abs/2606.24477v1
Date: Tue, 23 Jun 2026 12:13:19 GMT
Status: Translation complete
Last updated in system: 2026-06-24 22:16:48.938111
Title: video-SALMONN-R$^3$: Learning to ReWatch, ReAsk, and ReAnswer for Efficient Video Understanding
Title (reference translation): video-SALMONN-R$^3$: Learning to ReWatch, ReAsk, and ReAnswer for Efficient Video Understanding
Authors: Yixuan Li, Guangzhi Sun, Yudong Yang, Wei Li, Zejun MA, Chao Zhang
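Abstract summary: Video large language models (LLMs) are often constrained by computation and memory budgets. video-SALMONN-R$^3$ is the first end-to-end video-LLM that enables re-watching through reinforcement learning.
Reference score (independently computed attention metric): 49.5110151334134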
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video large language models (LLMs) are often constrained by computation and memory budgets, leading them to use reduced frame rates and spatial resolutions, which may cause them to miss critical information for question answering (QA). A practical and efficient solution is a two-stage paradigm: first perform coarse video understanding to localize relevant segments, and then re-watch these segments at higher temporal or spatial fidelity. In this paper, we present video-SALMONN-R$^3$, the first end-to-end video-LLM that enables re-watch through reinforcement learning without relying on chain-of-thought (CoT) cold-start. This design removes the need for costly CoT data annotations and avoids CoT-based supervised fine-tuning (SFT), which can otherwise degrade the pretrained video understanding abilities. To address the mismatch between the reasoning-first behavior induced by re-watch and the answer-first tendency of pretrained video-LLMs, we propose a re-answer strategy, in which the model first produces a direct answer in the first watch and then refines it after re-watching. Finally, to improve question adherence during re-watching, we propose a re-ask mechanism that re-injects the query when revisiting localized segments. Experimental results show that video-SALMONN-R$^3$ consistently outperforms both the base model and the QA-SFT baseline, while surpassing prior re-watch-based approaches with significantly lower computational cost. Code, models, and data will be publicly released upon acceptance.
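Abstract (reference translation): Video large language models (LLMs) are often constrained by computation and memory budgets, so they reduce frame rates and spatial resolutions and may miss information that is critical for question answering (QA). A practical and efficient solution is a two-stage paradigm: first perform coarse video understanding to localize the relevant segments, then re-watch those segments at higher temporal or spatial fidelity. In this paper, we propose video-SALMONN-R$^3$, the first end-to-end video-LLM that enables re-watch through reinforcement learning without relying on a chain-of-thought (CoT) cold-start. This design removes the need for costly CoT data annotations and avoids CoT-based supervised fine-tuning (SFT). To address the mismatch between the reasoning-first behavior induced by re-watch and the answer-first tendency of pretrained video-LLMs, we propose a re-answer strategy in which the model first produces a direct answer in the first watch and then refines it after re-watching. Finally, to improve question adherence during re-watching, we propose a re-ask mechanism that re-injects the query when revisiting localized segments. Experimental results show that video-SALMONN-R$^3$ consistently outperforms both the base model and the QA-SFT baseline, while surpassing prior re-watch-based approaches at significantly lower computational cost. Code, models, and data will be released upon acceptance.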
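
The control flow described in the abstract (a coarse first watch, segment localization, a higher-fidelity re-watch, re-asking the query, and re-answering) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `FirstPass` container, the callable model interface, and every frame-rate and resolution value are hypothetical placeholders chosen for illustration, and frames are represented only as (timestamp, resolution) pairs.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Placeholder for a decoded frame: (timestamp in seconds, resolution in pixels).
Frame = Tuple[float, int]


@dataclass
class FirstPass:
    """Hypothetical output of the first watch: a direct answer plus a segment."""
    answer: str
    segment: Tuple[float, float]  # (start, end) in seconds


def sample_frames(duration: float, fps: float, resolution: int,
                  start: float = 0.0, end: Optional[float] = None) -> List[Frame]:
    """Return evenly spaced frame placeholders covering [start, end)."""
    end = duration if end is None else min(end, duration)
    start = max(0.0, start)
    count = max(0, int(round((end - start) * fps)))
    return [(start + i / fps, resolution) for i in range(count)]


def answer_with_rewatch(
    duration: float,
    question: str,
    first_watch: Callable[[List[Frame], str], FirstPass],
    re_watch: Callable[[List[Frame], str, str], str],
    coarse_fps: float = 0.5,   # assumed value, not from the paper
    coarse_res: int = 224,     # assumed value, not from the paper
    fine_fps: float = 4.0,     # assumed value, not from the paper
    fine_res: int = 448,       # assumed value, not from the paper
) -> Tuple[str, str, int, int]:
    # Stage 1: coarse watch over the whole video at low frame rate and resolution.
    coarse = sample_frames(duration, coarse_fps, coarse_res)
    first = first_watch(coarse, question)

    # Stage 2: re-watch only the localized segment at higher fidelity.
    seg_start, seg_end = first.segment
    fine = sample_frames(duration, fine_fps, fine_res, seg_start, seg_end)

    # Re-ask: the question is supplied again alongside the re-watched frames.
    # Re-answer: the first answer is passed in so it can be refined.
    final = re_watch(fine, question, first.answer)
    return first.answer, final, len(coarse), len(fine)
```

In this sketch the two callables stand in for a single model invoked twice. The point is only that the second call sees far fewer seconds of video at higher fidelity, together with the repeated question and the first answer.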

Related papers