Fugu-MT Paper Translation (Summary): FrameThinker: Learning to Think with Long Videos via Multi-Turn Frame Spotlighting

Paper summary: FrameThinker: Learning to Think with Long Videos via Multi-Turn Frame Spotlighting

arxiv url: http://arxiv.org/abs/2509.24304v2
Date: Tue, 30 Sep 2025 01:55:37 GMT
Status: Translation complete
System update date: 2025-10-01 12:20:10.412652
Title: FrameThinker: Learning to Think with Long Videos via Multi-Turn Frame Spotlighting
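Title (reference translation): FrameThinker: Learning to Think with Long Videos Using Multi-Turn Frame Spotlighting
Authors: Zefeng He, Xiaoye Qu, Yafu Li, Siyuan Huang, Daizong Liu, Yu Cheng
Abstract summary: This paper introduces the concept of thinking with long videos and proposes a new framework, FrameThinker. FrameThinker achieves a significant average improvement of +10.4% over baselines while drastically reducing the number of processed frames. Most notably, the 7B FrameThinker model establishes a new state of the art on LongVideo-Reason, reaching 76.1% accuracy with an average of 20.6 frames.
Reference score (independently computed attention score): 62.25888935329454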
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: While Large Vision-Language Models (LVLMs) have achieved substantial progress in video understanding, their application to long video reasoning is hindered by uniform frame sampling and static textual reasoning, which are inefficient and struggle to handle visually intensive video tasks. To overcome these challenges, in this paper, we introduce the concept of thinking with long videos and propose a novel framework, FrameThinker. Within this framework, LVLMs are able to iteratively interrogate video content. Developing such video reasoning capabilities in LVLMs presents notable challenges, particularly in adapting the model to new video actions (e.g., select frame) and in designing reward functions that guide LVLMs to adopt the newly introduced actions. To solve these challenges, we propose a two-phase training strategy, first employing Supervised Fine-Tuning (SFT) to instill fundamental action capabilities, followed by Reinforcement Learning (RL) to optimize a strategic decision-making policy. Notably, in this RL phase, we conduct an in-depth and comprehensive exploration of the reward design for each action and of the format reward. Extensive experiments on reasoning benchmarks such as Video-Holmes and LongVideo-Reason, and on long-video understanding benchmarks such as LongVideoBench, MLVU, VideoMME, and LVBench, demonstrate that FrameThinker achieves a significant average improvement of +10.4% over baselines while drastically reducing the number of processed frames. Most notably, our 7B model, FrameThinker, establishes a new state of the art on LongVideo-Reason, achieving 76.1% accuracy using an average of only 20.6 frames. This not only outperforms the competitive LongVILA-R1 (72.0%) but does so with over 20x fewer frames (vs. 512), demonstrating unparalleled efficiency and effectiveness.
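Abstract (reference translation): Large Vision-Language Models (LVLMs) have made substantial progress in video understanding, but their application to long video reasoning is hindered by uniform frame sampling and static textual reasoning. To overcome these challenges, this paper introduces the concept of thinking with long videos and proposes a new framework, FrameThinker. Within this framework, LVLMs can iteratively interrogate video content. Developing such video reasoning capabilities in LVLMs poses notable challenges, particularly in adapting the model to new video actions (e.g., select frame) and in designing reward functions that guide LVLMs to adopt the newly introduced actions. To address these challenges, we propose a two-phase training strategy: Supervised Fine-Tuning (SFT) first instills fundamental action capabilities, and Reinforcement Learning (RL) then optimizes a strategic decision-making policy. In the RL phase in particular, we explore the reward design for each action and the format reward in depth. Extensive experiments on reasoning benchmarks such as Video-Holmes and LongVideo-Reason, and on long-video understanding benchmarks such as LongVideoBench, MLVU, VideoMME, and LVBench, show that FrameThinker improves on baselines by +10.4% on average while drastically reducing the number of processed frames. Most notably, the 7B FrameThinker model establishes a new state of the art on LongVideo-Reason, reaching 76.1% accuracy with an average of 20.6 frames. This not only outperforms the competitive LongVILA-R1 (72.0%) but does so with over 20x fewer frames (vs. 512), demonstrating unparalleled efficiency and effectiveness.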
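The "over 20x fewer frames" comparison in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the four figures quoted in the abstract; the function and constant names are illustrative and do not come from the paper.

```python
# Figures quoted in the abstract (LongVideo-Reason benchmark).
FRAMETHINKER_ACC = 76.1      # accuracy in percent
FRAMETHINKER_FRAMES = 20.6   # average number of frames processed
LONGVILA_R1_ACC = 72.0       # accuracy in percent
LONGVILA_R1_FRAMES = 512     # number of frames processed


def frame_reduction(baseline_frames: float, model_frames: float) -> float:
    """Return how many times fewer frames the model uses than the baseline."""
    return baseline_frames / model_frames


def accuracy_gain(model_acc: float, baseline_acc: float) -> float:
    """Return the accuracy difference in percentage points."""
    return model_acc - baseline_acc


if __name__ == "__main__":
    ratio = frame_reduction(LONGVILA_R1_FRAMES, FRAMETHINKER_FRAMES)
    gain = accuracy_gain(FRAMETHINKER_ACC, LONGVILA_R1_ACC)
    # Prints: 24.9x fewer frames, +4.1 accuracy points
    print(f"{ratio:.1f}x fewer frames, {gain:+.1f} accuracy points")
```

With the quoted numbers the ratio comes to roughly 24.9, which is consistent with the abstract's "over 20x" wording.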

Related papers