Fugu-MT Paper Translation (Summary): When Thinking Hurts: Mitigating Visual Forgetting in Video Reasoning via Frame Repetition

Paper overview: When Thinking Hurts: Mitigating Visual Forgetting in Video Reasoning via Frame Repetition

arxiv url: http://arxiv.org/abs/2603.16256v1
Date: Tue, 17 Mar 2026 08:41:54 GMT
Status: Translation complete
System update date: 2026-03-18 17:42:07.177123
Title: When Thinking Hurts: Mitigating Visual Forgetting in Video Reasoning via Frame Repetition
Authors: Xiaokun Sun, Yubo Wang, Haoyu Cao, Linli Xu
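Abstract summary: In Video Question Answering, models increasingly rely on self-generated text, sidelining visual inputs and causing hallucinations. The authors propose FrameRepeat, an automated enhancement framework with a lightweight repeat scoring module. Experiments show that FrameRepeat is effective and generalizable in strengthening important visual cues during the reasoning process.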
Reference score (independently computed attention score): 22.037040360505742
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, Multimodal Large Language Models (MLLMs) have demonstrated significant potential in complex visual tasks through the integration of Chain-of-Thought (CoT) reasoning. However, in Video Question Answering, extended thinking processes do not consistently yield performance gains and may even lead to degradation due to "visual anchor drifting", where models increasingly rely on self-generated text, sidelining visual inputs and causing hallucinations. While existing mitigations typically introduce specific mechanisms for the model to re-attend to visual inputs during inference, these approaches often incur prohibitive training costs and suffer from poor generalizability across different architectures. To address this, we propose FrameRepeat, an automated enhancement framework which features a lightweight repeat scoring module that enables Video-LLMs to autonomously identify which frames should be reinforced. We introduce a novel training strategy, Add-One-In (AOI), that uses MLLM output probabilities to generate supervision signals representing repeat gain. This can be used to train a frame scoring network, which guides the frame repetition behavior. Experimental results across multiple models and datasets demonstrate that FrameRepeat is both effective and generalizable in strengthening important visual cues during the reasoning process.
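The abstract does not spell out how Add-One-In computes its supervision signal or how repeated frames are placed in the input, so the following is only a rough sketch under stated assumptions. It assumes that the "repeat gain" of a frame is the change in the model's probability of the correct answer when that single frame is duplicated, and that a repeated frame is inserted directly after its original. The names `add_one_in_gains`, `repeat_top_frames`, and `toy_answer_prob` are hypothetical and do not come from the paper; the toy probability function stands in for a real MLLM, and the gains are used directly as scores in place of the trained frame scoring network.

```python
from typing import Callable, List, Sequence

Frame = str  # stand-in for a decoded frame; a real pipeline would use image tensors


def add_one_in_gains(
    frames: Sequence[Frame],
    answer_prob: Callable[[List[Frame]], float],
) -> List[float]:
    """Assumed AOI-style labels: gain in answer probability from repeating each frame once."""
    base = answer_prob(list(frames))
    gains = []
    for i in range(len(frames)):
        # Duplicate frame i right after its original position (an assumption).
        augmented = list(frames[: i + 1]) + [frames[i]] + list(frames[i + 1 :])
        gains.append(answer_prob(augmented) - base)
    return gains


def repeat_top_frames(
    frames: Sequence[Frame], scores: Sequence[float], k: int = 1
) -> List[Frame]:
    """Duplicate the k highest-scoring frames while keeping temporal order."""
    ranked = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:k])
    out: List[Frame] = []
    for i, frame in enumerate(frames):
        out.append(frame)
        if i in top:
            out.append(frame)  # the reinforced copy
    return out


def toy_answer_prob(frames: List[Frame]) -> float:
    """Toy stand-in for an MLLM: confidence grows with appearances of the key frame."""
    return min(1.0, 0.25 * frames.count("key"))


if __name__ == "__main__":
    video = ["a", "key", "b"]
    gains = add_one_in_gains(video, toy_answer_prob)
    print(gains)
    print(repeat_top_frames(video, gains, k=1))
```

In the setup the abstract describes, gains like these would serve as training targets for a lightweight scoring network, and that network's predictions, not the gains themselves, would drive repetition at inference time.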


Related papers