Fugu-MT Paper Translation (Abstract): Find, Fix, Reason: Context Repair for Video Reasoning

Paper summary: Find, Fix, Reason: Context Repair for Video Reasoning

arxiv url: http://arxiv.org/abs/2604.16243v1
Date: Fri, 17 Apr 2026 17:04:19 GMT
Status: Translation complete
Last updated in system: 2026-04-20 22:00:20.018839
Title: Find, Fix, Reason: Context Repair for Video Reasoning
Authors: Haojian Huang, Chuanyu Qin, Yinchuan Li, Yingcong Chen
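Abstract summary: Reinforcement learning has advanced video reasoning in large multi-modal models. A frozen, tool-integrated teacher identifies the missing spatiotemporal dependency and provides a minimal evidence patch. The paper proposes a Robust Improvement Reward (RIR) that aligns optimization with two goals: outcome validity through correct answers, and dependency alignment.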
Reference score (independently computed attention score): 45.021693494492666
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning has advanced video reasoning in large multi-modal models, yet dominant pipelines either rely on on-policy self-exploration, which plateaus at the model's knowledge boundary, or hybrid replay that mixes policies and demands careful regularization. Dynamic context methods zoom into focused evidence but often require curated pretraining and two-stage tuning, and their context remains bounded by a small model's capability. In contrast, larger models excel at instruction following and multi-modal understanding, can supply richer context to smaller models, and rapidly zoom in on target regions via simple tools. Building on this capability, we introduce an observation-level intervention: a frozen, tool-integrated teacher identifies the missing spatiotemporal dependency and provides a minimal evidence patch (e.g., timestamps or regions) from the original video while the question remains unchanged. The student answers again with the added context, and training updates with a chosen-rollout scheme integrated into Group Relative Policy Optimization (GRPO). We further propose a Robust Improvement Reward (RIR) that aligns optimization with two goals: outcome validity through correct answers and dependency alignment through rationales that reflect the cited evidence. Advantages are group-normalized across the batch, preserving on-policy exploration while directing it along causally meaningful directions with minimal changes to the training stack. Experiments on various related benchmarks show consistent accuracy gains and strong generalization. Web page and source code will be available at https://github.com/JethroJames/FFR.git.
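The group-normalized advantage mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not define RIR or the chosen-rollout scheme, so `toy_reward`, its `w_align` weight, and the rule that alignment only counts for correct answers are all hypothetical stand-ins. Only the normalization step, subtracting the group mean and dividing by the group standard deviation, follows the usual GRPO formulation.

```python
from statistics import mean, pstdev


def toy_reward(answer_correct: bool, rationale_cites_evidence: bool,
               w_align: float = 0.5) -> float:
    """Hypothetical reward with an outcome term and an alignment term.

    Assumption: the alignment bonus is granted only when the answer is
    also correct. The paper's actual RIR may be defined differently.
    """
    outcome = 1.0 if answer_correct else 0.0
    aligned = answer_correct and rationale_cites_evidence
    return outcome + (w_align if aligned else 0.0)


def group_normalized_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of four rollouts for the same question:
# (answer_correct, rationale_cites_evidence)
rollouts = [(True, True), (True, False), (False, False), (False, True)]
rewards = [toy_reward(correct, cites) for correct, cites in rollouts]
advantages = group_normalized_advantages(rewards)
```

In this toy group, a correct answer whose rationale reflects the cited evidence receives the largest advantage, a correct answer without that alignment receives a smaller one, and both incorrect answers share the same negative advantage. When every rollout in a group earns the same reward, all advantages are zero and the group contributes no gradient signal.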
