Fugu-MT Paper Translation (Abstract): Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning

Paper summary: Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning

arxiv url: http://arxiv.org/abs/2605.21988v1
Date: Thu, 21 May 2026 04:38:02 GMT
Status: Translation complete
Last updated in system: 2026-05-22 16:35:42.093007
Title: Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning
Title (reference translation): Learning the Spatiotemporal Sensitivity of Video LLMs via Counterfactual Reinforcement Learning
Authors: Dazhao Du, Jian Liu, Jialong Qin, Tao Han, Bohai Gu, Fangqi Zhu, Yujia Zhang, Eric Liu, Xi Chen, Song Guo
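Abstract summary: Video large language models (Video LLMs) achieve strong benchmark accuracy, yet often answer video questions through shortcuts such as single-frame cues and language priors rather than by tracking dynamics. The problem worsens in RL post-training, which can further reinforce shortcut policies that earn high reward without tracking video dynamics. The paper addresses this by asking: if the visual world changed, should the answer change or stay the same?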
Reference score (independently computed attention score): 24.29101453473451
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video large language models (Video LLMs) achieve strong benchmark accuracy, yet often answer video questions through shortcuts such as single-frame cues and language priors rather than by tracking spatiotemporal dynamics. This issue is exacerbated in RL post-training, where correctness-only rewards can further reinforce shortcut policies that obtain high reward without tracking video dynamics. We address this by asking a controlled counterfactual question: if the visual world changed while the question remained fixed, should the answer change or stay the same? Based on this view, we propose Counterfactual Relational Policy Optimization (CRPO), a dual-branch RL framework for improving spatiotemporal sensitivity. CRPO constructs counterfactual videos through horizontal flips and temporal reversals, trains on both original and counterfactual branches, and introduces a Counterfactual Relation Reward (CRR) between their answers. CRR encourages answers to change for dynamic questions and remain unchanged for static questions. This cross-branch constraint makes it difficult for shortcut policies to be consistently rewarded across both branches. To evaluate this property, we introduce DyBench, a paired counterfactual video benchmark with 3,014 videos covering reversible dynamics, moving direction, and event sequence, together with a strict pair-accuracy metric that prevents fixed-answer shortcuts from inflating scores. Experiments show that CRPO outperforms prior RL methods on spatiotemporal-sensitive evaluations while maintaining competitive general video performance. On Qwen3-VL-8B, CRPO improves DyBench P-Acc by +7.7 and TimeBlind I-Acc by +8.2 over the base model, indicating improved spatiotemporal sensitivity rather than stronger reliance on static shortcuts. The project website can be found at https://ddz16.github.io/crpo.github.io/.
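Abstract (reference translation): Video large language models (Video LLMs) achieve strong benchmark accuracy, but often answer video questions through shortcuts such as single-frame cues and language priors rather than by tracking spatiotemporal dynamics. The problem worsens in RL post-training, where shortcut policies that earn high reward without tracking video dynamics can be reinforced further. If the visual world changed while the question stayed fixed, should the answer change or stay the same? From this viewpoint, the authors propose Counterfactual Relational Policy Optimization (CRPO), a dual-branch RL framework. CRPO constructs counterfactual videos through horizontal flips and temporal reversals, trains on both the original and counterfactual branches, and introduces a Counterfactual Relation Reward (CRR) between their answers. CRR encourages answers to change for dynamic questions and to stay unchanged for static questions. This cross-branch constraint makes it difficult for a shortcut policy to be rewarded consistently on both branches. To evaluate this property, the paper introduces DyBench, a paired counterfactual video benchmark of 3,014 videos covering reversible dynamics, moving direction, and event sequence, together with a strict pair-accuracy metric that prevents fixed-answer shortcuts from inflating scores. Experiments show that CRPO outperforms prior RL methods on spatiotemporal-sensitive evaluations while maintaining competitive general video performance. On Qwen3-VL-8B, CRPO improves DyBench P-Acc by +7.7 and TimeBlind I-Acc by +8.2. The project website is available at https://ddz16.github.io/crpo.github.io/.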
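The abstract names three mechanisms: counterfactual videos built by horizontal flips and temporal reversals, a reward on the relation between the two branches' answers, and a strict pair-accuracy metric. The sketch below illustrates one plausible reading of each in plain Python. It is not the authors' implementation: the abstract gives no formulas, so the binary reward, the exact-match comparison, and every function name here are assumptions made for illustration.

```python
# Illustrative sketch only. All names and the binary reward form are
# assumptions; the abstract does not specify the actual CRR or P-Acc formulas.

def horizontal_flip(video):
    """Mirror every frame left-to-right.

    `video` is a list of frames; each frame is a list of rows of pixels.
    """
    return [[row[::-1] for row in frame] for frame in video]


def temporal_reverse(video):
    """Return the frames in reverse playback order."""
    return video[::-1]


def counterfactual_relation_reward(ans_orig, ans_cf, is_dynamic):
    """Assumed binary reward on the relation between the two answers.

    Dynamic question: the answers should differ across branches.
    Static question: the answers should be identical.
    """
    changed = ans_orig.strip().lower() != ans_cf.strip().lower()
    return 1.0 if changed == is_dynamic else 0.0


def pair_accuracy(preds, golds):
    """Fraction of pairs where BOTH the original and counterfactual
    answers are correct (assumed exact-match comparison)."""
    if not preds:
        return 0.0
    hits = sum(1 for p, g in zip(preds, golds) if tuple(p) == tuple(g))
    return hits / len(preds)


if __name__ == "__main__":
    # Two 1x3 frames; pixel values double as labels.
    video = [[[0, 1, 2]], [[3, 4, 5]]]
    print(horizontal_flip(video))
    print(temporal_reverse(video))
    # A policy that always answers "left" gets no relation reward on a
    # dynamic question and fails the pair whose gold answers differ.
    print(counterfactual_relation_reward("left", "left", is_dynamic=True))
    print(pair_accuracy([("left", "left")], [("left", "right")]))
```

Under these assumptions, a fixed-answer policy can be correct on one branch of a dynamic pair but never on both, which is the sense in which a pair-level metric resists shortcut answers.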

Related papers