Fugu-MT Paper Translation (Abstract): Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs

Paper summary: Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs

arxiv url: http://arxiv.org/abs/2605.01324v2
Date: Tue, 05 May 2026 09:38:12 GMT
Status: Translation complete
System update date: 2026-05-06 14:45:21.240831
Title: Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs
Authors: Jingze Wu, Quan Zhang, Hongfei Suo, Zeqiang Cai, Hongbo Chen,
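Abstract summary: This paper proposes a framework that cultivates robust reasoning in lightweight models through a two-stage debiasing process. The resulting model, VideoThinker-R1, establishes a new state of the art in video reasoning efficiency.
Reference score (site-computed attention score): 11.567226738245175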
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Although reinforcement learning (RL) has significantly advanced reasoning capabilities in large multimodal language models (MLLMs), its efficacy remains limited for lightweight models essential for edge deployments. To address this issue, we leverage causal analysis and experiment to reveal the underlying phenomenon of perceptual bias, demonstrating that RL-based fine-tuning compels lightweight models to preferentially adopt perceptual shortcuts induced by data biases, rather than developing genuine reasoning abilities. Motivated by this insight, we propose VideoThinker, a causal-inspired framework that cultivates robust reasoning in lightweight models through a two-stage debiasing process. First, the Bias Aware Training stage forges a dedicated "bias model" to embody these shortcut behaviors. Then, the Causal Debiasing Policy Optimization (CDPO) algorithm fine-tunes the primary model, employing an innovative repulsive objective to actively push it away from the bias model's flawed logic while simultaneously pulling it toward correct, generalizable solutions. Our model, VideoThinker-R1, establishes a new state-of-the-art in video reasoning efficiency. For same-scale comparison, requiring no Supervised Fine-Tuning (SFT) and using only 1 of the training data for RL, it surpasses VideoRFT-3B with a 3.2% average gain on widely-used benchmarks and a 7% lead on VideoMME. For cross-scale comparison, it outperforms the larger Video-UTR-7B model on multiple benchmarks, including a 2.1% gain on MVBench and a 3.8% gain on TempCompass. Code is available at https://github.com/falonss703/VideoThinker.
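The abstract describes an objective that pulls the primary model toward correct answers while pushing it away from a bias model, but it does not give the formula. The following is only a toy illustration of that attract/repel idea, not the paper's CDPO algorithm: it assumes a single multiple-choice question, uses cross-entropy as the attractive term, and subtracts a weighted KL divergence from a stand-in bias distribution as the repulsive term. The function names, the `beta` weight, and the choice of KL divergence are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def toy_debiasing_loss(policy_logits, bias_logits, correct_index, beta=0.1):
    """Hypothetical attract/repel loss (NOT the paper's CDPO objective).

    attract: cross-entropy toward the correct answer.
    repel:   KL divergence between the policy and the bias model;
             subtracting it rewards moving away from the bias model.
    """
    policy = softmax(policy_logits)
    bias = softmax(bias_logits)
    attract = -math.log(policy[correct_index])
    repel = kl_divergence(policy, bias)
    return attract - beta * repel


# The bias model prefers option 0 (a shortcut); the correct answer is option 2.
bias_logits = [2.0, 0.0, 0.0]
loss_good = toy_debiasing_loss([0.0, 0.0, 2.0], bias_logits, correct_index=2)
loss_shortcut = toy_debiasing_loss([2.0, 0.0, 0.0], bias_logits, correct_index=2)
print(loss_good < loss_shortcut)
```

In this sketch a policy that copies the bias model gets no repulsion credit and a large cross-entropy penalty, so its loss is higher than that of a policy concentrated on the correct option. A small `beta` keeps the unbounded KL term from dominating the attractive term.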

Related papers