Fugu-MT Paper Translation (Abstract): Structured Causal Video Reasoning via Multi-Objective Alignment

Paper summary: Structured Causal Video Reasoning via Multi-Objective Alignment

arxiv url: http://arxiv.org/abs/2604.04415v1
Date: Mon, 06 Apr 2026 04:49:30 GMT
Status: Translation complete
Last updated in system: 2026-04-07 15:49:19.09216
Title: Structured Causal Video Reasoning via Multi-Objective Alignment
Title (reference translation): Structured causal video reasoning via multi-objective alignment
Authors: Zinuo Li, Yongxin Guo, Jun Liu, Jiawei Zhan, Xi Jiang, Chengjie Wang, Mohammed Bennamoun, Farid Boussaid, Feng Zheng, Qiuhong Ke
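Abstract summary: This paper proposes constructing, prior to the reasoning stage, a compact representation of salient events and their causal relationships, named Structured Event Facts. This structured prior serves as an explicit constraint that promotes concise and causally grounded reasoning. The authors introduce CausalFact-60K and a four-stage training pipeline comprising facts alignment, format warm-start, thinking warm-start, and reinforcement learning-based post-training.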
Reference score (independently computed attention metric): 102.61829546891543
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Human understanding of video dynamics is typically grounded in a structured mental representation of entities, actions, and temporal relations, rather than relying solely on immediate deductive reasoning. In contrast, existing Video-LLMs largely depend on unstructured video reasoning, where critical visual evidence is embedded in verbose textual descriptions and temporal causality is often weakly modeled. This leads to inefficient processes and fragile causal inference. To bridge this cognitive gap, we propose constructing a compact representation of salient events and their causal relationships, which we name Structured Event Facts, prior to the reasoning stage. This structured prior serves as an explicit constraint to promote concise and causally grounded reasoning, while also making intermediate evidence easier to verify. To effectively train models on such structured facts, we introduce CausalFact-60K and a four-stage training pipeline comprising facts alignment, format warm-start, thinking warm-start, and reinforcement learning-based post-training. During the RL stage, we find that this framework introduces competing objectives, as structural completeness and causal fidelity must be balanced against reasoning length, making it difficult to optimize. We address this challenge by formulating the optimization as a Multi-Objective Reinforcement Learning (MORL) problem and explicitly optimizing toward the Pareto frontier to balance these trade-offs. As a result, we introduce Factum-4B, which yields more reliable reasoning and delivers stronger performance on challenging video understanding tasks requiring fine-grained temporal inference.
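Abstract (reference translation): People usually understand video dynamics through a structured mental representation of entities, actions, and temporal relations, not through immediate deductive reasoning alone. Existing Video-LLMs, by contrast, rely largely on unstructured video reasoning, in which critical visual evidence is buried in verbose textual descriptions and temporal causality is only weakly modeled. The result is inefficient processing and fragile causal inference. To bridge this cognitive gap, the authors propose constructing, prior to the reasoning stage, a compact representation of salient events and their causal relationships, named Structured Event Facts. This structured prior acts as an explicit constraint that promotes concise, causally grounded reasoning and makes intermediate evidence easier to verify. To train models effectively on such structured facts, they introduce CausalFact-60K and a four-stage training pipeline consisting of facts alignment, format warm-start, thinking warm-start, and reinforcement learning-based post-training. In the RL stage, the framework introduces competing objectives: structural completeness and causal fidelity must be balanced against reasoning length, which makes optimization difficult. The authors address this by formulating the optimization as a Multi-Objective Reinforcement Learning (MORL) problem and explicitly optimizing toward the Pareto frontier. The resulting model, Factum-4B, reasons more reliably and performs better on challenging video understanding tasks that require fine-grained temporal inference.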
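The abstract does not say how the Pareto frontier is computed or approached during RL, so the following is only a minimal sketch of the general idea of Pareto dominance, not the authors' method. It assumes each candidate response has already been scored on three objectives named in the abstract (structural completeness, causal fidelity, and brevity as a stand-in for shorter reasoning length). The function names and the reward values are hypothetical and chosen purely for illustration.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if `a` is at least as good as `b` on every objective
    and strictly better on at least one (higher is better)."""
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_front(points: List[Sequence[float]]) -> List[int]:
    """Return the indices of points that no other point dominates."""
    return [
        i
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


# Hypothetical reward vectors (not from the paper):
# (structural completeness, causal fidelity, brevity)
candidates = [
    (0.9, 0.8, 0.2),  # thorough but long
    (0.6, 0.7, 0.9),  # short but less complete
    (0.5, 0.6, 0.8),  # dominated by the second candidate
    (0.9, 0.8, 0.1),  # dominated by the first candidate
]

front = pareto_front(candidates)
print(front)  # [0, 1]
```

The first two candidates trade completeness against brevity, so neither dominates the other and both stay on the front. A single weighted sum of the rewards would force one fixed trade-off between them, which is the kind of difficulty the abstract attributes to the competing objectives.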


Related papers