Structured Causal Video Reasoning via Multi-Objective Alignment
Abstract Overview
This paper proposes a structure-first framework for video reasoning in which a model first produces Structured Event Facts—compact, time-ordered descriptions of salient events and their causal relations—and then reasons under those constraints. The method is trained with CausalFact-60K and a four-stage pipeline covering facts alignment, format warm-start, thinking warm-start, and reinforcement-learning-based post-training. To handle conflicting reinforcement learning objectives such as structural completeness, causal fidelity, task accuracy, and reasoning length, the authors introduce Pareto-Frontier guided Advantage Balancing (P-FAB), which treats reward components as separate objectives and solves a minimum-norm problem via Frank-Wolfe to compute a compromise update direction. The resulting 4B-parameter model, Factum-4B, is evaluated on temporal grounding and broader video understanding benchmarks against both open-source and closed-source baselines.
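To make the intermediate representation concrete, a minimal sketch of what Structured Event Facts might look like follows. The paper's exact schema is not reproduced here; the `EventFact` class, its field names, and the sample events are illustrative assumptions, not the authors' format:

```python
from dataclasses import dataclass, field

@dataclass
class EventFact:
    """One salient event with its time span (seconds) and causal links.

    Hypothetical schema: compact, time-ordered, with `causes` holding the
    ids of earlier events this one causally depends on.
    """
    event_id: str
    description: str
    start: float
    end: float
    causes: list[str] = field(default_factory=list)

# A tiny, made-up example of the kind of fact list the model would emit
# before reasoning, ordered by start time:
facts = [
    EventFact("e1", "a man lights the stove", 2.0, 4.5),
    EventFact("e2", "a pan begins to smoke", 10.0, 13.0, causes=["e1"]),
]
```

Downstream reasoning would then be constrained to cite only events and causal edges present in this list, rather than free-form chain-of-thought over raw frames.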
Novelty
The paper's main novelty is the explicit use of Structured Event Facts as an intermediate representation that constrains subsequent causal reasoning, replacing unconstrained chain-of-thought over video. It also introduces P-FAB, a Pareto-frontier-inspired multi-objective RL method that dynamically balances competing reward signals during post-training by solving a minimum-norm problem in standardized reward space, together with the CausalFact-60K data pipeline and a four-stage curriculum designed to stabilize this structure-first behavior.
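The minimum-norm problem P-FAB solves is the classic multiple-gradient-descent-style projection of the origin onto the convex hull of per-objective directions, for which Frank-Wolfe is a standard solver. The sketch below is a generic implementation of that subproblem under stated assumptions (function name, fixed iteration budget, and exact line search are choices made here, not the paper's implementation):

```python
import numpy as np

def min_norm_weights(grads: np.ndarray, iters: int = 50) -> np.ndarray:
    """Frank-Wolfe solver for min_w ||sum_i w_i g_i||^2 over the simplex.

    grads: (k, d) array of per-objective gradient (or advantage) vectors.
    Returns simplex weights w; the compromise direction is w @ grads.
    """
    k = grads.shape[0]
    G = grads @ grads.T              # Gram matrix of pairwise inner products
    w = np.full(k, 1.0 / k)          # start at the simplex centre
    for _ in range(iters):
        # Linear minimisation step: the vertex (single objective) whose
        # gradient correlates least with the current combination.
        t = int(np.argmin(G @ w))
        e = np.zeros(k)
        e[t] = 1.0
        d = e - w
        denom = d @ G @ d
        if denom <= 1e-12:           # already at the minimum-norm point
            break
        # Exact line search on the quadratic ||(w + gamma*d) . g||^2
        gamma = np.clip(-(w @ G @ d) / denom, 0.0, 1.0)
        w = w + gamma * d
    return w
```

With two orthogonal objectives the solver returns equal weights (the balanced compromise); when one objective's gradient dominates a colinear pair, all weight collapses onto the shorter vector, which is the minimum-norm point of the hull.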
Results
Factum-4B achieves 57.1/40.4/21.6 (R1@0.3/0.5/0.7) on Charades-TimeLens and 69.8/48.4/28.1 on ActivityNet-Captions, while reaching 64.7 on VideoMME and 73.6 on NExT-GQA, setting a new open-source state of the art for temporal grounding at 4B scale. In ablations, removing either the facts stage or the thinking stage degrades performance across all benchmarks, and RL post-training improves ActivityNet-Captions R1@0.3 from 61.5 to 69.8. P-FAB consistently outperforms standard GRPO, with the margin on ActivityNet R1@0.3 widening from 1.2% to 2.5% as the group size increases from 4 to 8.
Key Points
- The method separates video reasoning into a structured fact-extraction stage followed by causally constrained thinking, aiming to curb verbose, weakly grounded chain-of-thought; ablations confirm both stages are necessary, with their removal causing consistent performance drops across temporal grounding and general understanding benchmarks.
- Training relies on CausalFact-60K and a four-stage curriculum including an intermediate format warm-start (Stage 1.5) to stabilize the required reasoning structure before full causal reasoning and RL alignment; the authors note that skipping this stage causes the model to produce malformed or hallucinated structure.
- The P-FAB multi-objective RL algorithm dynamically balances competing reward signals by solving a minimum-norm problem over standardized per-objective advantages, consistently outperforming standard GRPO with larger gains at group size 8, though the authors acknowledge that limited training data currently constrains performance on some general video understanding tasks.