Structured Causal Video Reasoning via Multi-Objective Alignment
Abstract Overview
This paper proposes a structure-first framework for video reasoning in which a model first produces Structured Event Facts—compact, time-ordered descriptions of salient events and their causal relations—and then reasons under those constraints. The method is trained with CausalFact-60K and a four-stage pipeline covering facts alignment, format warm-start, thinking warm-start, and reinforcement-learning-based post-training. To handle conflicting reinforcement learning objectives such as structural completeness, causal fidelity, task accuracy, and reasoning length, the authors introduce Pareto-Frontier guided Advantage Balancing (P-FAB), which treats reward components as separate objectives and solves a minimum-norm problem via Frank-Wolfe to compute a compromise update direction. The resulting 4B-parameter model, Factum-4B, is evaluated on temporal grounding and broader video understanding benchmarks against both open-source and closed-source baselines.
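To make the intermediate representation concrete, a minimal sketch of what Structured Event Facts might look like follows. The paper's exact schema is not reproduced here; the `EventFact` class, its field names, and the sample events are illustrative assumptions, not the authors' format:

```python
from dataclasses import dataclass, field

@dataclass
class EventFact:
    """One salient event with its time span (seconds) and causal links.

    Hypothetical schema: compact, time-ordered, with `causes` holding the
    ids of earlier events this one causally depends on.
    """
    event_id: str
    description: str
    start: float
    end: float
    causes: list[str] = field(default_factory=list)

# A tiny, made-up example of the kind of fact list the model would emit
# before reasoning, ordered by start time:
facts = [
    EventFact("e1", "a man lights the stove", 2.0, 4.5),
    EventFact("e2", "a pan begins to smoke", 10.0, 13.0, causes=["e1"]),
]
```

Downstream reasoning would then be constrained to cite only events and causal edges present in this list, rather than free-form chain-of-thought over raw frames.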
Novelty
The paper's main novelty is the explicit use of Structured Event Facts as an intermediate representation that constrains subsequent causal reasoning, replacing unconstrained chain-of-thought over video. It also introduces P-FAB, a Pareto-frontier-inspired multi-objective RL method that dynamically balances competing reward signals during post-training by solving a minimum-norm problem in standardized reward space, together with the CausalFact-60K data pipeline and a four-stage curriculum designed to stabilize this structure-first behavior.
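The minimum-norm problem P-FAB solves is the classic multiple-gradient-descent-style projection of the origin onto the convex hull of per-objective directions, for which Frank-Wolfe is a standard solver. The sketch below is a generic implementation of that subproblem under stated assumptions (function name, fixed iteration budget, and exact line search are choices made here, not the paper's implementation):

```python
import numpy as np

def min_norm_weights(grads: np.ndarray, iters: int = 50) -> np.ndarray:
    """Frank-Wolfe solver for min_w ||sum_i w_i g_i||^2 over the simplex.

    grads: (k, d) array of per-objective gradient (or advantage) vectors.
    Returns simplex weights w; the compromise direction is w @ grads.
    """
    k = grads.shape[0]
    G = grads @ grads.T              # Gram matrix of pairwise inner products
    w = np.full(k, 1.0 / k)          # start at the simplex centre
    for _ in range(iters):
        # Linear minimisation step: the vertex (single objective) whose
        # gradient correlates least with the current combination.
        t = int(np.argmin(G @ w))
        e = np.zeros(k)
        e[t] = 1.0
        d = e - w
        denom = d @ G @ d
        if denom <= 1e-12:           # already at the minimum-norm point
            break
        # Exact line search on the quadratic ||(w + gamma*d) . g||^2
        gamma = np.clip(-(w @ G @ d) / denom, 0.0, 1.0)
        w = w + gamma * d
    return w
```

With two orthogonal objectives the solver returns equal weights (the balanced compromise); when one objective's gradient dominates a colinear pair, all weight collapses onto the shorter vector, which is the minimum-norm point of the hull.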
Results
Factum-4B achieves 57.1/40.4/21.6 (R1@0.3/0.5/0.7) on Charades-TimeLens and 69.8/48.4/28.1 on ActivityNet-Captions, while reaching 64.7 on VideoMME and 73.6 on NExT-GQA, setting a new open-source state of the art for temporal grounding at 4B scale. In ablations, removing either the facts stage or the thinking stage degrades performance across all benchmarks, and RL post-training improves ActivityNet-Captions R1@0.3 from 61.5 to 69.8. P-FAB consistently outperforms standard GRPO, with the margin on ActivityNet R1@0.3 widening from 1.2% to 2.5% as the group size increases from 4 to 8.
Key Points
- The method separates video reasoning into a structured fact-extraction stage followed by causally constrained thinking, aiming to curb verbose, weakly grounded chain-of-thought; ablations confirm both stages are necessary, with their removal causing consistent performance drops across temporal grounding and general understanding benchmarks.
- Training relies on CausalFact-60K and a four-stage curriculum including an intermediate format warm-start (Stage 1.5) to stabilize the required reasoning structure before full causal reasoning and RL alignment; the authors note that skipping this stage causes the model to produce malformed or hallucinated structure.
- The P-FAB multi-objective RL algorithm dynamically balances competing reward signals by solving a minimum-norm problem over standardized per-objective advantages, consistently outperforming standard GRPO with larger gains at group size 8, though the authors acknowledge that limited training data currently constrains performance on some general video understanding tasks.