CausalStep: A Benchmark for Explicit Stepwise Causal Reasoning in Videos
- URL: http://arxiv.org/abs/2507.16878v1
- Date: Tue, 22 Jul 2025 12:29:13 GMT
- Title: CausalStep: A Benchmark for Explicit Stepwise Causal Reasoning in Videos
- Authors: Xuchen Li, Xuzhao Li, Shiyu Hu, Kaiqi Huang, Wentao Zhang,
- Abstract summary: CausalStep is a benchmark for explicit stepwise causal reasoning in videos. It segments videos into causally linked units and enforces a strict stepwise question-answer protocol. The benchmark features 100 videos across six categories and 1,852 multiple-choice QA pairs.
- Score: 39.4914316747781
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in large language models (LLMs) have improved reasoning in text and image domains, yet achieving robust video reasoning remains a significant challenge. Existing video benchmarks mainly assess shallow understanding and reasoning and allow models to exploit global context, failing to rigorously evaluate true causal and stepwise reasoning. We present CausalStep, a benchmark designed for explicit stepwise causal reasoning in videos. CausalStep segments videos into causally linked units and enforces a strict stepwise question-answer (QA) protocol, requiring sequential answers and preventing shortcut solutions. Each question includes carefully constructed distractors based on error type taxonomy to ensure diagnostic value. The benchmark features 100 videos across six categories and 1,852 multiple-choice QA pairs. We introduce seven diagnostic metrics for comprehensive evaluation, enabling precise diagnosis of causal reasoning capabilities. Experiments with leading proprietary and open-source models, as well as human baselines, reveal a significant gap between current models and human-level stepwise reasoning. CausalStep provides a rigorous benchmark to drive progress in robust and interpretable video reasoning.
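To make the protocol concrete, the sketch below shows how a strict stepwise evaluation loop might look: the model sees only the video up to the current causally linked unit and must commit to each answer in order before the next question is revealed. This is a minimal illustration under assumed data fields and an assumed `model.answer` interface, not the benchmark's released code.

```python
# Minimal sketch of a strict stepwise QA protocol in the spirit of CausalStep.
# Data fields, the `model.answer` interface, and the metric are illustrative
# assumptions, not the benchmark's released code.
from dataclasses import dataclass

@dataclass
class StepQA:
    segment_id: int      # index of the causally linked video unit
    question: str        # question about this segment
    options: list[str]   # multiple-choice options, incl. taxonomy-based distractors
    answer: str          # ground-truth option label, e.g. "B"

def evaluate_stepwise(model, video_segments, qa_steps):
    """Ask questions strictly in order; the model sees only segments up to the
    current step and its own earlier answers, blocking global-context shortcuts."""
    history, correct = [], 0
    for step in sorted(qa_steps, key=lambda s: s.segment_id):
        visible = video_segments[: step.segment_id + 1]   # no look-ahead
        pred = model.answer(visible, step.question, step.options, history)
        history.append((step.question, pred))
        correct += int(pred == step.answer)
    return correct / len(qa_steps)   # per-step accuracy (one possible metric)
```

Restricting the visible context and enforcing the answer order is what separates such a protocol from benchmarks that let a model answer every question with the full video in view.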
Related papers
- ImplicitQA: Going beyond frames towards Implicit Video Reasoning [36.65883181090953]
ImplicitQA is a novel benchmark designed to test models on implicit reasoning. It comprises 1K meticulously annotated QA pairs derived from 320+ high-quality creative video clips.
arXiv Detail & Related papers (2025-06-26T19:53:54Z)
- Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute [57.16286134405821]
We propose Fractional Reasoning, a framework that enables continuous control over reasoning intensity at inference time. Our method operates by extracting the latent steering vector associated with deeper reasoning and reapplying it with a tunable scaling factor (a rough sketch of this steering pattern appears after this list). Experiments on GSM8K, MATH500, and GPQA demonstrate that Fractional Reasoning consistently improves performance across diverse reasoning tasks and models.
arXiv Detail & Related papers (2025-06-18T21:15:59Z)
- Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? [56.06537213958482]
We present Video-Holmes, a benchmark designed to evaluate the complex video reasoning capabilities of MLLMs. Video-Holmes consists of 1,837 questions derived from 270 manually annotated suspense short films. Our comprehensive evaluation of state-of-the-art MLLMs reveals that, while these models generally excel at visual perception, they encounter substantial difficulties with integrating information.
arXiv Detail & Related papers (2025-05-27T16:05:01Z)
- VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models [29.706347050700867]
We introduce a novel benchmark named Video-based long-form Causal Reasoning (VCRBench). VCRBench tests whether Large Video Language Models (LVLMs) can identify, reason about, and correctly sequence the events needed to accomplish a specific goal. We propose Recognition-Reasoning Decomposition (RRD), a modular approach that breaks video-based causal reasoning into two sub-tasks of video recognition and causal reasoning.
arXiv Detail & Related papers (2025-05-13T11:35:58Z)
- MINERVA: Evaluating Complex Video Reasoning [72.12644008002566]
We provide a new video reasoning dataset called MINERVA for modern multimodal models. Our dataset is multimodal, diverse in terms of video domain and length, and consists of complex multi-step questions. We perform fine-grained error analysis to identify common failure modes across various models, and create a taxonomy of reasoning errors.
arXiv Detail & Related papers (2025-05-01T17:41:49Z)
- VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning [22.357792064450955]
Chain-of-Thought (CoT) reasoning has significantly enhanced the capabilities of large language models (LLMs) and large vision-language models (LVLMs). We introduce VCR-Bench, a novel benchmark designed to comprehensively evaluate LVLMs' Video Chain-of-Thought Reasoning capabilities. VCR-Bench comprises 859 videos spanning a variety of video content and durations, along with 1,034 high-quality question-answer pairs.
arXiv Detail & Related papers (2025-04-10T17:59:03Z)
- VACT: A Video Automatic Causal Testing System and a Benchmark [55.53300306960048]
VACT is an **automated** framework for modeling, evaluating, and measuring the causal understanding of VGMs in real-world scenarios. We introduce multi-level causal evaluation metrics to provide a detailed analysis of the causal performance of VGMs.
arXiv Detail & Related papers (2025-03-08T10:54:42Z)
- Unveiling Reasoning Thresholds in Language Models: Scaling, Fine-Tuning, and Interpretability through Attention Maps [3.8936716676293917]
This study investigates the in-context learning capabilities of various decoder-only transformer-based language models with different model sizes and training data. We identify a critical parameter threshold (1.6 billion), beyond which reasoning performance improves significantly in tasks such as commonsense reasoning in multiple-choice question answering and deductive reasoning.
arXiv Detail & Related papers (2025-02-21T00:48:32Z)
- STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training [87.58996020705258]
Video Large Language Models (Video-LLMs) have recently shown strong performance in basic video understanding tasks. However, Video-LLMs struggle with compositional reasoning that requires multi-step explicit spatio-temporal inference across object relations, interactions, and events. We propose STEP, a novel graph-guided self-training method that enables Video-LLMs to generate reasoning-rich finetuning data from any raw videos to improve themselves.
arXiv Detail & Related papers (2024-11-29T11:54:55Z)
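The Fractional Reasoning entry above describes extracting a "deeper reasoning" steering vector and re-applying it with a tunable scale at inference time. The sketch below illustrates that general activation-steering pattern with a PyTorch forward hook; the layer index, the vector name `v_reason`, and the tuple handling are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical activation-steering sketch: add a scaled steering vector to a
# transformer layer's hidden states at inference time. Not the paper's code.
import torch

def fractional_steering_hook(steering_vec: torch.Tensor, alpha: float):
    """Forward hook that adds alpha * steering_vec to a layer's hidden states."""
    def hook(module, inputs, output):
        # Many decoder layers return a tuple whose first element is the hidden states.
        if isinstance(output, tuple):
            return (output[0] + alpha * steering_vec,) + output[1:]
        return output + alpha * steering_vec
    return hook

# Assumed usage: attach to a mid-network layer, generate, then remove the hook.
# handle = model.model.layers[18].register_forward_hook(
#     fractional_steering_hook(v_reason, alpha=0.7))
# ... model.generate(...) ...
# handle.remove()
```

A larger `alpha` pushes activations further along the assumed "deeper reasoning" direction, which is the continuous control over reasoning intensity that the entry refers to.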
This list is automatically generated from the titles and abstracts of the papers on this site.