Can you SPLICE it together? A Human Curated Benchmark for Probing Visual Reasoning in VLMs
- URL: http://arxiv.org/abs/2509.24640v1
- Date: Mon, 29 Sep 2025 11:50:18 GMT
- Title: Can you SPLICE it together? A Human Curated Benchmark for Probing Visual Reasoning in VLMs
- Authors: Mohamad Ballout, Okajevo Wilfred, Seyedalireza Yaghoubi, Nohayr Muhammad Abdelmoneim, Julius Mayer, Elia Bruni
- Abstract summary: SPLICE is a benchmark designed to probe event-based reasoning across multiple dimensions. It includes 3,381 human-filtered videos spanning 12 categories and 180 sub-categories, such as sports, engineering, and housework. We evaluate both human participants and state-of-the-art vision-language models (VLMs) on the task of rearranging these clips into coherent event sequences.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we introduce SPLICE, a human-curated benchmark derived from the COIN instructional video dataset, designed to probe event-based reasoning across multiple dimensions: temporal, causal, spatial, contextual, and general knowledge. SPLICE includes 3,381 human-filtered videos spanning 12 categories and 180 sub-categories, such as sports, engineering, and housework. These videos are segmented into a total of 11,423 event clips. We evaluate both human participants and state-of-the-art vision-language models (VLMs) on the task of rearranging these clips into coherent event sequences to assess visual reasoning capabilities. Results reveal a significant gap: VLMs struggle to match human performance. While human-annotated textual descriptions improve model accuracy, they do not affect human performance, suggesting that models rely more on language priors than on visual understanding. Even with annotations, VLMs fall short of human-level reasoning, underscoring persistent challenges in visual reasoning. A deeper analysis across sub-categories shows that VLMs perform relatively better on videos where temporal and causal reasoning are dominant, compared to those where contextual and spatial reasoning are dominant. They also perform better on everyday tasks than on specialized ones.
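The abstract does not say how a predicted clip ordering is scored. As one concrete possibility, the minimal sketch below scores a rearrangement by pairwise ordering accuracy (the fraction of clip pairs placed in the correct relative order), a common choice for sequencing tasks; the metric choice, function name, and toy data are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of scoring a clip-reordering prediction, assuming a
# pairwise ordering accuracy metric (the SPLICE abstract does not state
# which metric is used; this is one common choice for sequencing tasks).

from itertools import combinations

def pairwise_ordering_accuracy(predicted: list[int], ground_truth: list[int]) -> float:
    """Fraction of clip pairs whose relative order matches the ground truth."""
    rank = {clip: i for i, clip in enumerate(ground_truth)}
    pairs = list(combinations(predicted, 2))
    correct = sum(1 for a, b in pairs if rank[a] < rank[b])
    return correct / len(pairs)

# Example: a 5-clip event where the model swaps two adjacent clips.
print(pairwise_ordering_accuracy([0, 2, 1, 3, 4], [0, 1, 2, 3, 4]))  # 0.9
```

An exact-match criterion would be stricter; pairwise accuracy gives partial credit for mostly correct orderings.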
Related papers
- GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them?
GLIMPSE consists of 3,269 videos and over 4,342 highly visual-centric questions across 11 categories. All questions are carefully crafted by human annotators and require watching the entire video and reasoning over the full video context. In human evaluations, GLIMPSE achieves 94.82% accuracy, but current LVLMs face significant challenges.
arXiv Detail & Related papers (2025-07-13T04:44:57Z)
- HV-MMBench: Benchmarking MLLMs for Human-Centric Video Understanding
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. We propose a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding.
arXiv Detail & Related papers (2025-07-07T11:52:24Z)
- ImplicitQA: Going beyond frames towards Implicit Video Reasoning
ImplicitQA is a novel benchmark designed to test VideoQA models on human-like implicit reasoning. It comprises 1K meticulously annotated QA pairs drawn from 1K high-quality creative video clips.
arXiv Detail & Related papers (2025-06-26T19:53:54Z)
- Fostering Video Reasoning via Next-Event Prediction
We propose next-event prediction (NEP) as a learning task that harnesses future video segments as a rich, self-supervised signal to foster temporal reasoning (a sketch of the data construction follows this entry). To support this task, we curate V1-33K, a dataset comprising 33,000 automatically extracted video segments spanning diverse real-world scenarios. Experiments validate that NEP offers a scalable and effective training paradigm for fostering temporal reasoning in MLLMs.
arXiv Detail & Related papers (2025-05-28T15:13:34Z)
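The abstract describes NEP only at a high level. The following is a minimal sketch, assuming each video is pre-segmented into clips with textual event descriptions; the sample format, field names, and prompt are illustrative assumptions, not the paper's actual recipe.

```python
# Hedged sketch of building a next-event-prediction (NEP) training example.
# We assume each video is pre-segmented into clips and the model must
# describe the event following an observed prefix.

from dataclasses import dataclass

@dataclass
class NEPSample:
    observed_clips: list[str]   # paths to the clips the model may watch
    prompt: str                 # instruction given to the MLLM
    target: str                 # description of the withheld next event

def make_nep_sample(clips: list[str], descriptions: list[str], cut: int) -> NEPSample:
    """Split a segmented video at `cut`: clips before it are the input,
    the description of the clip right after it is the supervision signal."""
    return NEPSample(
        observed_clips=clips[:cut],
        prompt="Given the video so far, describe the most likely next event.",
        target=descriptions[cut],
    )

sample = make_nep_sample(
    clips=["clip_0.mp4", "clip_1.mp4", "clip_2.mp4"],
    descriptions=["open hood", "check oil level", "close hood"],
    cut=2,
)
print(sample.target)  # "close hood"
```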
- Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
We present Video-Holmes, a benchmark designed to evaluate the complex video reasoning capabilities of MLLMs. Video-Holmes consists of 1,837 questions derived from 270 manually annotated suspense short films. Our comprehensive evaluation of state-of-the-art MLLMs reveals that, while these models generally excel at visual perception, they encounter substantial difficulties with integrating information.
arXiv Detail & Related papers (2025-05-27T16:05:01Z)
- HiERO: understanding the hierarchy of human behavior enhances reasoning on egocentric videos
We present HiERO, a weakly-supervised method to enrich video segment features with hierarchical activity threads. By aligning video clips with their narrated descriptions, HiERO supports contextual, semantic, and temporal reasoning through a hierarchical architecture (a generic alignment-loss sketch follows this entry).
arXiv Detail & Related papers (2025-05-19T09:47:41Z)
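HiERO's hierarchical architecture is not detailed in the abstract. The sketch below shows one standard way to "align video clips with their narrated descriptions": a symmetric InfoNCE-style contrastive loss over a batch of matched clip-text embedding pairs. This is a generic stand-in, not HiERO's actual objective.

```python
# Hedged sketch of clip-narration alignment via a symmetric contrastive
# (InfoNCE-style) loss; HiERO's actual hierarchical design is not reproduced.

import torch
import torch.nn.functional as F

def clip_narration_alignment_loss(clip_emb: torch.Tensor,
                                  text_emb: torch.Tensor,
                                  temperature: float = 0.07) -> torch.Tensor:
    """clip_emb, text_emb: (B, D) embeddings where row i of each is a
    matched clip/narration pair; other rows in the batch act as negatives."""
    clip_emb = F.normalize(clip_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = clip_emb @ text_emb.T / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))         # matched pairs on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = clip_narration_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
```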
- Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events
BlackSwanSuite is a benchmark for evaluating vision-language models' ability to reason about unexpected events. We curate a comprehensive benchmark suite comprising over 3,800 multiple-choice, 4,900 generative, and 6,700 yes/no questions spanning 1,655 videos. We find significant performance gaps of up to 32% between models and humans on these tasks.
arXiv Detail & Related papers (2024-12-07T19:19:03Z)
- FIOVA: A Multi-Annotator Benchmark for Human-Aligned Video Captioning
We introduce FIOVA, a human-centric benchmark tailored for the evaluation of large vision-language models (LVLMs). It comprises 3,002 real-world videos (about 33.6 seconds each), each annotated independently by five annotators. We propose FIOVA-DQ, an event-level evaluation metric that incorporates cognitive weights derived from annotator consensus (one possible formulation is sketched after this entry).
arXiv Detail & Related papers (2024-10-20T03:59:54Z)
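The abstract only names FIOVA-DQ's ingredients (event-level scoring, cognitive weights from annotator consensus). The sketch below is one plausible formulation under those ingredients: each event is weighted by the fraction of the five annotators who mention it, and a candidate caption earns the weight of the events it covers. The actual metric may differ.

```python
# Hedged sketch of a consensus-weighted, event-level caption score in the
# spirit of FIOVA-DQ. An event's "cognitive weight" is assumed to be the
# fraction of annotators who mentioned it; this is illustrative only.

def consensus_weights(annotations: list[set[str]]) -> dict[str, float]:
    """Map each event to the fraction of annotators that mentioned it."""
    events = set().union(*annotations)
    return {e: sum(e in a for a in annotations) / len(annotations) for e in events}

def weighted_event_recall(caption_events: set[str], weights: dict[str, float]) -> float:
    """Weight of the events the caption covers, normalized by total weight."""
    total = sum(weights.values())
    covered = sum(w for e, w in weights.items() if e in caption_events)
    return covered / total if total else 0.0

annotators = [{"pour water", "stir"}, {"pour water"}, {"pour water", "stir"},
              {"stir"}, {"pour water", "stir", "serve"}]
w = consensus_weights(annotators)
print(weighted_event_recall({"pour water", "stir"}, w))  # ~0.89: covers consensus events
```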
- Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis
Vision-language models (VLMs) have shown impressive zero- and few-shot performance on real-world visual question answering benchmarks.
It remains unclear whether a VLM's apparent visual reasoning performance is due to its world knowledge, or due to actual visual reasoning capabilities.
arXiv Detail & Related papers (2024-08-27T14:43:54Z)
- Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies
This paper introduces a novel dataset, Tropes in Movies (TiM), designed as a testbed for exploring two critical yet previously overlooked video reasoning skills.
Utilizing tropes from movie storytelling, TiM evaluates the reasoning capabilities of state-of-the-art LLM-based approaches. To address the deficiencies it reveals, we propose Face-Enhanced Viper of Role Interactions (FEVoRI) and Context Query Reduction (ConQueR).
arXiv Detail & Related papers (2024-06-16T12:58:31Z)
- VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment
We introduce VELOCITI, a benchmark to study Video-LLMs by disentangling and assessing the comprehension of agents. We adopt the Video-Language Entailment setup and propose StrictVLE, which requires correct classification (rather than ranking) of the positive and negative captions (the strict criterion is sketched after this entry). Results show that action understanding lags behind agent understanding, and that models perform worse on negative captions created from entities appearing in the video than on those obtained from pure text manipulation.
arXiv Detail & Related papers (2024-06-16T10:42:21Z)
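The distinction StrictVLE draws is easy to make concrete. In the hedged sketch below, scores are assumed to be the model's entailment probability for a caption given the video, with 0.5 as an illustrative threshold: a ranking criterion only asks that the positive caption outscore the negative, while the strict criterion additionally requires both captions to be classified correctly.

```python
# Hedged sketch contrasting ranking-style evaluation with the stricter
# classification criterion the VELOCITI abstract describes (StrictVLE).
# Scores and the 0.5 threshold are illustrative assumptions.

def ranking_correct(pos_score: float, neg_score: float) -> bool:
    # Ranking: enough for the positive caption to outscore the negative one.
    return pos_score > neg_score

def strict_correct(pos_score: float, neg_score: float, thr: float = 0.5) -> bool:
    # Strict: positive must be classified entailed AND negative not entailed.
    return pos_score >= thr and neg_score < thr

# A model can win the ranking while failing strict classification:
print(ranking_correct(0.45, 0.30))  # True
print(strict_correct(0.45, 0.30))   # False: positive not classified as entailed
```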
- NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory
Narrations-as-Queries (NaQ) is a data augmentation strategy that transforms standard video-text narrations into training data for a video query localization model (a sketch of the idea follows this entry).
NaQ improves multiple top models by substantial margins (even doubling their accuracy).
We also demonstrate unique properties of our approach, such as the ability to perform zero-shot and few-shot NLQ, and improved performance on queries about long-tail object categories.
arXiv Detail & Related papers (2023-01-02T16:40:15Z)
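The core NaQ transformation, turning a timestamped narration into a natural language query (NLQ) training pair, can be sketched as below. The window size, the speaker-tag stripping, and the output format are illustrative assumptions, not NaQ's exact recipe.

```python
# Hedged sketch of the Narrations-as-Queries idea: turn timestamped video
# narrations into (query, temporal window) pairs for training an NLQ
# localization model. Window size and tag handling are illustrative.

def narration_to_query(narration: str, timestamp: float, half_window: float = 2.0):
    """Use the narration text as a query and a window around its timestamp
    as the ground-truth temporal segment."""
    query = narration.replace("#C C ", "").strip()  # strip Ego4D-style speaker tags
    start, end = max(0.0, timestamp - half_window), timestamp + half_window
    return {"query": query, "segment": (start, end)}

print(narration_to_query("#C C opens the fridge", timestamp=12.4))
# {'query': 'opens the fridge', 'segment': (10.4, 14.4)}
```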