Cinéaste: A Fine-grained Contextual Movie Question Answering Benchmark
- URL: http://arxiv.org/abs/2509.14227v1
- Date: Wed, 17 Sep 2025 17:58:06 GMT
- Title: Cinéaste: A Fine-grained Contextual Movie Question Answering Benchmark
- Authors: Nisarg A. Shah, Amir Ziai, Chaitanya Ekanadham, Vishal M. Patel
- Abstract summary: We introduce $\mathsf{Cin\acute{e}aste}$, a comprehensive benchmark for long-form movie understanding. Our dataset comprises 3,119 multiple-choice question-answer pairs derived from 1,805 scenes across 200 movies. Experiments show that existing MLLMs struggle on $\mathsf{Cin\acute{e}aste}$; our analysis reveals that long-range temporal reasoning is a primary bottleneck.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: While recent advancements in vision-language models have improved video understanding, diagnosing their capacity for deep, narrative comprehension remains a challenge. Existing benchmarks often test short-clip recognition or use template-based questions, leaving a critical gap in evaluating fine-grained reasoning over long-form narrative content. To address these gaps, we introduce $\mathsf{Cin\acute{e}aste}$, a comprehensive benchmark for long-form movie understanding. Our dataset comprises 3,119 multiple-choice question-answer pairs derived from 1,805 scenes across 200 diverse movies, spanning five novel fine-grained contextual reasoning categories. We use GPT-4o to generate diverse, context-rich questions by integrating visual descriptions, captions, scene titles, and summaries, which require deep narrative understanding. To ensure high-quality evaluation, our pipeline incorporates a two-stage filtering process: Context-Independence filtering ensures questions require video context, while Contextual Veracity filtering validates factual consistency against the movie content, mitigating hallucinations. Experiments show that existing MLLMs struggle on $\mathsf{Cin\acute{e}aste}$; our analysis reveals that long-range temporal reasoning is a primary bottleneck, with the top open-source model achieving only 63.15\% accuracy. This underscores significant challenges in fine-grained contextual understanding and the need for advancements in long-form movie comprehension.
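The two-stage filtering pipeline described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the paper's actual implementation: the paper uses GPT-4o-based judges, whereas the stand-in checks below (a blind-model guess and a simple string match against the scene description) are assumptions chosen only to make the control flow concrete.

```python
# Hedged sketch of the two-stage QA filtering described in the abstract.
# Context-Independence filtering: drop questions a text-only model answers
# correctly WITHOUT the video (the question leaks its answer).
# Contextual Veracity filtering: drop questions whose keyed answer is not
# supported by the movie content (here, a toy substring check stands in
# for the paper's LLM-based fact check).

from dataclasses import dataclass
from typing import List


@dataclass
class QAPair:
    question: str
    choices: List[str]
    answer_idx: int  # index of the keyed (correct) choice


def passes_context_independence(qa: QAPair, blind_answer_idx: int) -> bool:
    # Reject the question if a model answering blind (no video) gets it right.
    return blind_answer_idx != qa.answer_idx


def passes_contextual_veracity(qa: QAPair, scene_description: str) -> bool:
    # Toy stand-in for an LLM judge: keep the question only if the keyed
    # answer is supported by (here: appears in) the scene description.
    return qa.choices[qa.answer_idx].lower() in scene_description.lower()


def filter_qa(pairs: List[QAPair], blind_answers: List[int], scene: str) -> List[QAPair]:
    # Keep only QA pairs that survive both filtering stages.
    return [
        qa
        for qa, blind in zip(pairs, blind_answers)
        if passes_context_independence(qa, blind)
        and passes_contextual_veracity(qa, scene)
    ]


# Hypothetical example data (not from the benchmark):
scene = "Anna quietly hides the letter in the desk drawer before Ben enters."
pairs = [
    QAPair("Who hides the letter?", ["Anna", "Ben"], 0),
    QAPair("What is 2 + 2?", ["4", "5"], 0),  # answerable without the video
]
blind_answers = [1, 0]  # a text-only model's guess for each question
kept = filter_qa(pairs, blind_answers, scene)
```

In this toy run only the first question survives: the second is answered correctly without any video context, so Context-Independence filtering removes it.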
Related papers
- MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark [32.452556002879255]
MovieRecapsQA is an open-ended VideoQA benchmark that supplies explicit textual context of the input. Our benchmark provides videos of multiple lengths (i.e., recap-segments, movie-segments) and categorizations of questions (by modality and type) to enable fine-grained analysis.
arXiv Detail & Related papers (2026-01-05T20:17:25Z) - ImplicitQA: Going beyond frames towards Implicit Video Reasoning [39.63171940350552]
ImplicitQA is a novel benchmark designed to test VideoQA models on human-like implicit reasoning. ImplicitQA comprises 1K meticulously annotated QA pairs drawn from 1K high-quality creative video clips.
arXiv Detail & Related papers (2025-06-26T19:53:54Z) - VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos [89.39873803375498]
VideoMathQA is a benchmark designed to evaluate whether models can perform temporally extended cross-modal reasoning on videos. The benchmark spans 10 diverse mathematical domains, covering videos ranging from 10 seconds to over 1 hour. It requires models to interpret structured visual content, understand instructional narratives, and jointly ground concepts across visual, audio, and textual modalities.
arXiv Detail & Related papers (2025-06-05T17:59:58Z) - TextVidBench: A Benchmark for Long Video Scene Text Understanding [60.94150574231576]
We introduce TextVidBench, the first benchmark specifically designed for long-video text question answering (>3 minutes). TextVidBench makes three key contributions: it spans 9 categories (e.g., news, sports, gaming), with an average video length of 2306 seconds, enabling more realistic evaluation of long-video understanding. We also propose an efficient paradigm for improving large models through: (i) introducing the IT-Rope mechanism and temporal prompt engineering to enhance temporal perception, (ii) adopting non-uniform positional encoding to better handle long video sequences, and (iii) applying lightweight fine-tuning on
arXiv Detail & Related papers (2025-06-05T12:54:56Z) - SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models [93.73583158211115]
Achieving fine-grained temporal understanding in videos remains a major challenge for current Video Large Multimodal Models (Video LMMs). We contribute in three core aspects: dataset, model, and benchmark. First, we introduce SAMA-239K, a large-scale dataset comprising 15K videos, built specifically to enable joint learning of video understanding, grounding, and multi-turn video chat. Second, we propose the SAMA model, which incorporates a versatile temporal context aggregator and a Segment Anything Model to jointly enhance fine-grained video comprehension and precise grounding capabilities.
arXiv Detail & Related papers (2025-05-24T18:13:16Z) - Movie101v2: Improved Movie Narration Benchmark [53.54176725112229]
Automatic movie narration aims to generate video-aligned plot descriptions to assist visually impaired audiences.
We introduce Movie101v2, a large-scale, bilingual dataset with enhanced data quality specifically designed for movie narration.
Based on our new benchmark, we baseline a range of large vision-language models, including GPT-4V, and conduct an in-depth analysis of the challenges in narration generation.
arXiv Detail & Related papers (2024-04-20T13:15:27Z) - LvBench: A Benchmark for Long-form Video Understanding with Versatile Multi-modal Question Answering [49.68215536040896]
LvBench is a long-form video understanding benchmark for versatile multi-modal question answering. We consider videos ranging from 70 seconds to 4 hours, covering single-scene, multi-scene, and full-scene contexts. Our dataset comprises 20,061 question-answer pairs sourced from 100 carefully selected movies.
arXiv Detail & Related papers (2023-12-08T03:33:38Z) - Fill-in-the-blank as a Challenging Video Understanding Evaluation Framework [19.031957183047048]
We introduce a novel dataset consisting of 28,000 videos and fill-in-the-blank tests.
We show that both a multimodal model and a strong language model fall well short of human performance.
arXiv Detail & Related papers (2021-04-09T04:00:10Z)