ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models
- URL: http://arxiv.org/abs/2506.21356v2
- Date: Fri, 27 Jun 2025 05:10:18 GMT
- Title: ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models
- Authors: Hongbo Liu, Jingwen He, Yi Jin, Dian Zheng, Yuhao Dong, Fan Zhang, Ziqi Huang, Yinan He, Yangguang Li, Weichao Chen, Yu Qiao, Wanli Ouyang, Shengjie Zhao, Ziwei Liu
- Abstract summary: We introduce ShotBench, a benchmark specifically designed for cinematic language understanding. It features over 3.5k expert-annotated QA pairs from images and video clips, meticulously curated from over 200 acclaimed (predominantly Oscar-nominated) films. Our evaluation of 24 leading Vision-Language Models on ShotBench reveals substantial limitations: models particularly struggle with fine-grained visual cues and complex spatial reasoning.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cinematography, the fundamental visual language of film, is essential for conveying narrative, emotion, and aesthetic quality. While recent Vision-Language Models (VLMs) demonstrate strong general visual understanding, their proficiency in comprehending the nuanced cinematic grammar embedded within individual shots remains largely unexplored and lacks robust evaluation. This critical gap limits both fine-grained visual comprehension and the precision of AI-assisted video generation. To address this, we introduce ShotBench, a comprehensive benchmark specifically designed for cinematic language understanding. It features over 3.5k expert-annotated QA pairs from images and video clips, meticulously curated from over 200 acclaimed (predominantly Oscar-nominated) films and spanning eight key cinematography dimensions. Our evaluation of 24 leading VLMs on ShotBench reveals their substantial limitations: even the top-performing model achieves less than 60% average accuracy, particularly struggling with fine-grained visual cues and complex spatial reasoning. To catalyze advancement in this domain, we construct ShotQA, a large-scale multimodal dataset comprising approximately 70k cinematic QA pairs. Leveraging ShotQA, we develop ShotVL through supervised fine-tuning and Group Relative Policy Optimization. ShotVL significantly outperforms all existing open-source and proprietary models on ShotBench, establishing new state-of-the-art performance. We open-source our models, data, and code to foster rapid progress in this crucial area of AI-driven cinematic understanding and generation.
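The abstract names supervised fine-tuning followed by Group Relative Policy Optimization (GRPO) as the recipe behind ShotVL but gives no implementation details. Below is a minimal sketch of GRPO's group-relative advantage step applied to one multiple-choice cinematography question; the reward rule, group size, and function names are illustrative assumptions, not the authors' code.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """GRPO credits each sampled answer relative to the other answers drawn
    for the same prompt: advantage_i = (reward_i - mean) / (std + eps)."""
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards)
    return [(r - mean_r) / (std_r + eps) for r in rewards]

# Assumed reward rule: 1.0 if a sampled option letter matches the annotated
# answer, 0.0 otherwise (four samples drawn for one question).
gold = "B"
sampled = ["B", "C", "B", "A"]
rewards = [1.0 if ans == gold else 0.0 for ans in sampled]
print(grpo_advantages(rewards))  # correct samples get positive advantage
```

These advantages would then weight the policy-gradient update on the sampled answer tokens; the full clipped objective and KL term are omitted here.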
Related papers
- VideoMolmo: Spatio-Temporal Grounding Meets Pointing [73.25506085339252]
VideoMolmo is a model tailored for fine-grained pointing in video sequences. A novel temporal mask fusion pipeline employs SAM2 for bidirectional point propagation. To evaluate the generalization of VideoMolmo, we introduce VPoMolS-temporal, a challenging out-of-distribution benchmark spanning five real-world scenarios.
arXiv Detail & Related papers (2025-06-05T17:59:29Z)
- VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation [23.701884816475403]
Video captions play a crucial role in text-to-video generation tasks. Existing benchmarks inadequately address fine-grained evaluation. We introduce the Fine-grained Video Caption Evaluation Benchmark (VCapsBench).
arXiv Detail & Related papers (2025-05-29T14:34:25Z)
- CineTechBench: A Benchmark for Cinematographic Technique Understanding and Generation [22.88243961225531]
CineTechBench is a benchmark founded on precise, manual annotation by seasoned cinematography experts. Our benchmark covers seven essential aspects: shot scale, shot angle, composition, camera movement, lighting, color, and focal length. For the generation task, we assess advanced video generation models on their capacity to reconstruct cinema-quality camera movements.
arXiv Detail & Related papers (2025-05-21T06:02:39Z)
- VideoMultiAgents: A Multi-Agent Framework for Video Question Answering [11.514596823413736]
Video Question Answering (VQA) inherently relies on multimodal reasoning. We introduce VideoMultiAgents, a framework that integrates specialized agents for vision, scene graph analysis, and text processing. Our approach is also supplemented with question-guided caption generation, which produces captions that highlight objects, actions, and temporal transitions.
arXiv Detail & Related papers (2025-04-25T22:08:09Z)
- Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing [30.94114120434789]
We propose KVTP (Keyframe-oriented Vision Token Pruning), a novel framework that overcomes the drawbacks of existing token pruning and selection methods. KVTP effectively retains essential contextual information while significantly reducing redundant computation.
arXiv Detail & Related papers (2025-03-13T17:47:52Z)
- FaVChat: Unlocking Fine-Grained Facial Video Understanding with Multimodal Large Language Models [12.029771909598647]
FaVChat is the first VMLLM specifically designed for fine-grained facial video understanding. We construct a large-scale facial video dataset comprising over 60k videos, with the majority annotated with 83 fine-grained facial attributes. We employ a progressive training paradigm, transitioning from video summarization to a high-quality subset of video QA, gradually increasing task complexity to enhance the model's fine-grained visual perception.
arXiv Detail & Related papers (2025-03-12T08:33:46Z)
- Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering [71.62961521518731]
HeurVidQA is a framework that leverages domain-specific entity-actions to refine pre-trained video-language foundation models.
Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward precise cues that enhance reasoning.
arXiv Detail & Related papers (2024-10-12T06:22:23Z)
- CLIPVQA: Video Quality Assessment via CLIP [56.94085651315878]
We propose an efficient CLIP-based Transformer method for the VQA problem (CLIPVQA).
The proposed CLIPVQA achieves new state-of-the-art VQA performance and up to 37% better generalizability than existing benchmark VQA methods.
arXiv Detail & Related papers (2024-07-06T02:32:28Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings (see the sketch after this entry).
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
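VaQuitA's data-level idea above, ranking frames by CLIP score instead of sampling uniformly, can be sketched as follows. This is a rough illustration under assumptions: the checkpoint name, the use of the Hugging Face transformers CLIP classes, and the top-k selection rule are mine, not the paper's pipeline.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_frames_by_clip_score(frames, query, k=8):
    """Score each frame (PIL.Image) against a text query with CLIP and keep
    the top-k frames, returned in their original temporal order."""
    inputs = processor(text=[query], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(-1)  # one score per frame
    top = torch.topk(scores, k=min(k, len(frames))).indices.tolist()
    return [frames[i] for i in sorted(top)]
```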
- MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [63.14000659130736]
We introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench.
We first introduce a novel static-to-dynamic method to define these temporal-related tasks.
Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task (a minimal sketch of this conversion follows the entry).
arXiv Detail & Related papers (2023-11-28T17:59:04Z)
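The MVBench entry above mentions automatically converting public video annotations into multiple-choice QA. A minimal sketch of such a conversion is given below; the field names, option count, and distractor-sampling rule are assumptions for illustration, not the benchmark's actual procedure.

```python
import random

def annotation_to_mcq(video_id, question, correct_label, label_pool, num_options=4, seed=0):
    """Pair a ground-truth annotation with distractor labels drawn from the
    same label pool to form a multiple-choice QA item."""
    rng = random.Random(seed)
    distractors = rng.sample([l for l in label_pool if l != correct_label], num_options - 1)
    options = distractors + [correct_label]
    rng.shuffle(options)
    return {
        "video": video_id,
        "question": question,
        "options": options,
        "answer": chr(ord("A") + options.index(correct_label)),
    }

# Hypothetical example: an action label becomes a four-option question.
item = annotation_to_mcq(
    "vid_0001",
    "What action is being performed in the video?",
    "opening a door",
    ["opening a door", "closing a window", "pouring water", "tying shoelaces"],
)
print(item)
```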
- Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought [62.619076257298204]
We motivate framing video reasoning as the sequential understanding of a small number of video keyframes.
We introduce VIP, an inference-time challenge dataset designed to explore models' reasoning capabilities through video chain-of-thought.
We benchmark GPT-4, GPT-3, and VICUNA on VIP, demonstrate the performance gap in complex video reasoning tasks, and encourage future work.
arXiv Detail & Related papers (2023-05-23T10:26:42Z)
- Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models [149.1331903899298]
We propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge.
We present a Temporal Concept Spotting mechanism that uses Text-to-Video expertise to capture temporal saliency in a parameter-free manner (see the sketch after this entry).
Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model.
arXiv Detail & Related papers (2022-12-31T11:36:53Z)
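The Temporal Concept Spotting mechanism in the BIKE entry above weights frames by text-to-video relevance without extra parameters. A rough sketch of one such parameter-free weighting is shown below; the temperature, feature shapes, and pooling choice are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def temporal_saliency_pooling(frame_feats, text_feat, temperature=0.07):
    """Score each frame by cosine similarity to a text embedding, softmax the
    scores over time, and return the saliency-weighted video feature.

    frame_feats: (T, D) per-frame embeddings; text_feat: (D,) text embedding.
    """
    frame_feats = F.normalize(frame_feats, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    sims = frame_feats @ text_feat                      # (T,) frame-text similarity
    weights = torch.softmax(sims / temperature, dim=0)  # temporal saliency
    video_feat = (weights.unsqueeze(-1) * frame_feats).sum(dim=0)
    return video_feat, weights

# Toy usage with random tensors standing in for CLIP features.
video_feat, weights = temporal_saliency_pooling(torch.randn(8, 512), torch.randn(512))
```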