MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie
Understanding
- URL: http://arxiv.org/abs/2312.04817v1
- Date: Fri, 8 Dec 2023 03:33:38 GMT
- Title: MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie
Understanding
- Authors: Hongjie Zhang, Yi Liu, Lu Dong, Yifei Huang, Zhen-Hua Ling, Yali Wang,
Limin Wang, Yu Qiao
- Abstract summary: We introduce MoVQA, a long-form movie question-answering dataset.
We also provide a benchmark to assess the diverse cognitive capabilities of multimodal systems.
- Score: 69.04413943858584
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While several long-form VideoQA datasets have been introduced, the
lengths of both the videos used to curate questions and the sub-clips of clues
leveraged to answer those questions have not yet reached the criteria for
genuine long-form video understanding. Moreover, their QAs are unduly narrow
and modality-biased, lacking a wider view of long-term video content with rich
dynamics and complex narratives. To remedy this, we introduce MoVQA, a
long-form movie question-answering dataset and benchmark that assesses the
diverse cognitive capabilities of multimodal systems across multi-level
temporal lengths, considering both video length and clue length. Additionally,
to take a step towards human-level understanding of long-form video, versatile
and multimodal question-answering is designed from the moviegoer's perspective
to assess model capabilities on various perceptual and cognitive axes.
Analysis of various baselines reveals a consistent trend: the performance of
all methods deteriorates significantly with increasing video and clue length.
Meanwhile, our established baseline method shows some improvement, but there
is still ample scope for enhancement on our challenging MoVQA dataset. We
expect MoVQA to provide a new perspective and to inspire further work on
long-form video understanding research.
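To make the multi-level temporal evaluation concrete, here is a minimal sketch of length-stratified scoring for a multiple-choice VideoQA benchmark: predictions are grouped into video-length and clue-length buckets and accuracy is reported per bucket. The record fields and bucket edges below are assumptions for illustration, not the actual MoVQA schema or evaluation code.

```python
from collections import defaultdict

def length_bucket(seconds, edges=(60, 600, 3600)):
    """Map a duration in seconds to a coarse bucket label (illustrative edges)."""
    labels = ["<1min", "1-10min", "10-60min", ">60min"]
    for label, edge in zip(labels, edges):
        if seconds < edge:
            return label
    return labels[-1]

def stratified_accuracy(records):
    """records: iterable of dicts with 'video_len', 'clue_len', 'prediction',
    and 'answer' keys (a hypothetical schema, not MoVQA's)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        key = (length_bucket(r["video_len"]), length_bucket(r["clue_len"]))
        total[key] += 1
        correct[key] += int(r["prediction"] == r["answer"])
    return {k: correct[k] / total[k] for k in total}

# Toy example: two QA items at very different video/clue lengths.
demo = [
    {"video_len": 45, "clue_len": 20, "prediction": "B", "answer": "B"},
    {"video_len": 5400, "clue_len": 900, "prediction": "A", "answer": "C"},
]
print(stratified_accuracy(demo))
```

Reporting accuracy per (video length, clue length) bucket in this way makes the trend described above, scores degrading as both lengths grow, directly visible.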
Related papers
- Open-Ended and Knowledge-Intensive Video Question Answering [20.256081440725353]
We investigate knowledge-intensive video question answering (KI-VideoQA) through the lens of multi-modal retrieval-augmented generation.
Our analysis examines various retrieval augmentation approaches using cutting-edge retrieval and vision language models.
We achieve a substantial 17.5% improvement in accuracy on multiple choice questions in the KnowIT VQA dataset.
arXiv Detail & Related papers (2025-02-17T12:40:35Z)
- HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding [52.696422425058245]
We build a large-scale hour-long long video benchmark, HLV-1K, designed to evaluate long video understanding models.
HLV-1K comprises 1009 hour-long videos with 14,847 high-quality question answering (QA) and multi-choice question answering (MCQA) pairs.
We evaluate our benchmark using existing state-of-the-art methods and demonstrate its value for testing deep long video understanding capabilities at different levels and for various tasks.
arXiv Detail & Related papers (2025-01-03T05:32:37Z)
- Perceive, Query & Reason: Enhancing Video QA with Question-Guided Temporal Queries [50.47265863322891]
Video Question Answering (Video QA) is a challenging video understanding task that requires models to comprehend entire videos.
Recent advancements in Multimodal Large Language Models (MLLMs) have transformed video QA by leveraging their exceptional commonsense reasoning capabilities.
We propose T-Former, a novel temporal modeling method that creates a question-guided temporal bridge between frame-wise visual perception and the reasoning capabilities of LLMs.
arXiv Detail & Related papers (2024-12-26T17:53:14Z)
- SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis [52.050036778325094]
We introduce SALOVA: Segment-Augmented Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content.
We present a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich context.
Our framework mitigates the limitations of current video-LMMs by allowing for precise identification and retrieval of relevant video segments in response to queries.
arXiv Detail & Related papers (2024-11-25T08:04:47Z)
- LVBench: An Extreme Long Video Understanding Benchmark [38.839913137854104]
We introduce LVBench, a benchmark specifically designed for long video understanding.
Our dataset comprises publicly sourced videos and encompasses a diverse set of tasks aimed at long video comprehension and information extraction.
arXiv Detail & Related papers (2024-06-12T09:36:52Z)
- CinePile: A Long Video Question Answering Dataset and Benchmark [55.30860239555001]
We present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding.
Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects.
We fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset.
arXiv Detail & Related papers (2024-05-14T17:59:02Z)
- MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
arXiv Detail & Related papers (2022-12-19T15:05:40Z)
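For readers unfamiliar with the cascaded selection idea mentioned in the MIST entry above, the sketch below shows one simplified way to implement question-conditioned segment selection followed by finer-grained selection and attention pooling over pre-extracted frame features. It is an illustrative approximation under assumed tensor shapes and scoring functions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def select_topk(features, question, k):
    """Score candidate features against the question embedding and keep the
    top-k. features: (N, D), question: (D,). Illustrative scoring only."""
    scores = features @ question                      # (N,) question-conditioned relevance
    idx = scores.topk(min(k, features.size(0))).indices
    return features[idx]

def cascaded_selection(frame_feats, question, seg_len=16, top_segments=4, top_regions=8):
    """Sketch of cascaded selection: pick question-relevant segments first,
    then finer-grained features inside them, then attend over the survivors.
    frame_feats: (T, D) frame-level features; question: (D,)."""
    T, D = frame_feats.shape
    segments = frame_feats[: T - T % seg_len].reshape(-1, seg_len, D)  # (S, seg_len, D)
    seg_feats = segments.mean(dim=1)                                   # (S, D) segment summaries

    keep = (seg_feats @ question).topk(min(top_segments, seg_feats.size(0))).indices
    kept_frames = segments[keep].reshape(-1, D)                        # frames in selected segments

    regions = select_topk(kept_frames, question, top_regions)          # finer-grained selection
    attn = F.softmax(regions @ question, dim=0)                        # attend over selected features
    return (attn.unsqueeze(-1) * regions).sum(dim=0)                   # (D,) pooled video evidence

# Example with random features: 128 frames, 256-dim embeddings.
pooled = cascaded_selection(torch.randn(128, 256), torch.randn(256))
print(pooled.shape)  # torch.Size([256])
```

Selecting a small number of segments before attending keeps the attention cost proportional to the kept features rather than the full video length, which is the motivation for decomposing dense spatial-temporal attention in long-form VideoQA.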