MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie
Understanding
- URL: http://arxiv.org/abs/2312.04817v1
- Date: Fri, 8 Dec 2023 03:33:38 GMT
- Title: MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie
Understanding
- Authors: Hongjie Zhang, Yi Liu, Lu Dong, Yifei Huang, Zhen-Hua Ling, Yali Wang,
Limin Wang, Yu Qiao
- Abstract summary: We introduce MoVQA, a long-form movie question-answering dataset.
We also provide a benchmark to assess the diverse cognitive capabilities of multimodal systems.
- Score: 69.04413943858584
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While several long-form VideoQA datasets have been introduced, the
lengths of both the videos used to curate questions and the sub-clips of clues
leveraged to answer those questions have not yet reached the criteria for
genuine long-form video understanding. Moreover, their QAs are unduly narrow
and modality-biased, lacking a wider view of understanding long-term video
content with rich dynamics and complex narratives. To remedy this, we introduce
MoVQA, a long-form movie question-answering dataset and benchmark to assess the
diverse cognitive capabilities of multimodal systems across multi-level
temporal lengths, considering both video length and clue length. Additionally,
to take a step towards human-level understanding of long-form video, versatile
and multimodal question-answering is designed from the moviegoer's perspective
to assess model capabilities on various perceptual and cognitive axes. Analysis
involving various baselines reveals a consistent trend: the performance of all
methods deteriorates significantly with increasing video and clue length.
Meanwhile, our established baseline method shows some improvements, but there
is still ample scope for enhancement on our challenging MoVQA dataset. We
expect MoVQA to provide a new perspective and encourage inspiring work on
long-form video understanding research.
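To make the multi-level evaluation concrete, here is a minimal sketch of scoring a multiple-choice model with accuracy bucketed by video length and clue length. The item fields (video_len_level, clue_len_level, options, answer) and the predict callable are hypothetical illustrations, not the official MoVQA schema or toolkit.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

def accuracy_by_length(
    items: Iterable[dict],
    predict: Callable[[dict], int],  # returns the index of the chosen option
) -> Dict[Tuple[str, str], float]:
    """Bucket multiple-choice accuracy by (video length, clue length) level.

    Hypothetical item schema: {"video_len_level": "multi-scene",
    "clue_len_level": "long", "question": ..., "options": [...], "answer": 2}.
    The real MoVQA field names and level labels may differ.
    """
    correct: Dict[Tuple[str, str], int] = defaultdict(int)
    total: Dict[Tuple[str, str], int] = defaultdict(int)
    for item in items:
        bucket = (item["video_len_level"], item["clue_len_level"])
        total[bucket] += 1
        if predict(item) == item["answer"]:
            correct[bucket] += 1
    return {b: correct[b] / total[b] for b in total}
```

Reporting accuracy per (video length, clue length) bucket, rather than a single number, is what exposes the trend the abstract describes: performance dropping as both lengths grow.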
Related papers
- SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis [52.050036778325094]
We introduce SALOVA: Segment-Augmented Long Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content.
We present a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich context.
Our framework mitigates the limitations of current video-LMMs by allowing for precise identification and retrieval of relevant video segments in response to queries.
arXiv Detail & Related papers (2024-11-25T08:04:47Z)
- Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs [76.15356325947731]
We introduce Q-Bench-Video, a new benchmark specifically designed to evaluate LMMs' proficiency in discerning video quality.
We collect a total of 2,378 question-answer pairs and test them on 12 open-source & 5 proprietary LMMs.
Our findings indicate that while LMMs have a foundational understanding of video quality, their performance remains incomplete and imprecise, with a notable discrepancy compared to human performance.
arXiv Detail & Related papers (2024-09-30T08:05:00Z)
- LVBench: An Extreme Long Video Understanding Benchmark [38.839913137854104]
We introduce LVBench, a benchmark specifically designed for long video understanding.
Our dataset comprises publicly sourced videos and encompasses a diverse set of tasks aimed at long video comprehension and information extraction.
arXiv Detail & Related papers (2024-06-12T09:36:52Z)
- CinePile: A Long Video Question Answering Dataset and Benchmark [55.30860239555001]
We present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding.
Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects.
We fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset.
arXiv Detail & Related papers (2024-05-14T17:59:02Z)
- Query-aware Long Video Localization and Relation Discrimination for Deep Video Understanding [15.697251303126874]
The Deep Video Understanding (DVU) Challenge aims to push the boundaries of multimodal extraction, fusion, and analytics.
This paper introduces a query-aware method for long video localization and relation discrimination, leveraging an image-language pretrained model.
Our approach achieved first and fourth positions for two groups of movie-level queries.
arXiv Detail & Related papers (2023-10-19T13:26:02Z)
- MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module (a conceptual sketch of this cascade appears after this list).
arXiv Detail & Related papers (2022-12-19T15:05:40Z)
- Video Question Answering: Datasets, Algorithms and Challenges [99.9179674610955]
Video Question Answering (VideoQA) aims to answer natural language questions about given videos.
This paper provides a clear taxonomy and comprehensive analysis of VideoQA, focusing on the datasets, algorithms, and unique challenges.
arXiv Detail & Related papers (2022-03-02T16:34:09Z)
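For the MIST entry above, the following is a conceptual sketch (not the authors' implementation) of cascaded segment-then-region selection: question-conditioned scores pick the top-k segments, then the top-k regions inside them, and a final attention step aggregates only the surviving tokens. The shapes, scoring functions, and use of plain PyTorch are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cascaded_select_attend(patch_feats, q_feat, k_seg=4, k_reg=12):
    """Conceptual cascade: score segments against the question, keep the
    top-k segments, keep the top-k regions inside them, then attend.

    patch_feats: (T, N, D) region features for T temporal segments.
    q_feat: (D,) question embedding. Scoring via dot products is an
    illustrative stand-in for the learned selection modules.
    """
    T, N, D = patch_feats.shape
    # Segment scores: question vs. mean-pooled segment features.
    seg_scores = patch_feats.mean(dim=1) @ q_feat            # (T,)
    top_seg = seg_scores.topk(min(k_seg, T)).indices          # (k_seg,)
    selected = patch_feats[top_seg]                            # (k_seg, N, D)
    # Region scores inside the selected segments only.
    reg_scores = selected @ q_feat                             # (k_seg, N)
    flat = selected.reshape(-1, D)                             # (k_seg*N, D)
    top_reg = reg_scores.reshape(-1).topk(
        min(k_reg, flat.shape[0])).indices
    tokens = flat[top_reg]                                     # (k_reg, D)
    # Final attention of the question over the surviving tokens,
    # instead of dense attention over all T*N spatial-temporal tokens.
    attn = F.softmax(tokens @ q_feat / D ** 0.5, dim=0)        # (k_reg,)
    return attn @ tokens                                        # (D,)
```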