MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie
Understanding
- URL: http://arxiv.org/abs/2312.04817v1
- Date: Fri, 8 Dec 2023 03:33:38 GMT
- Title: MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie
Understanding
- Authors: Hongjie Zhang, Yi Liu, Lu Dong, Yifei Huang, Zhen-Hua Ling, Yali Wang,
Limin Wang, Yu Qiao
- Abstract summary: We introduce MoVQA, a long-form movie question-answering dataset.
We also benchmark to assess the diverse cognitive capabilities of multimodal systems.
- Score: 69.04413943858584
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While several long-form VideoQA datasets have been introduced, the length of
both the videos used to curate questions and the sub-clips of clues leveraged to answer
those questions has not yet reached the criteria for genuine long-form video
understanding. Moreover, their QAs are unduly narrow and modality-biased,
lacking a wider view of understanding long-term video content with rich
dynamics and complex narratives. To remedy this, we introduce MoVQA, a
long-form movie question-answering dataset and benchmark that assesses the diverse
cognitive capabilities of multimodal systems across multi-level temporal
lengths, considering both video length and clue length. Additionally, to
take a step towards human-level understanding of long-form video, versatile and
multimodal question-answering is designed from the moviegoer's perspective to
assess model capabilities on various perceptual and cognitive axes. Analysis
involving various baselines reveals a consistent trend: the performance of all
methods deteriorates significantly as video and clue length increase. Meanwhile,
our established baseline method shows some improvement, but there is still ample
scope for enhancement on our challenging MoVQA dataset. We expect MoVQA to
provide a new perspective and inspire further work on long-form video
understanding.
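Since MoVQA stratifies questions by multi-level temporal lengths, results are naturally reported per video-length and clue-length level. Below is a minimal sketch (not the official MoVQA evaluation code) of how per-level multiple-choice accuracy could be computed from a prediction file; the JSON field names video_len_level, clue_len_level, answer, and prediction are illustrative assumptions, not fields documented in the paper.

```python
import json
from collections import defaultdict

def accuracy_by_temporal_level(pred_file: str):
    """Group QA items by (video-length level, clue-length level) and
    compute multiple-choice accuracy within each bucket."""
    with open(pred_file) as f:
        items = json.load(f)  # assumed: a list of per-question records

    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        bucket = (item["video_len_level"], item["clue_len_level"])
        total[bucket] += 1
        if item["prediction"] == item["answer"]:  # exact option match, e.g. "B"
            correct[bucket] += 1

    return {bucket: correct[bucket] / total[bucket] for bucket in total}

if __name__ == "__main__":
    # Example usage with a hypothetical prediction file.
    for (video_level, clue_level), acc in sorted(
        accuracy_by_temporal_level("movqa_predictions.json").items()
    ):
        print(f"video level {video_level} / clue level {clue_level}: acc = {acc:.3f}")
```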
Related papers
- Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs [76.15356325947731]
We introduce Q-Bench-Video, a new benchmark specifically designed to evaluate LMMs' proficiency in discerning video quality.
We collect a total of 2,378 question-answer pairs and test them on 12 open-source & 5 proprietary LMMs.
Our findings indicate that while LMMs have a foundational understanding of video quality, their performance remains incomplete and imprecise, with a notable discrepancy compared to human performance.
arXiv Detail & Related papers (2024-09-30T08:05:00Z)
- Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos [35.974750867072345]
This paper considers the problem of Multi-Hop Video Question Answering (MH-VidQA) in long-form egocentric videos.
We develop an automated pipeline to create multi-hop question-answering pairs with associated temporal evidence.
We then propose a novel architecture, termed Grounding Scattered Evidence with Large Language Model (GeLM), that enhances multi-modal large language models.
arXiv Detail & Related papers (2024-08-26T17:58:47Z)
- VideoQA in the Era of LLMs: An Empirical Study [108.37456450182054]
Video Large Language Models (Video-LLMs) are flourishing and have advanced many video-intuitive tasks.
This work conducts a timely and comprehensive study of Video-LLMs' behavior in VideoQA.
Our analyses demonstrate that Video-LLMs excel in VideoQA; they can correlate contextual cues and generate plausible responses to questions about varied video contents.
However, models falter in handling video temporality, both in reasoning about temporal content ordering and grounding QA-relevant temporal moments.
arXiv Detail & Related papers (2024-08-08T05:14:07Z)
- LVBench: An Extreme Long Video Understanding Benchmark [38.839913137854104]
We introduce LVBench, a benchmark specifically designed for long video understanding.
Our dataset comprises publicly sourced videos and encompasses a diverse set of tasks aimed at long video comprehension and information extraction.
arXiv Detail & Related papers (2024-06-12T09:36:52Z)
- CinePile: A Long Video Question Answering Dataset and Benchmark [55.30860239555001]
We present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding.
Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects.
We fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset.
arXiv Detail & Related papers (2024-05-14T17:59:02Z)
- Query-aware Long Video Localization and Relation Discrimination for Deep Video Understanding [15.697251303126874]
The Deep Video Understanding (DVU) Challenge aims to push the boundaries of multimodal extraction, fusion, and analytics.
This paper introduces a query-aware method for long video localization and relation discrimination, leveraging an image-language pretrained model.
Our approach achieved first and fourth positions for two groups of movie-level queries.
arXiv Detail & Related papers (2023-10-19T13:26:02Z)
- MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
arXiv Detail & Related papers (2022-12-19T15:05:40Z)
- Video Question Answering: Datasets, Algorithms and Challenges [99.9179674610955]
Video Question Answering (VideoQA) aims to answer natural language questions according to the given videos.
This paper provides a clear taxonomy and comprehensive analyses of VideoQA, focusing on the datasets, algorithms, and unique challenges.
arXiv Detail & Related papers (2022-03-02T16:34:09Z)