Short Film Dataset (SFD): A Benchmark for Story-Level Video Understanding
- URL: http://arxiv.org/abs/2406.10221v1
- Date: Fri, 14 Jun 2024 17:54:54 GMT
- Title: Short Film Dataset (SFD): A Benchmark for Story-Level Video Understanding
- Authors: Ridouane Ghermi, Xi Wang, Vicky Kalogeiton, Ivan Laptev
- Abstract summary: We propose the Short Film Dataset (SFD) with 1,078 publicly available amateur movies.
Our experiments emphasize the need for long-term reasoning to solve SFD tasks.
We show that current models perform significantly worse than humans when using vision data alone.
- Score: 30.06191555110948
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in vision-language models have significantly propelled video understanding. Existing datasets and tasks, however, have notable limitations. Most datasets are confined to short videos with limited events and narrow narratives. For example, datasets with instructional and egocentric videos often document the activities of one person in a single scene. Although some movie datasets offer richer content, they are often limited to short-term tasks, lack publicly available videos, and frequently encounter data leakage given the use of movie forums and other resources in LLM training. To address the above limitations, we propose the Short Film Dataset (SFD) with 1,078 publicly available amateur movies, a wide variety of genres, and minimal data leakage issues. SFD offers long-term story-oriented video tasks in the form of multiple-choice and open-ended question answering. Our extensive experiments emphasize the need for long-term reasoning to solve SFD tasks. Notably, we find strong signals in movie transcripts that lead to on-par performance between humans and LLMs. We also show that current models perform significantly worse than humans when using vision data alone.
Related papers
- HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding [52.696422425058245]
We build HLV-1K, a large-scale benchmark of hour-long videos designed to evaluate long video understanding models.
HLV-1K comprises 1,009 hour-long videos with 14,847 high-quality question answering (QA) and multiple-choice question answering (MCQA) pairs.
We evaluate our benchmark using existing state-of-the-art methods and demonstrate its value for testing deep long video understanding capabilities at different levels and for various tasks.
arXiv Detail & Related papers (2025-01-03T05:32:37Z)
- VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling [43.485687038460895]
Long-context video modeling is critical for multimodal large language models (MLLMs).
This paper aims to address this issue from aspects of model architecture, training data, training strategy and evaluation benchmark.
We build a powerful video MLLM named VideoChat-Flash, which shows leading performance on both mainstream long and short video benchmarks.
arXiv Detail & Related papers (2024-12-31T18:01:23Z)
- Video Repurposing from User Generated Content: A Large-scale Dataset and Benchmark [5.76230561819199]
We propose Repurpose-10K, an extensive dataset comprising over 10,000 videos with more than 120,000 annotated clips.
We propose a two-stage solution to obtain annotations from real-world user-generated content.
We offer a baseline model to address this challenging task by integrating audio, visual, and caption aspects.
arXiv Detail & Related papers (2024-12-12T02:27:46Z)
- MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation [62.85764872989189]
There is no publicly available dataset tailored for the analysis, evaluation, and training of long video generation models.
We present MovieBench, a hierarchical movie-level dataset for long video generation.
The dataset will be public and continuously maintained, aiming to advance the field of long video generation.
arXiv Detail & Related papers (2024-11-22T10:25:08Z)
- LVBench: An Extreme Long Video Understanding Benchmark [38.839913137854104]
We introduce LVBench, a benchmark specifically designed for long video understanding.
Our dataset comprises publicly sourced videos and encompasses a diverse set of tasks aimed at long video comprehension and information extraction.
arXiv Detail & Related papers (2024-06-12T09:36:52Z)
- CinePile: A Long Video Question Answering Dataset and Benchmark [55.30860239555001]
We present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding.
Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects.
We fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset.
arXiv Detail & Related papers (2024-05-14T17:59:02Z)
- MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding [66.56100008577134]
This study focuses on designing an efficient and effective model for long-term video understanding.
We propose to process videos in an online manner and store past video information in a memory bank.
Our model achieves state-of-the-art performance across multiple datasets.
arXiv Detail & Related papers (2024-04-08T17:59:24Z)
- Koala: Key frame-conditioned long video-LLM [70.52369588364992]
We propose a lightweight and self-supervised long video-LLM (Koala) that adapts pretrained vLLMs to generalize to longer videos.
Our approach outperforms state-of-the-art large models by 3-6% in absolute accuracy across all tasks.
Surprisingly, we also empirically show that our approach not only helps a pretrained vLLM to understand long videos but also improves its accuracy on short-term action recognition.
arXiv Detail & Related papers (2024-04-05T18:33:04Z)
- Scaling Up Video Summarization Pretraining with Large Language Models [73.74662411006426]
We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset.
We analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them.
Our work also presents a new benchmark dataset that contains 1,200 long videos, each with high-quality summaries annotated by professionals.
arXiv Detail & Related papers (2024-04-04T11:59:06Z)
- Long Story Short: a Summarize-then-Search Method for Long Video Question Answering [23.094728230459125]
We investigate if language models can extend their zero-shot reasoning abilities to long multimodal narratives in multimedia content.
We propose Long Story Short, a framework for narrative video QA that first summarizes the narrative of the video to a short plot and then searches parts of the video relevant to the question.
Our model outperforms state-of-the-art supervised models by a large margin, highlighting the potential of zero-shot QA for long videos.
arXiv Detail & Related papers (2023-11-02T13:36:11Z)
- HierVL: Learning Hierarchical Video-Language Embeddings [108.77600799637172]
HierVL is a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations.
We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level.
Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart, as well as a long-term video representation that achieves state-of-the-art results.
arXiv Detail & Related papers (2023-01-05T21:53:19Z)
- MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions [109.84031235538002]
We present MAD (Movie Audio Descriptions), a novel benchmark that departs from the paradigm of augmenting existing video datasets with text annotations.
MAD contains over 384,000 natural language sentences grounded in over 1,200 hours of video and exhibits a significant reduction in the currently diagnosed biases for video-language grounding datasets.
arXiv Detail & Related papers (2021-12-01T11:47:09Z)
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future study of advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
- VIOLIN: A Large-Scale Dataset for Video-and-Language Inference [103.7457132841367]
We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text.
Given a video clip with subtitles aligned as premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip.
A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips.
arXiv Detail & Related papers (2020-03-25T20:39:05Z)