Related papers: EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding

EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding

URL: http://arxiv.org/abs/2308.09126v1
Date: Thu, 17 Aug 2023 17:59:59 GMT
Title: EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding
Authors: Karttikeya Mangalam, Raiymbek Akshulakov, Jitendra Malik
Abstract summary: Ego is a very long-form video question-answering dataset, spanning over 250 hours of real video data. For each question, Ego requires the correct answer to be selected between five given options based on a three-minute-long video clip. We find Ego to have intrinsic temporal lengths over 5.7x longer than the second closest dataset and 10x longer than any other video understanding dataset.
Score: 53.275916136138996
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce EgoSchema, a very long-form video question-answering dataset, and benchmark to evaluate long video understanding capabilities of modern vision and language systems. Derived from Ego4D, EgoSchema consists of over 5000 human curated multiple choice question answer pairs, spanning over 250 hours of real video data, covering a very broad range of natural human activity and behavior. For each question, EgoSchema requires the correct answer to be selected between five given options based on a three-minute-long video clip. While some prior works have proposed video datasets with long clip lengths, we posit that merely the length of the video clip does not truly capture the temporal difficulty of the video task that is being considered. To remedy this, we introduce temporal certificate sets, a general notion for capturing the intrinsic temporal understanding length associated with a broad range of video understanding tasks & datasets. Based on this metric, we find EgoSchema to have intrinsic temporal lengths over 5.7x longer than the second closest dataset and 10x to 100x longer than any other video understanding dataset. Further, our evaluation of several current state-of-the-art video and language models shows them to be severely lacking in long-term video understanding capabilities. Even models with several billions of parameters achieve QA accuracy less than 33% (random is 20%) on the EgoSchema multi-choice question answering task, while humans achieve about 76% accuracy. We posit that \name{}{}, with its long intrinsic temporal structures and diverse complexity, would serve as a valuable evaluation probe for developing effective long-term video understanding systems in the future. Data and Zero-shot model evaluation code are open-sourced for both public and commercial use under the Ego4D license at http://egoschema.github.io

Related papers

Vidi: Large Multimodal Models for Video Understanding and Editing [33.56852569192024]
We introduce Vidi, a family of Large Multimodal Models (LMMs) for a wide range of video understand editing scenarios. The first release focuses on temporal retrieval, identifying the time ranges within the input videos corresponding to a given text query. We also present the VUE-TR benchmark, which introduces five key advancements.
arXiv Detail & Related papers (2025-04-22T08:04:45Z)
Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos [51.8995932557911]
EgoTempo is a dataset designed to evaluate temporal understanding in the egocentric domain. We show that state-of-the-art Multi-Modal Large Language Models (MLLMs) on benchmarks achieve remarkably high performance using just text or a single frame as input. We hope EgoTempo will catalyze new research in the field and inspire models that better capture the complexity of temporal dynamics.
arXiv Detail & Related papers (2025-03-17T18:50:36Z)
X-LeBench: A Benchmark for Extremely Long Egocentric Video Understanding [25.85614872348223]
Long-form egocentric video understanding provides rich contextual information and insights into long-term human behaviors. Existing benchmark datasets primarily focus on single, short-duration videos or moderately long videos up to dozens of minutes. We introduce X-LeBench, a novel benchmark dataset specifically crafted for evaluating tasks on extremely long egocentric video recordings.
arXiv Detail & Related papers (2025-01-12T15:07:03Z)
HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding [52.696422425058245]
We build a large-scale hour-long long video benchmark, HLV-1K, designed to evaluate long video understanding models. HLV-1K comprises 1009 hour-long videos with 14,847 high-quality question answering (QA) and multi-choice question asnwering (MCQA) We evaluate our benchmark using existing state-of-the-art methods and demonstrate its value for testing deep long video understanding capabilities at different levels and for various tasks.
arXiv Detail & Related papers (2025-01-03T05:32:37Z)
Neptune: The Long Orbit to Benchmarking Long Video Understanding [73.96154871970062]
We introduce Neptune, a benchmark for long video understanding that requires reasoning over long time horizons and across different modalities. Our dataset covers a broad range of long video reasoning abilities and consists of a subset that emphasizes multimodal reasoning. Benchmark evaluations reveal that most current open-source long video models perform poorly on Neptune.
arXiv Detail & Related papers (2024-12-12T18:54:48Z)
HourVideo: 1-Hour Video-Language Understanding [34.90495038962066]
HourVideo is a benchmark dataset for hour-long video-language understanding. HourVideo includes 500 manually curated egocentric videos spanning durations of 20 to 120 minutes. Benchmarking results reveal that multimodal models, including GPT-4 and LLaVA-NeXT, achieve marginal improvements over random chance.
arXiv Detail & Related papers (2024-11-07T18:59:16Z)
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models [75.42002690128486]
TemporalBench is a new benchmark dedicated to evaluating fine-grained temporal understanding in videos. It consists of 10K video question-answer pairs, derived from 2K high-quality human annotations detailing the temporal dynamics in video clips. Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
MM-Ego: Towards Building Egocentric Multimodal LLMs [72.47344411599322]
This research aims to explore building a multimodal foundation model for egocentric video understanding. We develop a data engine that efficiently generates 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour long, based on human-annotated data. We contribute a challenging egocentric QA benchmark with 629 videos and 7,026 questions to evaluate the models' ability in recognizing and memorizing visual details across videos of varying lengths.
arXiv Detail & Related papers (2024-10-09T17:59:59Z)
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding [41.9477837230283]
LongVideoBench is a question-answering benchmark that features video-language interleaved inputs up to an hour long. Our benchmark includes 3,763 varying-length web-collected videos with their subtitles across diverse themes. We formulate a novel video question-answering task termed referring reasoning.
arXiv Detail & Related papers (2024-07-22T16:00:55Z)
Goldfish: Vision-Language Understanding of Arbitrarily Long Videos [51.547065479762715]
We present a methodology tailored for comprehending videos of arbitrary lengths. We also introduce the TVQA-long benchmark, designed to evaluate models' capabilities in understanding long videos with questions in both vision and text content. Our results indicate that our models have significant improvements in both long and short-video understanding.
arXiv Detail & Related papers (2024-07-17T15:59:32Z)
LVBench: An Extreme Long Video Understanding Benchmark [38.839913137854104]
We introduce LVBench, a benchmark specifically designed for long video understanding. Our dataset comprises publicly sourced videos and encompasses a diverse set of tasks aimed at long video comprehension and information extraction.
arXiv Detail & Related papers (2024-06-12T09:36:52Z)
CinePile: A Long Video Question Answering Dataset and Benchmark [55.30860239555001]
We present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding. Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects. We fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset.
arXiv Detail & Related papers (2024-05-14T17:59:02Z)
Koala: Key frame-conditioned long video-LLM [70.52369588364992]
We propose a lightweight and self-supervised long video-LLM (Koala) to adapt pretrained vLLMs for generalizing to longer videos. Our approach outperforms state-of-the-art large models by 3 - 6% in absolute accuracy across all tasks. Surprisingly, we also empirically show that our approach not only helps a pretrained vLLM to understand long videos but also improves its accuracy on short-term action recognition.
arXiv Detail & Related papers (2024-04-05T18:33:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.