Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos
- URL: http://arxiv.org/abs/2503.13646v1
- Date: Mon, 17 Mar 2025 18:50:36 GMT
- Title: Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos
- Authors: Chiara Plizzari, Alessio Tonioni, Yongqin Xian, Achin Kulshrestha, Federico Tombari
- Abstract summary: EgoTempo is a dataset designed to evaluate temporal understanding in the egocentric domain. We show that state-of-the-art Multi-Modal Large Language Models (MLLMs) achieve remarkably high performance on existing benchmarks using just text or a single frame as input. We hope EgoTempo will catalyze new research in the field and inspire models that better capture the complexity of temporal dynamics.
- Score: 51.8995932557911
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding fine-grained temporal dynamics is crucial in egocentric videos, where continuous streams capture frequent, close-up interactions with objects. In this work, we bring to light that current egocentric video question-answering datasets often include questions that can be answered using only a few frames or commonsense reasoning, without being grounded in the actual video. Our analysis shows that state-of-the-art Multi-Modal Large Language Models (MLLMs) achieve remarkably high performance on these benchmarks using just text or a single frame as input. To address these limitations, we introduce EgoTempo, a dataset specifically designed to evaluate temporal understanding in the egocentric domain. EgoTempo emphasizes tasks that require integrating information across the entire video, ensuring that models must rely on temporal patterns rather than static cues or pre-existing knowledge. Extensive experiments on EgoTempo show that current MLLMs still fall short in temporal reasoning on egocentric videos; we therefore hope EgoTempo will catalyze new research in the field and inspire models that better capture the complexity of temporal dynamics. Dataset and code are available at https://github.com/google-research-datasets/egotempo.git.
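The diagnostic described in the abstract rests on an input-modality ablation: answering each benchmark question with text only, with a single frame, or with the full video, and comparing accuracies. The sketch below illustrates that protocol in Python; the `answer_fn(question, frames)` interface and the dict-based item format are hypothetical placeholders for this illustration, not the EgoTempo evaluation code.

```python
# Minimal sketch of an input-modality ablation for video QA:
# compare accuracy when a model sees (a) the question text only,
# (b) a single middle frame, or (c) all frames.
# `answer_fn` and the item fields are assumed placeholders.
from typing import Callable, Iterable, Sequence


def accuracy(
    items: Iterable[dict],
    answer_fn: Callable[[str, Sequence], str],
    frame_selector: Callable[[Sequence], Sequence],
) -> float:
    """Fraction of items answered correctly under a given frame budget."""
    correct = total = 0
    for item in items:
        frames = frame_selector(item["frames"])  # [] / one frame / all frames
        pred = answer_fn(item["question"], frames)
        correct += int(pred.strip().lower() == item["answer"].strip().lower())
        total += 1
    return correct / max(total, 1)


def run_ablation(items, answer_fn):
    items = list(items)  # reused across the three conditions
    return {
        "text_only": accuracy(items, answer_fn, lambda f: []),
        "single_frame": accuracy(items, answer_fn, lambda f: f[len(f) // 2 : len(f) // 2 + 1]),
        "full_video": accuracy(items, answer_fn, lambda f: list(f)),
    }
```

If the text-only or single-frame accuracy approaches the full-video accuracy, the benchmark's questions are not truly grounded in temporal video content, which is the gap EgoTempo is designed to close.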
Related papers
- MINERVA: Evaluating Complex Video Reasoning [72.12644008002566]
We provide a new video reasoning dataset called MINERVA for modern multimodal models.
Our dataset is multimodal, diverse in terms of video domain and length, and consists of complex multi-step questions.
We perform fine-grained error analysis to identify common failure modes across various models, and create a taxonomy of reasoning errors.
arXiv Detail & Related papers (2025-05-01T17:41:49Z) - EgoToM: Benchmarking Theory of Mind Reasoning from Egocentric Videos [26.930652137352197]
We introduce EgoToM, a new video question-answering benchmark that extends Theory-of-Mind evaluation to egocentric domains.
Using a causal ToM model, we generate multi-choice video QA instances for the Ego4D dataset to benchmark the ability to predict a camera wearer's goals, beliefs, and next actions.
We study the performance of both humans and state-of-the-art multimodal large language models (MLLMs) on these three interconnected inference problems.
arXiv Detail & Related papers (2025-03-28T05:10:59Z) - V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning [40.18308199837137]
We introduce a Video Spatio-Temporal Reasoning (V-STaR) benchmark to address these shortcomings.
We construct a dataset to elicit the spatio-temporal reasoning process of Video-LLMs.
Experiments on 14 Video-LLMs reveal significant gaps between current Video-LLMs and the needs for robust and consistent reasoning.
arXiv Detail & Related papers (2025-03-14T15:21:44Z) - Do Language Models Understand Time? [2.290956583394892]
Large language models (LLMs) have revolutionized video-based computer vision applications, including action recognition, anomaly detection, and summarization. This work critically examines the role of LLMs in video processing, with a specific focus on their temporal reasoning capabilities. We analyze challenges posed by existing video datasets, including biases, lack of temporal annotations, and domain-specific limitations that constrain the temporal understanding of LLMs.
arXiv Detail & Related papers (2024-12-18T13:38:06Z) - TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models [75.42002690128486]
TemporalBench is a new benchmark dedicated to evaluating fine-grained temporal understanding in videos.
It consists of 10K video question-answer pairs, derived from 2K high-quality human annotations detailing the temporal dynamics in video clips.
Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench.
arXiv Detail & Related papers (2024-10-14T17:59:58Z) - MM-Ego: Towards Building Egocentric Multimodal LLMs [72.47344411599322]
This research aims to explore building a multimodal foundation model for egocentric video understanding.
We develop a data engine that efficiently generates 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour long, based on human-annotated data.
We contribute a challenging egocentric QA benchmark with 629 videos and 7,026 questions to evaluate the models' ability in recognizing and memorizing visual details across videos of varying lengths.
arXiv Detail & Related papers (2024-10-09T17:59:59Z) - Do Egocentric Video-Language Models Truly Understand Hand-Object Interactions? [48.702973928321946]
Egocentric video-language pretraining is a crucial step in advancing the understanding of hand-object interactions in first-person scenarios. Despite successes on existing testbeds, we find that current EgoVLMs can be easily misled by simple modifications. This raises the question: do EgoVLMs truly understand hand-object interactions?
arXiv Detail & Related papers (2024-05-28T00:27:29Z) - EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding [53.275916136138996]
EgoSchema is a very long-form video question-answering dataset, spanning over 250 hours of real video data.
For each question, EgoSchema requires the correct answer to be selected from five given options based on a three-minute-long video clip.
We find EgoSchema to have intrinsic temporal lengths over 5.7x longer than the second closest dataset and 10x longer than any other video understanding dataset.
arXiv Detail & Related papers (2023-08-17T17:59:59Z)