Neptune: The Long Orbit to Benchmarking Long Video Understanding
- URL: http://arxiv.org/abs/2412.09582v2
- Date: Sat, 18 Jan 2025 00:52:42 GMT
- Title: Neptune: The Long Orbit to Benchmarking Long Video Understanding
- Authors: Arsha Nagrani, Mingda Zhang, Ramin Mehran, Rachel Hornung, Nitesh Bharadwaj Gundavarapu, Nilpa Jha, Austin Myers, Xingyi Zhou, Boqing Gong, Cordelia Schmid, Mikhail Sirotenko, Yukun Zhu, Tobias Weyand
- Abstract summary: We introduce Neptune, a benchmark for long video understanding that requires reasoning over long time horizons and across different modalities. Our dataset covers a broad range of long video reasoning abilities and includes a subset that emphasizes multimodal reasoning. Benchmark evaluations reveal that most current open-source long video models perform poorly on Neptune.
- Score: 73.96154871970062
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Neptune, a benchmark for long video understanding that requires reasoning over long time horizons and across different modalities. Many existing video datasets and models are focused on short clips (10s-30s). While some long video datasets do exist, they can often be solved by powerful image models applied per frame (and often to very few frames) in a video, and they are usually manually annotated at high cost. To mitigate both of these problems, we propose a scalable dataset creation pipeline which leverages large models (VLMs and LLMs) to automatically generate dense, time-aligned video captions as well as tough question-answer decoy sets for video segments (up to 15 minutes in length). Our dataset, Neptune, covers a broad range of long video reasoning abilities and includes a subset that emphasizes multimodal reasoning. Since existing metrics for open-ended question answering are either rule-based or may rely on proprietary models, we provide a new open-source model-based metric, GEM, to score open-ended responses on Neptune. Benchmark evaluations reveal that most current open-source long video models perform poorly on Neptune, particularly on questions testing temporal ordering, counting and state changes. Through Neptune, we aim to spur the development of more advanced models capable of understanding long videos. The dataset is available at https://github.com/google-deepmind/neptune
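Below is a minimal sketch of how one might score a model's multiple-choice predictions on a Neptune-style annotation file, including a per-question-type breakdown (e.g. temporal ordering, counting, state changes). The field names (`video_id`, `question`, `options`, `answer_index`, `question_type`) and the `predict_fn` callback are illustrative assumptions; the actual release format and official evaluation code are defined in the GitHub repository above.

```python
import json
from collections import Counter

def evaluate_mcq(annotation_path: str, predict_fn) -> dict:
    """Score multiple-choice predictions on a Neptune-style QA file.

    Field names (question, options, answer_index, question_type) are
    illustrative placeholders; the real schema is defined in the official
    release at https://github.com/google-deepmind/neptune.
    """
    with open(annotation_path) as f:
        examples = json.load(f)

    correct = 0
    per_type, totals = Counter(), Counter()
    for ex in examples:
        # predict_fn is any callable mapping (video_id, question, options)
        # to the index of the chosen option, e.g. a wrapper around a VideoLLM.
        pred = predict_fn(ex["video_id"], ex["question"], ex["options"])
        qtype = ex.get("question_type", "all")
        totals[qtype] += 1
        if pred == ex["answer_index"]:
            correct += 1
            per_type[qtype] += 1

    return {
        "accuracy": correct / max(len(examples), 1),
        "per_type_accuracy": {t: per_type[t] / totals[t] for t in totals},
    }
```

The per-type breakdown is what surfaces the weaknesses the abstract highlights, such as temporal ordering and counting questions.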
Related papers
- MINERVA: Evaluating Complex Video Reasoning [72.12644008002566]
We provide a new video reasoning dataset called MINERVA for modern multimodal models.
Our dataset is multimodal, diverse in terms of video domain and length, and consists of complex multi-step questions.
We perform fine-grained error analysis to identify common failure modes across various models, and create a taxonomy of reasoning errors.
arXiv Detail & Related papers (2025-05-01T17:41:49Z)
- Lost in Time: A New Temporal Benchmark for VideoLLMs [48.71203934876828]
We show that the most widely used video-language benchmarks can be solved without requiring much temporal reasoning.
We propose TVBench, a novel open-source video multiple-choice question-answering benchmark.
arXiv Detail & Related papers (2024-10-10T09:28:36Z)
- LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding [41.9477837230283]
LongVideoBench is a question-answering benchmark that features video-language interleaved inputs up to an hour long.
Our benchmark includes 3,763 varying-length web-collected videos with their subtitles across diverse themes.
We formulate a novel video question-answering task termed referring reasoning.
arXiv Detail & Related papers (2024-07-22T16:00:55Z)
- Goldfish: Vision-Language Understanding of Arbitrarily Long Videos [51.547065479762715]
We present a methodology tailored for comprehending videos of arbitrary lengths.
We also introduce the TVQA-long benchmark, designed to evaluate models' capabilities in understanding long videos with questions in both vision and text content.
Our results indicate that our models achieve significant improvements in both long- and short-video understanding.
arXiv Detail & Related papers (2024-07-17T15:59:32Z)
- Hallucination Mitigation Prompts Long-term Video Understanding [36.26790392889717]
This paper constructs a comprehensive hallucination mitigation pipeline based on existing MLLMs.
We use the CLIP Score to guide the frame sampling process with questions, selecting key frames relevant to the question.
During the answer generation stage, we utilize chain-of-thought and in-context learning techniques to explicitly control the generation of answers.
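A minimal sketch of the question-guided key-frame selection idea described above, using CLIP similarity between the question text and each candidate frame; the model checkpoint, the top-k value, and the overall structure are assumptions rather than the paper's exact pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative model choice; the paper's exact CLIP variant is not specified here.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def select_key_frames(frames: list[Image.Image], question: str, k: int = 8) -> list[int]:
    """Return indices of the k frames most similar to the question under CLIP."""
    text_inputs = processor(text=[question], return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)
    image_inputs = processor(images=frames, return_tensors="pt")
    image_emb = model.get_image_features(**image_inputs)

    # Cosine similarity between the question and every frame.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    scores = (image_emb @ text_emb.T).squeeze(-1)

    top = torch.topk(scores, k=min(k, len(frames))).indices
    return sorted(top.tolist())  # keep temporal order for the answering model
```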
arXiv Detail & Related papers (2024-06-17T08:44:03Z)
- Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs [20.168429351519055]
Video understanding is a crucial next step for multimodal large language models (MLLMs).
We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation.
We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities.
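A toy sketch of the needle-in-a-haystack construction described above: a short synthetic "needle" clip is spliced into a long "haystack" frame sequence and its location recorded for later probing. The real VideoNIAH framework defines its own needle types and query generation, so the data structures and function below are purely illustrative.

```python
import random
from dataclasses import dataclass

import numpy as np

@dataclass
class NeedleExample:
    frames: np.ndarray            # (T, H, W, 3) uint8 video as a frame array
    needle_span: tuple[int, int]  # [start, end) indices where the needle was spliced in
    query: str                    # question probing the inserted content
    answer: str

def insert_needle(haystack: np.ndarray, needle: np.ndarray,
                  query: str, answer: str, seed: int = 0) -> NeedleExample:
    """Splice a short synthetic 'needle' clip into a long 'haystack' video."""
    assert len(needle) < len(haystack), "needle must be shorter than the haystack"
    rng = random.Random(seed)
    start = rng.randrange(0, len(haystack) - len(needle))
    # Overwrite a contiguous window so the total video length is unchanged.
    frames = np.concatenate(
        [haystack[:start], needle, haystack[start + len(needle):]], axis=0
    )
    return NeedleExample(frames, (start, start + len(needle)), query, answer)
```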
arXiv Detail & Related papers (2024-06-13T17:50:05Z)
- LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding.
We encode video representations that incorporate both local and global information.
Our model produces more precise responses for long video understanding.
arXiv Detail & Related papers (2024-04-04T11:33:29Z)
- STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering [42.173245795917026]
We propose STAIR, a Spatial-Temporal Reasoning model with Auditable Intermediate Results for video question answering.
STAIR is a neural module network, which contains a program generator to decompose a given question into a hierarchical combination of several sub-tasks.
We conduct extensive experiments on several video question answering datasets to show STAIR's performance, explainability, compatibility with pre-trained models, and applicability when program annotations are not available.
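A toy sketch of how a neural-module-network-style program might be represented and executed bottom-up, so that every sub-task leaves an auditable intermediate result; the module names and example program are hypothetical and do not reflect STAIR's actual module inventory.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ProgramNode:
    """One sub-task in a hierarchical program, e.g. emitted by a program generator."""
    module: str                       # name of the neural module to invoke
    args: dict = field(default_factory=dict)
    children: list["ProgramNode"] = field(default_factory=list)

def execute(node: ProgramNode, modules: dict[str, Callable], video) -> object:
    """Run children first, then feed their (auditable) outputs into the parent module."""
    child_outputs = [execute(c, modules, video) for c in node.children]
    return modules[node.module](video, child_outputs, **node.args)

# Illustrative program for "What did the person pick up after opening the fridge?";
# the module names are placeholders for trained sub-task modules.
program = ProgramNode(
    module="QueryObject",
    children=[
        ProgramNode(
            module="FilterActionsAfter",
            args={"anchor_action": "open fridge"},
            children=[ProgramNode(module="DetectActions")],
        )
    ],
)
```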
arXiv Detail & Related papers (2024-01-08T14:01:59Z)
- EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding [53.275916136138996]
EgoSchema is a very long-form video question-answering dataset spanning over 250 hours of real video data.
For each question, EgoSchema requires the correct answer to be selected from five given options based on a three-minute-long video clip.
We find EgoSchema to have intrinsic temporal lengths over 5.7x longer than the second closest dataset and 10x longer than any other video understanding dataset.
arXiv Detail & Related papers (2023-08-17T17:59:59Z)
- NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation [157.07019458623242]
NUWA-XL is a novel Diffusion over Diffusion architecture for eXtremely Long video generation.
Our approach adopts a "coarse-to-fine" process, in which the video can be generated in parallel at the same granularity.
Experiments show that our model not only generates high-quality long videos with both global and local coherence, but also decreases the average inference time from 7.55min to 26s.
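A minimal sketch of one level of the coarse-to-fine scheduling behind a diffusion-over-diffusion generator: a global pass produces sparse keyframes, then local passes fill each gap in parallel. The stand-in functions are placeholders, not NUWA-XL's actual diffusion models, and the real method applies this recursion over multiple levels.

```python
from concurrent.futures import ThreadPoolExecutor

def generate_keyframes(prompt: str, n: int) -> list[str]:
    # Stand-in for the global model: produce n sparse keyframes for the whole video.
    return [f"keyframe({prompt}, t={i})" for i in range(n)]

def fill_segment(first: str, last: str, length: int) -> list[str]:
    # Stand-in for a local model conditioned on two neighbouring keyframes.
    return [f"frame({first}->{last}, {j}/{length})" for j in range(1, length)]

def coarse_to_fine(prompt: str, n_key: int = 8, seg_len: int = 16) -> list[str]:
    """Generate sparse keyframes first, then fill every gap in parallel."""
    keys = generate_keyframes(prompt, n_key)
    pairs = list(zip(keys[:-1], keys[1:]))
    # All gaps share the same granularity, so they can be filled concurrently.
    with ThreadPoolExecutor() as pool:
        segments = list(pool.map(lambda p: fill_segment(p[0], p[1], seg_len), pairs))

    video = []
    for key, seg in zip(keys, segments):
        video.append(key)
        video.extend(seg)
    video.append(keys[-1])
    return video
```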
arXiv Detail & Related papers (2023-03-22T07:10:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the information presented (including all summaries) and is not responsible for any consequences arising from its use.