Neptune: The Long Orbit to Benchmarking Long Video Understanding
- URL: http://arxiv.org/abs/2412.09582v2
- Date: Sat, 18 Jan 2025 00:52:42 GMT
- Title: Neptune: The Long Orbit to Benchmarking Long Video Understanding
- Authors: Arsha Nagrani, Mingda Zhang, Ramin Mehran, Rachel Hornung, Nitesh Bharadwaj Gundavarapu, Nilpa Jha, Austin Myers, Xingyi Zhou, Boqing Gong, Cordelia Schmid, Mikhail Sirotenko, Yukun Zhu, Tobias Weyand
- Abstract summary: We introduce Neptune, a benchmark for long video understanding that requires reasoning over long time horizons and across different modalities.
Our dataset covers a broad range of long video reasoning abilities and includes a subset that emphasizes multimodal reasoning.
Benchmark evaluations reveal that most current open-source long video models perform poorly on Neptune.
- Score: 73.96154871970062
- License:
- Abstract: We introduce Neptune, a benchmark for long video understanding that requires reasoning over long time horizons and across different modalities. Many existing video datasets and models are focused on short clips (10s-30s). While some long video datasets do exist, they can often be solved by powerful image models applied per frame (and often to very few frames) in a video, and are usually manually annotated at high cost. To mitigate both of these problems, we propose a scalable dataset creation pipeline which leverages large models (VLMs and LLMs) to automatically generate dense, time-aligned video captions, as well as tough question-answer decoy sets for video segments (up to 15 minutes in length). Our dataset Neptune covers a broad range of long video reasoning abilities and includes a subset that emphasizes multimodal reasoning. Since existing metrics for open-ended question answering are either rule-based or may rely on proprietary models, we provide a new open source model-based metric GEM to score open-ended responses on Neptune. Benchmark evaluations reveal that most current open-source long video models perform poorly on Neptune, particularly on questions testing temporal ordering, counting and state changes. Through Neptune, we aim to spur the development of more advanced models capable of understanding long videos. The dataset is available at https://github.com/google-deepmind/neptune
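The abstract describes, but does not include, the question-answer-decoy generation step of the pipeline. The sketch below is a rough illustration of that step under stated assumptions: it builds an LLM prompt from dense, time-aligned captions for a video segment and parses a question, answer, and decoys from a JSON reply. The `Caption` dataclass, the prompt wording, and the externally supplied `llm` callable are illustrative assumptions, not the authors' implementation.

```python
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Caption:
    start_s: float  # segment start time in seconds
    end_s: float    # segment end time in seconds
    text: str       # dense caption for this time span


def build_qad_prompt(captions: List[Caption], num_decoys: int = 4) -> str:
    """Assemble a prompt asking an LLM for one question, its answer,
    and hard decoys, grounded in time-aligned captions (illustrative wording)."""
    timeline = "\n".join(
        f"[{c.start_s:.0f}s-{c.end_s:.0f}s] {c.text}" for c in captions
    )
    return (
        "Below are time-aligned captions for a long video segment.\n"
        f"{timeline}\n\n"
        "Write one question that requires reasoning across multiple time "
        "spans (e.g. temporal ordering, counting, or state changes), its "
        f"correct answer, and {num_decoys} plausible but wrong decoys.\n"
        'Reply as JSON: {"question": ..., "answer": ..., "decoys": [...]}'
    )


def generate_qad(captions: List[Caption], llm: Callable[[str], str]) -> dict:
    """Call an externally supplied LLM on the prompt and parse the QAD set."""
    reply = llm(build_qad_prompt(captions))
    qad = json.loads(reply)
    assert {"question", "answer", "decoys"} <= qad.keys()
    return qad
```

A full pipeline would also include the VLM-based caption-generation stage and any downstream quality checks; those are outside this sketch.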
Related papers
- CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding [43.858197893052115]
CG-Bench is a novel benchmark for clue-grounded question answering in long videos.
It features 1,219 manually curated videos categorized by a granular system with 14 primary categories, 171 secondary categories, and 638 tertiary categories.
The benchmark includes 12,129 QA pairs in three major question types: perception, reasoning, and hallucination.
arXiv Detail & Related papers (2024-12-16T18:46:45Z)
- Owl-1: Omni World Model for Consistent Long Video Generation [75.51378346789626]
We propose an Omni World modeL (Owl-1) to produce long-term coherent and comprehensive conditions for consistent long video generation.
Owl-1 achieves comparable performance with SOTA methods on VBench-I2V and VBench-Long.
arXiv Detail & Related papers (2024-12-12T18:59:01Z)
- LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding [41.9477837230283]
LongVideoBench is a question-answering benchmark that features video-language interleaved inputs up to an hour long.
Our benchmark includes 3,763 varying-length web-collected videos with their subtitles across diverse themes.
We formulate a novel video question-answering task termed referring reasoning.
arXiv Detail & Related papers (2024-07-22T16:00:55Z)
- Goldfish: Vision-Language Understanding of Arbitrarily Long Videos [51.547065479762715]
We present a methodology tailored for comprehending videos of arbitrary lengths.
We also introduce the TVQA-long benchmark, designed to evaluate models' capabilities in understanding long videos with questions in both vision and text content.
Our results indicate that our models achieve significant improvements in both long- and short-video understanding.
arXiv Detail & Related papers (2024-07-17T15:59:32Z)
- Hallucination Mitigation Prompts Long-term Video Understanding [36.26790392889717]
This paper constructs a comprehensive hallucination mitigation pipeline based on existing MLLMs.
We use CLIP scores to guide question-aware frame sampling, selecting key frames relevant to the question (a minimal sketch of this selection step follows this entry).
During the answer generation stage, we utilize chain-of-thought and in-context learning techniques to explicitly control the generation of answers.
arXiv Detail & Related papers (2024-06-17T08:44:03Z)
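As a rough illustration of the CLIP-guided, question-aware frame selection described in the entry above (not the authors' implementation), the sketch below scores pre-decoded frames against the question with an off-the-shelf CLIP model from Hugging Face `transformers` and keeps the top-k. The checkpoint name, the value of k, and the assumption that frames arrive as PIL images are illustrative choices.

```python
from typing import List, Tuple

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor


def select_key_frames(
    frames: List[Image.Image],   # pre-decoded video frames (assumed input)
    question: str,               # question guiding the sampling
    k: int = 8,
    model_name: str = "openai/clip-vit-base-patch32",
) -> List[Tuple[int, float]]:
    """Return (frame_index, clip_score) for the k frames most similar
    to the question text, restored to temporal order."""
    model = CLIPModel.from_pretrained(model_name).eval()
    processor = CLIPProcessor.from_pretrained(model_name)

    with torch.no_grad():
        inputs = processor(
            text=[question], images=frames, return_tensors="pt", padding=True
        )
        # logits_per_image has shape (num_frames, 1): image-text similarity
        scores = model(**inputs).logits_per_image.squeeze(-1)

    top = torch.topk(scores, k=min(k, len(frames)))
    # sort selected indices so the chosen frames keep their temporal order
    return sorted(zip(top.indices.tolist(), top.values.tolist()))
```

The chain-of-thought and in-context prompting mentioned in the entry operate downstream of this selection step and are not sketched here.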
- LVBench: An Extreme Long Video Understanding Benchmark [38.839913137854104]
We introduce LVBench, a benchmark specifically designed for long video understanding.
Our dataset comprises publicly sourced videos and encompasses a diverse set of tasks aimed at long video comprehension and information extraction.
arXiv Detail & Related papers (2024-06-12T09:36:52Z)
- EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding [53.275916136138996]
EgoSchema is a very long-form video question-answering dataset, spanning over 250 hours of real video data.
For each question, EgoSchema requires the correct answer to be selected from five given options based on a three-minute-long video clip.
We find EgoSchema to have intrinsic temporal lengths over 5.7x longer than the second closest dataset and 10x longer than any other video understanding dataset.
arXiv Detail & Related papers (2023-08-17T17:59:59Z)
- NVDS+: Towards Efficient and Versatile Neural Stabilizer for Video Depth Estimation [58.21817572577012]
Video depth estimation aims to infer temporally consistent depth.
We introduce NVDS+ that stabilizes inconsistent depth estimated by various single-image models in a plug-and-play manner.
We also present a large-scale Video Depth in the Wild dataset, which contains 14,203 videos with over two million frames.
arXiv Detail & Related papers (2023-07-17T17:57:01Z)
- NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation [157.07019458623242]
NUWA-XL is a novel Diffusion over Diffusion architecture for eXtremely Long video generation.
Our approach adopts a "coarse-to-fine" process, in which the video can be generated in parallel at the same granularity.
Experiments show that our model not only generates high-quality long videos with both global and local coherence, but also decreases the average inference time from 7.55min to 26s.
arXiv Detail & Related papers (2023-03-22T07:10:09Z)