LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
- URL: http://arxiv.org/abs/2407.15754v1
- Date: Mon, 22 Jul 2024 16:00:55 GMT
- Title: LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
- Authors: Haoning Wu, Dongxu Li, Bei Chen, Junnan Li
- Abstract summary: LongVideoBench is a question-answering benchmark that features video-language interleaved inputs up to an hour long.
Our benchmark includes 3,763 varying-length web-collected videos with their subtitles across diverse themes.
We formulate a novel video question-answering task termed referring reasoning.
- Score: 41.9477837230283
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large multimodal models (LMMs) are processing increasingly longer and richer inputs. Despite this progress, few public benchmarks are available to measure such development. To mitigate this gap, we introduce LongVideoBench, a question-answering benchmark that features video-language interleaved inputs up to an hour long. Our benchmark includes 3,763 varying-length web-collected videos with their subtitles across diverse themes, designed to comprehensively evaluate LMMs on long-term multimodal understanding. To achieve this, we interpret the primary challenge as the ability to accurately retrieve and reason over detailed multimodal information from long inputs. As such, we formulate a novel video question-answering task termed referring reasoning. Specifically, the question contains a referring query that references related video contexts, called the referred context. The model is then required to reason over relevant video details from the referred context. Following the paradigm of referring reasoning, we curate 6,678 human-annotated multiple-choice questions in 17 fine-grained categories, establishing one of the most comprehensive benchmarks for long-form video understanding. Evaluations suggest that LongVideoBench presents significant challenges even for the most advanced proprietary models (e.g., GPT-4o, Gemini-1.5-Pro, GPT-4-Turbo), while their open-source counterparts show an even larger performance gap. In addition, our results indicate that performance on the benchmark improves only when models are capable of processing more frames, positioning LongVideoBench as a valuable benchmark for evaluating future-generation long-context LMMs.
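To make the referring-reasoning format concrete, here is a minimal sketch of what one evaluation item and a scoring loop could look like. This is an illustrative assumption: the field names and the `model.predict` interface are hypothetical, not the benchmark's actual schema or API.

```python
from dataclasses import dataclass

@dataclass
class ReferringReasoningItem:
    """A hypothetical multiple-choice item in the referring-reasoning style:
    the question embeds a referring query that points at a span of the video
    (the referred context), and the model must reason over details from that
    span to pick the correct option."""
    video_id: str
    referring_query: str   # e.g. "the scene where the chef plates the dessert"
    question: str          # asks about details inside the referred context
    options: list[str]     # candidate answers, exactly one correct
    answer_index: int      # index of the correct option

def evaluate(model, items: list[ReferringReasoningItem]) -> float:
    """Top-1 accuracy; `model.predict` is assumed to take the video id, the
    full question text, and the options, and return a chosen option index."""
    correct = 0
    for item in items:
        full_question = f"{item.referring_query} -- {item.question}"
        pred = model.predict(item.video_id, full_question, item.options)
        correct += int(pred == item.answer_index)
    return correct / len(items)
```

The property this format enforces is that the answer cannot be derived from the question alone: the model must first locate the referred context within the long input, and only then reason over its details.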
Related papers
- ViLLa: Video Reasoning Segmentation with Large Language Model [48.75470418596875]
We propose a new video segmentation task - video reasoning segmentation.
The task is designed to output tracklets of segmentation masks given a complex input text query.
We present ViLLa: Video reasoning segmentation with a Large Language Model.
arXiv Detail & Related papers (2024-07-18T17:59:17Z)
- Goldfish: Vision-Language Understanding of Arbitrarily Long Videos [51.547065479762715]
We present a methodology tailored for comprehending videos of arbitrary lengths.
We also introduce the TVQA-long benchmark, designed to evaluate models' capabilities in understanding long videos with questions grounded in both visual and textual content.
Our results indicate that our models achieve significant improvements in both long- and short-video understanding.
arXiv Detail & Related papers (2024-07-17T15:59:32Z)
- InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding [31.147208579511247]
InfiniBench is a comprehensive benchmark for very long video understanding.
It presents: 1) the longest video duration, averaging 76.34 minutes; 2) the largest number of question-answer pairs, 108.2K; 3) diverse questions that examine nine different skills and include both multiple-choice and open-ended formats; and 4) a human-centric design, as the video sources come from movies and daily TV shows.
Our results show that even the best AI models, such as Gemini, struggle to perform well, reaching only 42.72% average accuracy and an average score of 2.71 out of 5.
arXiv Detail & Related papers (2024-06-28T12:35:01Z)
- MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding [67.56182262082729]
We introduce MMBench-Video, a quantitative benchmark to rigorously evaluate large vision-language models (LVLMs) in video understanding.
MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases.
The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy.
arXiv Detail & Related papers (2024-06-20T17:26:01Z)
- Needle In A Video Haystack: A Scalable Synthetic Framework for Benchmarking Video MLLMs [20.168429351519055]
We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation.
VideoNIAH decouples test video content from query-responses by inserting unrelated image/text 'needles' into original videos.
It generates annotations solely from these needles, ensuring diversity in video sources and a variety of query-responses (a minimal sketch of this needle-insertion idea appears after this list).
arXiv Detail & Related papers (2024-06-13T17:50:05Z)
- LVBench: An Extreme Long Video Understanding Benchmark [37.22510741049044]
We introduce LVBench, a benchmark specifically designed for long video understanding.
Our dataset comprises publicly sourced videos and encompasses a diverse set of tasks aimed at long video comprehension and information extraction.
arXiv Detail & Related papers (2024-06-12T09:36:52Z)
- LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding.
We encode video representations that incorporate both local and global information.
Our model produces more precise responses for long video understanding.
arXiv Detail & Related papers (2024-04-04T11:33:29Z)
- Understanding Long Videos in One Multimodal Language Model Pass [44.78900245769057]
Large Language Models (LLMs) are known to encode broad world knowledge.
We propose Likelihood Selection, a technique that unlocks faster inference in autoregressive LLMs (a sketch of this idea also appears after this list).
Our resulting Multimodal Video Understanding framework demonstrates state-of-the-art performance across long-video and fine-grained action recognition benchmarks.
arXiv Detail & Related papers (2024-03-25T17:59:09Z)
- MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding [69.04413943858584]
We introduce MoVQA, a long-form movie question-answering dataset.
We also provide a benchmark to assess the diverse cognitive capabilities of multimodal systems.
arXiv Detail & Related papers (2023-12-08T03:33:38Z)
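For the VideoNIAH entry above, the construction idea (decoupling test video content from query-responses by splicing unrelated "needles" into a host video and annotating only the needles) can be sketched as follows. The frame representation, sampling scheme, and QA template are all assumptions for illustration, not the paper's actual pipeline.

```python
import random

def insert_needles(frames: list, needles: list[tuple], seed: int = 0):
    """Splice unrelated 'needle' frames into a host video and emit annotations
    derived only from the needles, so host content never leaks into the
    ground truth. `needles` holds (needle_frame, description) pairs; assumes
    len(needles) <= len(frames) + 1."""
    rng = random.Random(seed)
    # Pick distinct insertion gaps in the original sequence, sorted ascending.
    gaps = sorted(rng.sample(range(len(frames) + 1), len(needles)))
    out = list(frames)
    # Insert from the last gap backwards so earlier insertions do not shift
    # the gaps still to be used; needle i ends up at index gaps[i] + i.
    for i in range(len(needles) - 1, -1, -1):
        out.insert(gaps[i], needles[i][0])
    annotations = [
        {"question": f"Where does the inserted content ({desc}) appear?",
         "answer_index": gaps[i] + i}
        for i, (_, desc) in enumerate(needles)
    ]
    return out, annotations
```

Because the answers depend only on the inserted needles and their positions, arbitrarily many source videos can be reused without re-annotation, which is what makes this style of benchmark construction scalable.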
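Similarly, for "Understanding Long Videos in One Multimodal Language Model Pass", the Likelihood Selection idea can be read as: score every multiple-choice option by the log-likelihood the autoregressive LM assigns to it in one forward pass, instead of decoding an answer token by token. The sketch below is a plausible reconstruction of likelihood-based answer selection using Hugging Face Transformers, not the paper's exact procedure.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def likelihood_select(model, tokenizer, prompt: str, options: list[str]) -> int:
    """Return the index of the option with the highest total log-likelihood
    under the model, conditioned on `prompt`. One forward pass per option
    replaces token-by-token generation, which is the speed advantage."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    scores = []
    for option in options:
        option_ids = tokenizer(option, add_special_tokens=False,
                               return_tensors="pt").input_ids
        input_ids = torch.cat([prompt_ids, option_ids], dim=1)
        with torch.no_grad():
            logits = model(input_ids).logits
        # logits[0, t] predicts the token at position t + 1.
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
        option_start = prompt_ids.shape[1]
        score = sum(log_probs[pos - 1, input_ids[0, pos]].item()
                    for pos in range(option_start, input_ids.shape[1]))
        scores.append(score)
    return max(range(len(options)), key=scores.__getitem__)

# Example usage with any causal LM checkpoint, e.g.:
#   tok = AutoTokenizer.from_pretrained("gpt2")
#   lm = AutoModelForCausalLM.from_pretrained("gpt2")
#   idx = likelihood_select(lm, tok, "Q: ... A:", [" red", " blue"])
```

Note that summing raw token log-probabilities implicitly penalizes longer options; length-normalizing the score is a common variant when option lengths differ substantially.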