Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating
Video-based Large Language Models
- URL: http://arxiv.org/abs/2311.16103v2
- Date: Tue, 28 Nov 2023 18:16:29 GMT
- Title: Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating
Video-based Large Language Models
- Authors: Munan Ning and Bin Zhu and Yujia Xie and Bin Lin and Jiaxi Cui and Lu
Yuan and Dongdong Chen and Li Yuan
- Abstract summary: Video-based large language models (Video-LLMs) have been recently introduced, targeting both fundamental improvements in perception and comprehension, and a diverse range of user inquiries.
To guide the development of such a model, the establishment of a robust and comprehensive evaluation system becomes crucial.
This paper proposes \textit{Video-Bench}, a new comprehensive benchmark along with a toolkit specifically designed for evaluating Video-LLMs.
- Score: 81.84810348214113
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-based large language models (Video-LLMs) have been recently introduced,
targeting both fundamental improvements in perception and comprehension, and a
diverse range of user inquiries. In pursuit of the ultimate goal of achieving
artificial general intelligence, a truly intelligent Video-LLM model should not
only see and understand the surroundings, but also possess human-level
commonsense, and make well-informed decisions for the users. To guide the
development of such a model, the establishment of a robust and comprehensive
evaluation system becomes crucial. To this end, this paper proposes
\textit{Video-Bench}, a new comprehensive benchmark along with a toolkit
specifically designed for evaluating Video-LLMs. The benchmark comprises 10
meticulously crafted tasks, evaluating the capabilities of Video-LLMs across
three distinct levels: Video-exclusive Understanding, Prior Knowledge-based
Question-Answering, and Comprehension and Decision-making. In addition, we
introduce an automatic toolkit tailored to process model outputs for various
tasks, facilitating the calculation of metrics and generating convenient final
scores. We evaluate 8 representative Video-LLMs using \textit{Video-Bench}. The
findings reveal that current Video-LLMs still fall considerably short of
achieving human-like comprehension and analysis of real-world videos, offering
valuable insights for future research directions. The benchmark and toolkit are
available at: \url{https://github.com/PKU-YuanGroup/Video-Bench}.
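The toolkit described in the abstract (processing raw model outputs, computing per-task metrics, and aggregating them into convenient final scores) can be pictured with the short Python sketch below. This is a minimal illustration only, assuming a multiple-choice answer format; the task names, record layout, and function names are hypothetical and are not the actual Video-Bench toolkit API (see the linked GitHub repository for the real implementation).

# Minimal sketch of answer matching and three-level score aggregation.
# Task names, file names, and functions are hypothetical illustrations,
# not the real Video-Bench toolkit interface.
import json
import re
from collections import defaultdict

# Hypothetical grouping of tasks into the paper's three evaluation levels.
LEVELS = {
    "video_exclusive_understanding": ["task_a", "task_b"],
    "prior_knowledge_qa": ["task_c", "task_d"],
    "comprehension_and_decision_making": ["task_e", "task_f"],
}

def extract_choice(model_output):
    """Naively pull the first standalone option letter (A-D) from a free-form response."""
    match = re.search(r"\b([A-D])\b", model_output.upper())
    return match.group(1) if match else None

def score_predictions(records):
    """Each record: {"task": str, "prediction": str, "answer": "A"|"B"|"C"|"D"}."""
    correct, total = defaultdict(int), defaultdict(int)
    for rec in records:
        total[rec["task"]] += 1
        if extract_choice(rec["prediction"]) == rec["answer"]:
            correct[rec["task"]] += 1

    # Average task accuracies within each level, then average the levels.
    report = {}
    for level, tasks in LEVELS.items():
        accs = [correct[t] / total[t] for t in tasks if total[t]]
        report[level] = sum(accs) / len(accs) if accs else 0.0
    report["final_score"] = sum(report[level] for level in LEVELS) / len(LEVELS)
    return report

if __name__ == "__main__":
    with open("model_outputs.json") as f:  # hypothetical dump of per-question model outputs
        print(score_predictions(json.load(f)))

The real toolkit's answer matching and weighting may differ; the sketch only illustrates the per-task metric calculation and three-level aggregation that the abstract describes.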
Related papers
- TVBench: Redesigning Video-Language Evaluation [48.71203934876828]
We show that the video-language benchmarks most commonly used today can be solved without requiring much temporal reasoning.
We propose TVBench, a novel open-source video multiple-choice question-answering benchmark.
arXiv: 2024-10-10
- MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding [67.56182262082729]
We introduce MMBench-Video, a quantitative benchmark to rigorously evaluate large vision-language models (LVLMs) in video understanding.
MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases.
The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy.
arXiv: 2024-06-20
- Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs [20.168429351519055]
Video understanding is a crucial next step for multimodal large language models (MLLMs).
We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework based on synthetic video generation.
We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities.
arXiv: 2024-06-13
- How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs [98.37571997794072]
We present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES).
CVRR-ES comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions.
Our findings provide valuable insights for building the next generation of human-centric AI systems.
arXiv: 2024-05-06
- EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [70.19437817951673]
We argue that it is hard to judge large conditional generative models with simple metrics, since these models are often trained on very large datasets and have multi-aspect abilities.
Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation.
We then evaluate state-of-the-art video generative models on our carefully designed benchmark in terms of visual quality, content quality, motion quality, and text-video alignment, using 17 well-selected objective metrics.
arXiv: 2023-10-17
- DeVAn: Dense Video Annotation for Video-Language Models [68.70692422636313]
We present a novel human-annotated dataset for evaluating the ability of visual-language models to generate descriptions of real-world video clips.
The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests.
arXiv: 2023-10-08
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.