Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
- URL: http://arxiv.org/abs/2405.21075v2
- Date: Sun, 16 Jun 2024 15:49:12 GMT
- Title: Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
- Authors: Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Rongrong Ji, Xing Sun,
- Abstract summary: Video-MME is the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis.
We extensively evaluate various state-of-the-art MLLMs, including GPT-4 series and Gemini 1.5 Pro, as well as open-source image models.
Our experiments reveal that Gemini 1.5 Pro is the best-performing commercial model, significantly outperforming the open-source models.
- Score: 118.08008540513596
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a focal point in recent advancements. However, the predominant focus remains on developing their capabilities in static image understanding. The potential of MLLMs in processing sequential visual data is still insufficiently explored, highlighting the absence of a comprehensive, high-quality assessment of their performance. In this paper, we introduce Video-MME, the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis. Our work distinguishes from existing benchmarks through four key features: 1) Diversity in video types, spanning 6 primary visual domains with 30 subfields to ensure broad scenario generalizability; 2) Duration in temporal dimension, encompassing both short-, medium-, and long-term videos, ranging from 11 seconds to 1 hour, for robust contextual dynamics; 3) Breadth in data modalities, integrating multi-modal inputs besides video frames, including subtitles and audios, to unveil the all-round capabilities of MLLMs; 4) Quality in annotations, utilizing rigorous manual labeling by expert annotators to facilitate precise and reliable model assessment. 900 videos with a total of 254 hours are manually selected and annotated by repeatedly viewing all the video content, resulting in 2,700 question-answer pairs. With Video-MME, we extensively evaluate various state-of-the-art MLLMs, including GPT-4 series and Gemini 1.5 Pro, as well as open-source image models like InternVL-Chat-V1.5 and video models like LLaVA-NeXT-Video. Our experiments reveal that Gemini 1.5 Pro is the best-performing commercial model, significantly outperforming the open-source models. Our dataset along with these findings underscores the need for further improvements in handling longer sequences and multi-modal data. Project Page: https://video-mme.github.io
Related papers
- StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding [48.24581407583288]
StreamingBench is the first comprehensive benchmark designed to evaluate the streaming video understanding capabilities of MLLMs.
We conduct experiments on StreamingBench with 13 open-source and proprietary MLLMs and find that even the most advanced proprietary MLLMs perform significantly below human-level streaming video understanding capabilities.
arXiv Detail & Related papers (2024-11-06T02:50:30Z) - MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding [67.56182262082729]
We introduce MMBench-Video, a quantitative benchmark to rigorously evaluate large vision-language models (LVLMs) in video understanding.
MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases.
The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy.
arXiv Detail & Related papers (2024-06-20T17:26:01Z) - Towards Event-oriented Long Video Understanding [101.48089908037888]
Event-Bench is an event-oriented long video understanding benchmark built on existing datasets and human annotations.
VIM is a cost-effective method that enhances video MLLMs using merged, event-intensive video instructions.
arXiv Detail & Related papers (2024-06-20T09:14:19Z) - VideoVista: A Versatile Benchmark for Video Understanding and Reasoning [46.838692817107116]
We present VideoVista, a video QA benchmark that integrates challenges across diverse content categories, durations, and abilities.
VideoVista comprises 25,000 questions derived from 3,400 videos spanning 14 categories (e.g., Howto, Film, and Entertainment) with durations ranging from a few seconds to over 10 minutes.
It encompasses 19 types of understanding tasks (e.g., anomaly detection, interaction understanding) and 8 reasoning tasks (e.g., logical reasoning, causal reasoning)
arXiv Detail & Related papers (2024-06-17T08:09:00Z) - VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs [55.82090875098132]
VideoLLaMA 2 is a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video and audio-oriented tasks.
VideoLLaMA 2 consistently achieves competitive results among open-source models and even gets close to some proprietary models on several benchmarks.
arXiv Detail & Related papers (2024-06-11T17:22:23Z) - How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs [98.37571997794072]
We present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES)
CVRR-ES comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions.
Our findings provide valuable insights for building the next generation of human-centric AI systems.
arXiv Detail & Related papers (2024-05-06T17:59:45Z) - MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [63.14000659130736]
We introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench.
We first introduce a novel static-to-dynamic method to define these temporal-related tasks.
Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task.
arXiv Detail & Related papers (2023-11-28T17:59:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.