Perception Test: A Diagnostic Benchmark for Multimodal Video Models
- URL: http://arxiv.org/abs/2305.13786v2
- Date: Mon, 30 Oct 2023 18:35:48 GMT
- Title: Perception Test: A Diagnostic Benchmark for Multimodal Video Models
- Authors: Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià
Recasens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph
Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova,
Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster,
Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen,
Andrew Zisserman, João Carreira
- Abstract summary: We propose a novel multimodal video benchmark to evaluate the perception and reasoning skills of pre-trained multimodal models.
The Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities.
The benchmark probes pre-trained models for their transfer capabilities in a zero-shot, few-shot, or limited fine-tuning regime.
- Score: 78.64546291816117
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel multimodal video benchmark - the Perception Test - to
evaluate the perception and reasoning skills of pre-trained multimodal models
(e.g. Flamingo, SeViLA, or GPT-4). Compared to existing benchmarks that focus
on computational tasks (e.g. classification, detection or tracking), the
Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and
types of reasoning (descriptive, explanatory, predictive, counterfactual)
across video, audio, and text modalities, to provide a comprehensive and
efficient evaluation tool. The benchmark probes pre-trained models for their
transfer capabilities in a zero-shot, few-shot, or limited fine-tuning regime.
For these purposes, the Perception Test introduces 11.6k real-world videos with
an average length of 23 seconds, designed to show perceptually interesting
situations, filmed by
around 100 participants worldwide. The videos are densely annotated with six
types of labels (multiple-choice and grounded video question-answers, object
and point tracks, temporal action and sound segments), enabling both language
and non-language evaluations. The fine-tuning and validation splits of the
benchmark are publicly available (CC-BY license), in addition to a challenge
server with a held-out test split. Human baseline results compared to
state-of-the-art video QA models show a substantial gap in performance (91.4%
vs 46.2%), suggesting that there is significant room for improvement in
multimodal video understanding.
Dataset, baseline code, and challenge server are available at
https://github.com/deepmind/perception_test
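As a rough illustration of the evaluation protocol described above, the sketch below computes top-1 accuracy for the multiple-choice video QA track by asking a frozen pre-trained model to score each answer option. This is a minimal sketch under stated assumptions: the record fields (video_id, question, options, answer_id) and the score_option callable are hypothetical and do not reflect the actual annotation schema or baseline code in the perception_test repository.

```python
from typing import Callable, Sequence

def evaluate_mcqa(records: Sequence[dict],
                  score_option: Callable[[str, str, str], float]) -> float:
    """Top-1 accuracy for multiple-choice video QA.

    Assumes a hypothetical record layout: each dict holds 'video_id',
    'question', 'options' (a list of answer strings), and 'answer_id'
    (the index of the correct option). The real Perception Test schema
    may differ.
    """
    correct = 0
    for rec in records:
        # Score every option with the frozen pre-trained model, e.g. via the
        # log-likelihood of the option text given the video and the question.
        scores = [score_option(rec["video_id"], rec["question"], option)
                  for option in rec["options"]]
        prediction = max(range(len(scores)), key=scores.__getitem__)
        correct += int(prediction == rec["answer_id"])
    return correct / len(records)

if __name__ == "__main__":
    # Tiny in-memory example with a dummy scorer standing in for a real model.
    toy_records = [{"video_id": "vid_0001",
                    "question": "What does the person do after lifting the cup?",
                    "options": ["drinks from it", "puts it down", "throws it"],
                    "answer_id": 1}]
    dummy_scorer = lambda video_id, question, option: float(len(option))
    print(f"top-1 accuracy: {evaluate_mcqa(toy_records, dummy_scorer):.3f}")
```

The same loop extends to few-shot or limited fine-tuning evaluation by conditioning the scorer on a handful of annotated examples before scoring.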
Related papers
- TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models [75.42002690128486]
TemporalBench is a new benchmark dedicated to evaluating fine-grained temporal understanding in videos.
It consists of 10K video question-answer pairs, derived from 2K high-quality human annotations detailing the temporal dynamics in video clips.
Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
- TVBench: Redesigning Video-Language Evaluation [48.71203934876828]
We show that the most widely used video-language benchmarks can be solved without requiring much temporal reasoning.
We propose TVBench, a novel open-source video multiple-choice question-answering benchmark.
arXiv Detail & Related papers (2024-10-10T09:28:36Z)
- E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding [57.630136434038384]
We introduce E.T. Bench (Event-Level & Time-Sensitive Video Understanding Benchmark), a large-scale benchmark for open-ended event-level video understanding.
We extensively evaluated 8 Image-LLMs and 12 Video-LLMs on our benchmark, and the results reveal that state-of-the-art models for coarse-level (video-level) understanding struggle to solve our fine-grained tasks.
Our simple but effective solution demonstrates superior performance in multiple scenarios.
arXiv Detail & Related papers (2024-09-26T17:53:04Z)
- Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning [50.26965628047682]
Adapting pre-trained models to open classes is a challenging problem in machine learning.
In this paper, we combine the advantages of both and propose a test-time prompt tuning approach.
Our proposed method outperforms all comparison methods on average considering both base and new classes.
arXiv Detail & Related papers (2024-08-29T12:34:01Z)
- Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs [20.168429351519055]
Video understanding is a crucial next step for multimodal large language models (MLLMs).
We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation.
We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities.
arXiv Detail & Related papers (2024-06-13T17:50:05Z)
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future work on advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
- Fill-in-the-blank as a Challenging Video Understanding Evaluation Framework [19.031957183047048]
We introduce a novel dataset consisting of 28,000 videos and fill-in-the-blank tests.
We show that both a multimodal model and a strong language model fall well short of human performance.
arXiv Detail & Related papers (2021-04-09T04:00:10Z)