Perception Test: A Diagnostic Benchmark for Multimodal Video Models
- URL: http://arxiv.org/abs/2305.13786v2
- Date: Mon, 30 Oct 2023 18:35:48 GMT
- Title: Perception Test: A Diagnostic Benchmark for Multimodal Video Models
- Authors: Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià
Recasens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph
Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova,
Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster,
Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen,
Andrew Zisserman, João Carreira
- Abstract summary: We propose a novel multimodal video benchmark to evaluate the perception and reasoning skills of pre-trained multimodal models.
The Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities.
The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot / few-shot or limited finetuning regime.
- Score: 78.64546291816117
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel multimodal video benchmark - the Perception Test - to
evaluate the perception and reasoning skills of pre-trained multimodal models
(e.g. Flamingo, SeViLA, or GPT-4). Compared to existing benchmarks that focus
on computational tasks (e.g. classification, detection or tracking), the
Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and
types of reasoning (descriptive, explanatory, predictive, counterfactual)
across video, audio, and text modalities, to provide a comprehensive and
efficient evaluation tool. The benchmark probes pre-trained models for their
transfer capabilities, in a zero-shot / few-shot or limited finetuning regime.
For these purposes, the Perception Test introduces 11.6k real-world videos, 23s
average length, designed to show perceptually interesting situations, filmed by
around 100 participants worldwide. The videos are densely annotated with six
types of labels (multiple-choice and grounded video question-answers, object
and point tracks, temporal action and sound segments), enabling both language
and non-language evaluations. The fine-tuning and validation splits of the
benchmark are publicly available (CC-BY license), in addition to a challenge
server with a held-out test split. Human baseline results compared to
state-of-the-art video QA models show a substantial gap in performance (91.4%
vs 46.2%), suggesting that there is significant room for improvement in
multimodal video understanding.
Dataset, baseline code, and challenge server are available at
https://github.com/deepmind/perception_test
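As a rough sketch of how the benchmark's multiple-choice video QA track could be scored in the zero-shot regime described above, the Python snippet below loads a JSON annotation file, queries a stand-in model for each question, and reports accuracy. The file path, the field names (mc_question, options, answer_id), and the predict_option stub are illustrative assumptions, not the official data format or API; the repository linked above provides the actual loaders and baselines.
```python
# Minimal sketch of a multiple-choice video QA evaluation loop.
# The annotation schema and file path below are assumptions for illustration;
# see https://github.com/deepmind/perception_test for the real format.
import json
import random
from typing import Dict, List


def predict_option(video_id: str, question: str, options: List[str]) -> int:
    """Stand-in for a pre-trained multimodal model: picks a random option index."""
    return random.randrange(len(options))


def evaluate(annotation_path: str) -> float:
    """Computes multiple-choice QA accuracy over a JSON annotation file (hypothetical schema)."""
    with open(annotation_path) as f:
        annotations: Dict[str, dict] = json.load(f)  # assumed: video_id -> record with questions

    correct = total = 0
    for video_id, record in annotations.items():
        for q in record.get("mc_question", []):  # assumed field names
            pred = predict_option(video_id, q["question"], q["options"])
            correct += int(pred == q["answer_id"])
            total += 1
    return correct / max(total, 1)


if __name__ == "__main__":
    # Placeholder path; the real annotation files are distributed via the repo above.
    print(f"Multiple-choice QA accuracy: {evaluate('perception_test_valid.json'):.3f}")
```
Replacing predict_option with calls to a pre-trained multimodal model (e.g. Flamingo or SeViLA) would approximate the zero-shot evaluation protocol mentioned in the abstract.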
Related papers
- Towards Event-oriented Long Video Understanding [101.48089908037888]
Event-Bench is an event-oriented long video understanding benchmark built on existing datasets and human annotations.
VIM is a cost-effective method that enhances video MLLMs using merged, event-intensive video instructions.
arXiv Detail & Related papers (2024-06-20T09:14:19Z)
- AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [98.69691822391069]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in the textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z)
- SportsSloMo: A New Benchmark and Baselines for Human-centric Video Frame Interpolation [11.198172694893927]
SportsSloMo is a benchmark consisting of more than 130K video clips and 1M video frames of high-resolution (≥720p) slow-motion sports videos crawled from YouTube.
We re-train several state-of-the-art methods on our benchmark, and the results show a decrease in their accuracy compared to other datasets.
We introduce two loss terms considering the human-aware priors, where we add auxiliary supervision to panoptic segmentation and human keypoints detection.
arXiv Detail & Related papers (2023-08-31T17:23:50Z)
- A Control-Centric Benchmark for Video Prediction [69.22614362800692]
We propose a benchmark for action-conditioned video prediction in the form of a control benchmark.
Our benchmark includes simulated environments with 11 task categories and 310 task instance definitions.
We then leverage our benchmark to study the effects of scaling model size, quantity of training data, and model ensembling.
arXiv Detail & Related papers (2023-04-26T17:59:45Z)
- UATVR: Uncertainty-Adaptive Text-Video Retrieval [90.8952122146241]
A common practice is to transfer text-video pairs to the same embedding space and craft cross-modal interactions with certain entities.
We propose an Uncertainty-Adaptive Text-Video Retrieval approach, termed UATVR, which models each look-up as a distribution matching procedure.
arXiv Detail & Related papers (2023-01-16T08:43:17Z)
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for further study of advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
- Fill-in-the-blank as a Challenging Video Understanding Evaluation Framework [19.031957183047048]
We introduce a novel dataset consisting of 28,000 videos and fill-in-the-blank tests.
We show that both a multimodal model and a strong language model fall well short of human performance.
arXiv Detail & Related papers (2021-04-09T04:00:10Z)
- Self-supervised pre-training and contrastive representation learning for multiple-choice video QA [39.78914328623504]
Video Question Answering (Video QA) requires fine-grained understanding of both video and language modalities to answer the given questions.
We propose novel training schemes for multiple-choice video question answering, with a self-supervised pre-training stage and supervised contrastive learning as an auxiliary task in the main training stage.
We evaluate our proposed model on highly competitive benchmark datasets related to multiple-choice video QA: TVQA, TVQA+, and DramaQA.
arXiv Detail & Related papers (2020-09-17T03:37:37Z)
- STAViS: Spatio-Temporal AudioVisual Saliency Network [45.04894808904767]
STAViS is a network that combines visual saliency and auditory features.
It learns to appropriately localize sound sources and to fuse the two saliencies in order to obtain a final saliency map.
We compare our method against 8 different state-of-the-art visual saliency models.
arXiv Detail & Related papers (2020-01-09T15:34:04Z)