Perception Test 2023: A Summary of the First Challenge And Outcome
- URL: http://arxiv.org/abs/2312.13090v1
- Date: Wed, 20 Dec 2023 15:12:27 GMT
- Title: Perception Test 2023: A Summary of the First Challenge And Outcome
- Authors: Joseph Heyward, João Carreira, Dima Damen, Andrew Zisserman, Viorica Pătrăucean
- Abstract summary: The First Perception Test challenge was held as a half-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2023.
The goal was to benchmark state-of-the-art video models on the recently proposed Perception Test benchmark.
We summarise in this report the task descriptions, metrics, baselines, and results.
- Score: 67.0525378209708
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The First Perception Test challenge was held as a half-day workshop alongside
the IEEE/CVF International Conference on Computer Vision (ICCV) 2023, with the
goal of benchmarking state-of-the-art video models on the recently proposed
Perception Test benchmark. The challenge had six tracks covering low-level and
high-level tasks, with both a language and non-language interface, across
video, audio, and text modalities, and covering: object tracking, point
tracking, temporal action localisation, temporal sound localisation,
multiple-choice video question-answering, and grounded video
question-answering. We summarise in this report the task descriptions, metrics,
baselines, and results.
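For orientation, the sketch below shows per-frame intersection-over-union (IoU) averaged over a track, the style of overlap-based score used in the tracking tracks. The function names and the exact aggregation are illustrative assumptions; the report defines the official metric for each track.

```python
# Minimal sketch (not the challenge code): mean per-frame IoU for one
# tracked object, the style of overlap metric used for box tracking.
# All names here are hypothetical.

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_track_iou(pred_boxes, gt_boxes):
    """Average IoU over corresponding frames of predicted and ground-truth tracks."""
    ious = [box_iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return sum(ious) / len(ious) if ious else 0.0

# Example: a two-frame track with a small drift in the second frame.
pred = [(10, 10, 50, 50), (12, 12, 52, 52)]
gt   = [(10, 10, 50, 50), (14, 14, 54, 54)]
print(f"mean IoU = {mean_track_iou(pred, gt):.3f}")
```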
Related papers
- AIM 2024 Challenge on Video Saliency Prediction: Methods and Results [105.09572982350532]
This paper reviews the Challenge on Video Saliency Prediction at AIM 2024.
The goal of the participants was to develop a method for predicting accurate saliency maps for the provided set of video sequences.
arXiv Detail & Related papers (2024-09-23T08:59:22Z) - The 2nd Solution for LSVOS Challenge RVOS Track: Spatial-temporal Refinement for Consistent Semantic Segmentation [0.0]
We propose a method to enhance the temporal consistency of referring video object segmentation models.
Our method placed 2nd in the final ranking of the RVOS Track at the ECCV 2024 LSVOS Challenge.
arXiv Detail & Related papers (2024-08-22T14:43:02Z) - 2nd Place Solution for MeViS Track in CVPR 2024 PVUW Workshop: Motion Expression guided Video Segmentation [8.20168024462357]
Motion Expression guided Video Segmentation is a challenging task that aims to segment objects in video based on natural language expressions containing motion descriptions.
We introduce mask information obtained from the video instance segmentation model as preliminary information for temporal enhancement and employ SAM for spatial refinement.
Our method achieved a score of 49.92 J&F in the validation phase and 54.20 J&F in the test phase, securing 2nd place in the MeViS Track at the CVPR 2024 PVUW Challenge (a brief sketch of the J&F metric follows this list).
arXiv Detail & Related papers (2024-06-20T02:16:23Z) - Perception Test: A Diagnostic Benchmark for Multimodal Video Models [78.64546291816117]
We propose a novel multimodal video benchmark to evaluate the perception and reasoning skills of pre-trained multimodal models.
The Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities.
The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot / few-shot or limited finetuning regime.
arXiv Detail & Related papers (2023-05-23T07:54:37Z) - The 2021 NIST Speaker Recognition Evaluation [1.5282767384702267]
The 2021 Speaker Recognition Evaluation (SRE21) was the latest cycle of the ongoing evaluation series conducted by the U.S. National Institute of Standards and Technology (NIST) since 1996.
This paper presents an overview of SRE21 including the tasks, performance metric, data, evaluation protocol, results and system performance analyses.
arXiv Detail & Related papers (2022-04-21T16:18:52Z) - Mobile App Tasks with Iterative Feedback (MoTIF): Addressing Task
Feasibility in Interactive Visual Environments [54.405920619915655]
We introduce Mobile app Tasks with Iterative Feedback (MoTIF), a dataset of natural language commands spanning the largest number of interactive environments to date.
MoTIF is the first to contain natural language requests for interactive environments that are not satisfiable.
We perform initial feasibility classification experiments and reach an F1 score of only 37.3, verifying the need for richer vision-language representations.
arXiv Detail & Related papers (2021-04-17T14:48:02Z) - The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020) [186.7816349401443]
We present a new video understanding pentathlon challenge, an open competition held in conjunction with the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2020.
The objective of the challenge was to explore and evaluate new methods for text-to-video retrieval.
arXiv Detail & Related papers (2020-08-03T09:55:26Z) - The NTT DCASE2020 Challenge Task 6 system: Automated Audio Captioning
with Keywords and Sentence Length Estimation [49.41766997393417]
This report describes the system submitted to the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge, Task 6.
Our submission focuses on solving two indeterminacy problems in automated audio captioning: word selection indeterminacy and sentence length indeterminacy.
We simultaneously solve the main caption-generation task and the indeterminacy sub-problems by estimating keywords and sentence length through multi-task learning.
arXiv Detail & Related papers (2020-07-01T04:26:27Z) - Dense-Captioning Events in Videos: SYSU Submission to ActivityNet
Challenge 2020 [8.462158729006715]
This report presents a brief description of our submission to the dense video captioning task of ActivityNet Challenge 2020.
Our approach achieves a 9.28 METEOR score on the test set.
arXiv Detail & Related papers (2020-06-21T02:38:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.