Perception Test 2023: A Summary of the First Challenge And Outcome
- URL: http://arxiv.org/abs/2312.13090v1
- Date: Wed, 20 Dec 2023 15:12:27 GMT
- Title: Perception Test 2023: A Summary of the First Challenge And Outcome
- Authors: Joseph Heyward, João Carreira, Dima Damen, Andrew Zisserman, Viorica Pătrăucean
- Abstract summary: The First Perception Test challenge was held as a half-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2023.
The goal was to benchmark state-of-the-art video models on the recently proposed Perception Test benchmark.
We summarise in this report the task descriptions, metrics, baselines, and results.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The First Perception Test challenge was held as a half-day workshop alongside
the IEEE/CVF International Conference on Computer Vision (ICCV) 2023, with the
goal of benchmarking state-of-the-art video models on the recently proposed
Perception Test benchmark. The challenge had six tracks covering low-level and
high-level tasks, with both a language and non-language interface, across
video, audio, and text modalities, and covering: object tracking, point
tracking, temporal action localisation, temporal sound localisation,
multiple-choice video question-answering, and grounded video
question-answering. We summarise in this report the task descriptions, metrics,
baselines, and results.
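The localisation tracks listed above are typically scored with intersection-over-union style metrics. As an illustration only (the function name and setup are ours, not the challenge's official evaluation code), temporal IoU between a predicted and a ground-truth action segment can be computed as:

```python
def temporal_iou(pred, gt):
    """Temporal intersection-over-union between two (start, end) segments,
    given in seconds. Returns a value in [0, 1]."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Predicted action segment vs. ground truth (seconds)
print(temporal_iou((2.0, 5.0), (3.0, 6.0)))  # → 0.5
```

Detection-style metrics such as mean average precision then count a prediction as correct when its IoU with a ground-truth segment exceeds a chosen threshold.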
Related papers
- 2nd Place Solution for MeViS Track in CVPR 2024 PVUW Workshop: Motion Expression guided Video Segmentation
Motion Expression guided Video Segmentation is a challenging task that aims to segment objects in video based on natural language expressions containing motion descriptions.
We introduce mask information obtained from the video instance segmentation model as preliminary information for temporal enhancement and employ SAM for spatial refinement.
Our method achieved scores of 49.92 J&F in the validation phase and 54.20 J&F in the test phase, securing 2nd place in the MeViS Track of the CVPR 2024 PVUW Challenge.
arXiv Detail & Related papers (2024-06-20T02:16:23Z)
- The Second DISPLACE Challenge: DIarization of SPeaker and LAnguage in Conversational Environments
The dataset contains 158 hours of speech, consisting of both supervised and unsupervised mono-channel far-field recordings.
12 hours of close-field mono-channel recordings were provided for the ASR track conducted on 5 Indian languages.
We have compared our baseline models and the team's performances on evaluation data of DISPLACE-2023 to emphasize the advancements made in this second version of the challenge.
arXiv Detail & Related papers (2024-06-13T17:32:32Z)
- 1st Place Solution for MeViS Track in CVPR 2024 PVUW Workshop: Motion Expression guided Video Segmentation
We investigate the effectiveness of static-dominant data and frame sampling for referring video object segmentation (RVOS).
Our solution achieves a J&F score of 0.5447 in the competition phase and ranks 1st in the MeViS track of the PVUW Challenge.
arXiv Detail & Related papers (2024-06-11T08:05:26Z)
- Perception Test: A Diagnostic Benchmark for Multimodal Video Models
We propose a novel multimodal video benchmark to evaluate the perception and reasoning skills of pre-trained multimodal models.
The Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities.
The benchmark probes pre-trained models for their transfer capabilities in zero-shot, few-shot, or limited-finetuning regimes.
arXiv Detail & Related papers (2023-05-23T07:54:37Z)
- The 2021 NIST Speaker Recognition Evaluation
The 2021 Speaker Recognition Evaluation (SRE21) was the latest cycle of the ongoing evaluation series conducted by the U.S. National Institute of Standards and Technology (NIST) since 1996.
This paper presents an overview of SRE21 including the tasks, performance metric, data, evaluation protocol, results and system performance analyses.
arXiv Detail & Related papers (2022-04-21T16:18:52Z)
- Mobile App Tasks with Iterative Feedback (MoTIF): Addressing Task Feasibility in Interactive Visual Environments
We introduce Mobile app Tasks with Iterative Feedback (MoTIF), a dataset with natural language commands for the greatest number of interactive environments to date.
MoTIF is the first to contain natural language requests for interactive environments that are not satisfiable.
We perform initial feasibility classification experiments and reach an F1 score of only 37.3, confirming the need for richer vision-language representations.
arXiv Detail & Related papers (2021-04-17T14:48:02Z)
- The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020)
We present a new video understanding pentathlon challenge, an open competition held in conjunction with the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2020.
The objective of the challenge was to explore and evaluate new methods for text-to-video retrieval.
arXiv Detail & Related papers (2020-08-03T09:55:26Z)
- The NTT DCASE2020 Challenge Task 6 system: Automated Audio Captioning with Keywords and Sentence Length Estimation
This report describes our system submitted to the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge, Task 6.
Our submission focuses on solving two indeterminacy problems in automated audio captioning: word selection indeterminacy and sentence length indeterminacy.
We solve the main caption-generation task and both indeterminacy sub-problems simultaneously by estimating keywords and sentence length through multi-task learning.
arXiv Detail & Related papers (2020-07-01T04:26:27Z)
- Dense-Captioning Events in Videos: SYSU Submission to ActivityNet Challenge 2020
This report presents a brief description of our submission to the dense video captioning task of ActivityNet Challenge 2020.
Our approach achieves a 9.28 METEOR score on the test set.
arXiv Detail & Related papers (2020-06-21T02:38:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including this list) and is not responsible for any consequences of its use.