Perception Test 2023: A Summary of the First Challenge And Outcome
- URL: http://arxiv.org/abs/2312.13090v1
- Date: Wed, 20 Dec 2023 15:12:27 GMT
- Title: Perception Test 2023: A Summary of the First Challenge And Outcome
- Authors: Joseph Heyward, João Carreira, Dima Damen, Andrew Zisserman, Viorica Pătrăucean
- Abstract summary: The First Perception Test challenge was held as a half-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2023.
The goal was to benchmark state-of-the-art video models on the recently proposed Perception Test benchmark.
We summarise in this report the task descriptions, metrics, baselines, and results.
- Score: 67.0525378209708
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The First Perception Test challenge was held as a half-day workshop alongside
the IEEE/CVF International Conference on Computer Vision (ICCV) 2023, with the
goal of benchmarking state-of-the-art video models on the recently proposed
Perception Test benchmark. The challenge had six tracks spanning low-level and
high-level tasks, with both language and non-language interfaces, across
video, audio, and text modalities: object tracking, point
tracking, temporal action localisation, temporal sound localisation,
multiple-choice video question-answering, and grounded video
question-answering. We summarise in this report the task descriptions, metrics,
baselines, and results.
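For intuition, the temporal action localisation and temporal sound localisation tracks are typically scored with metrics built on the temporal intersection-over-union (IoU) between predicted and ground-truth segments. A minimal sketch of that overlap measure, assuming segments given as (start, end) in seconds (illustrative only, not the challenge's official evaluation code):

    def temporal_iou(pred, gt):
        # Temporal IoU between two segments given as (start, end) in seconds.
        inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
        union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
        return inter / union if union > 0 else 0.0

    # Example: a 2s overlap over a 4s union gives IoU 0.5.
    print(temporal_iou((2.0, 5.0), (3.0, 6.0)))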
Related papers
- Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark [64.16672247204997]
We organised the Second Perception Test challenge as a half-day workshop alongside the IEEE/CVF European Conference on Computer Vision (ECCV) 2024.
The goal was to benchmark state-of-the-art video models and measure the progress since last year using the Perception Test benchmark.
This year, the challenge had seven tracks and covered low-level and high-level tasks, with language and non-language interfaces, across video, audio, and text modalities.
The additional track covered hour-long video understanding and introduced a novel video QA benchmark 1h-walk VQA.
arXiv Detail & Related papers (2024-11-29T18:57:25Z)
- AIM 2024 Challenge on Video Saliency Prediction: Methods and Results [105.09572982350532]
This paper reviews the Challenge on Video Saliency Prediction at AIM 2024.
The goal of the participants was to develop a method for predicting accurate saliency maps for the provided set of video sequences.
arXiv Detail & Related papers (2024-09-23T08:59:22Z)
- The 2nd Solution for LSVOS Challenge RVOS Track: Spatial-temporal Refinement for Consistent Semantic Segmentation [0.0]
We propose a method to enhance the temporal consistency of the referring object segmentation model.
Our method placed 2nd in the final ranking of the RVOS Track at the ECCV 2024 LSVOS Challenge.
arXiv Detail & Related papers (2024-08-22T14:43:02Z)
- 2nd Place Solution for MeViS Track in CVPR 2024 PVUW Workshop: Motion Expression guided Video Segmentation [8.20168024462357]
Motion Expression guided Video Segmentation is a challenging task that aims to segment objects in video based on natural language expressions containing motion descriptions.
We introduce mask information obtained from the video instance segmentation model as preliminary information for temporal enhancement and employ SAM for spatial refinement.
Our method achieved a score of 49.92 J&F in the validation phase and 54.20 J&F in the test phase, securing 2nd place in the final ranking of the MeViS Track at the CVPR 2024 PVUW Challenge.
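For reference, J&F is the standard video object segmentation metric: the mean of the region similarity J (mask IoU) and the contour accuracy F (a boundary F-measure). A minimal sketch of the region term, assuming binary numpy masks (the boundary term needs contour extraction and is omitted here):

    import numpy as np

    def region_similarity(pred_mask, gt_mask):
        # J: intersection-over-union of binary segmentation masks.
        pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
        union = np.logical_or(pred, gt).sum()
        if union == 0:
            return 1.0  # both masks empty: perfect agreement by convention
        return np.logical_and(pred, gt).sum() / union

    # The reported score averages J and the boundary F-measure over frames:
    # jf = 0.5 * (mean_j + mean_f)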
arXiv Detail & Related papers (2024-06-20T02:16:23Z)
- Perception Test: A Diagnostic Benchmark for Multimodal Video Models [78.64546291816117]
We propose a novel multimodal video benchmark to evaluate the perception and reasoning skills of pre-trained multimodal models.
The Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities.
The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot / few-shot or limited finetuning regime.
arXiv Detail & Related papers (2023-05-23T07:54:37Z)
- Mobile App Tasks with Iterative Feedback (MoTIF): Addressing Task Feasibility in Interactive Visual Environments [54.405920619915655]
We introduce Mobile app Tasks with Iterative Feedback (MoTIF), a dataset of natural language commands spanning the greatest number of interactive environments to date.
MoTIF is the first to contain natural language requests for interactive environments that are not satisfiable.
We perform initial feasibility classification experiments and reach an F1 score of only 37.3, verifying the need for richer vision-language representations.
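For context, task feasibility here is a binary classification problem, so the F1 score is the harmonic mean of precision and recall over the satisfiable/unsatisfiable labels. A minimal sketch with invented labels, using scikit-learn:

    from sklearn.metrics import f1_score

    # Hypothetical predictions: 1 = request satisfiable, 0 = not satisfiable.
    y_true = [1, 0, 1, 1, 0, 0, 1]
    y_pred = [1, 1, 0, 1, 0, 1, 1]
    print(f1_score(y_true, y_pred))  # harmonic mean of precision and recall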
arXiv Detail & Related papers (2021-04-17T14:48:02Z)
- The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020) [186.7816349401443]
We present a new video understanding pentathlon challenge, an open competition held in conjunction with the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2020.
The objective of the challenge was to explore and evaluate new methods for text-to-video retrieval.
arXiv Detail & Related papers (2020-08-03T09:55:26Z)
- Dense-Captioning Events in Videos: SYSU Submission to ActivityNet Challenge 2020 [8.462158729006715]
This report presents a brief description of our submission to the dense video captioning task of ActivityNet Challenge 2020.
Our approach achieves a 9.28 METEOR score on the test set.
arXiv Detail & Related papers (2020-06-21T02:38:59Z)
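As background, METEOR scores a generated caption against reference captions via unigram matching (with stemming and WordNet synonyms) combined with a fragmentation penalty. A hedged usage sketch with NLTK's implementation, using invented caption strings (requires the wordnet corpus):

    import nltk
    from nltk.translate.meteor_score import meteor_score

    nltk.download("wordnet", quiet=True)  # lexical resource for synonym matching

    # NLTK expects the references and the hypothesis to be pre-tokenized.
    reference = "a man slices vegetables in the kitchen".split()
    hypothesis = "a man is cutting vegetables in a kitchen".split()
    print(meteor_score([reference], hypothesis))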
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences arising from its use.