First Place Solution to the Multiple-choice Video QA Track of The Second Perception Test Challenge
- URL: http://arxiv.org/abs/2409.13538v1
- Date: Fri, 20 Sep 2024 14:31:13 GMT
- Title: First Place Solution to the Multiple-choice Video QA Track of The Second Perception Test Challenge
- Authors: Yingzhe Peng, Yixiao Yuan, Zitian Ao, Huapeng Zhou, Kangqi Wang, Qipeng Zhu, Xu Yang
- Abstract summary: We present our first-place solution to the Multiple-choice Video Question Answering track of The Second Perception Test Challenge.
This competition posed a complex video understanding task, requiring models to accurately comprehend and answer questions about video content.
- Score: 4.075139470537149
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this report, we present our first-place solution to the Multiple-choice Video Question Answering (QA) track of The Second Perception Test Challenge. This competition posed a complex video understanding task, requiring models to accurately comprehend and answer questions about video content. To address this challenge, we leveraged the powerful QwenVL2 (7B) model and fine-tuned it on the provided training set. Additionally, we employed model ensemble strategies and Test-Time Augmentation to boost performance. Through continuous optimization, our approach achieved a Top-1 Accuracy of 0.7647 on the leaderboard.
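The report does not include code, but the ensemble plus Test-Time Augmentation step it describes typically amounts to averaging per-option probabilities over several prediction runs (different checkpoints and/or augmented views of the video) and picking the best-scoring answer. A minimal sketch under that assumption; the function names, array shapes, and example logits below are hypothetical, not from the paper:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_tta_predict(logits_per_run: list) -> int:
    """Average answer-option probabilities across runs
    (e.g. several fine-tuned checkpoints x augmented views)
    and return the index of the best-scoring option."""
    probs = np.stack([softmax(l) for l in logits_per_run])  # (runs, options)
    return int(probs.mean(axis=0).argmax())

# Hypothetical logits for one 4-option question from three runs:
runs = [
    np.array([2.0, 1.0, 0.5, 0.1]),
    np.array([1.5, 2.5, 0.2, 0.0]),  # this run disagrees
    np.array([2.2, 1.1, 0.3, 0.4]),
]
answer = ensemble_tta_predict(runs)  # majority evidence favors option 0
```

Averaging probabilities (rather than raw logits) keeps runs with differently scaled outputs from dominating the vote; a weighted mean over checkpoints is a common variant.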
Related papers
- VQA$^2$: Visual Question Answering for Video Quality Assessment [76.81110038738699]
Video Quality Assessment originally focused on quantitative video quality scoring.
It is now evolving towards more comprehensive visual quality understanding tasks.
We introduce the first visual question answering instruction dataset that focuses entirely on video quality assessment.
We conduct extensive experiments on both video quality scoring and video quality understanding tasks.
arXiv Detail & Related papers (2024-11-06T09:39:52Z)
- AIM 2024 Challenge on Video Super-Resolution Quality Assessment: Methods and Results [76.64868221556145]
This paper presents the Video Super-Resolution (SR) Quality Assessment (QA) Challenge that was part of the Advances in Image Manipulation (AIM) workshop.
The task of this challenge was to develop an objective QA method for videos upscaled 2x and 4x by modern image- and video-SR algorithms.
The goal was to advance the state-of-the-art in SR QA, which had proven to be a challenging problem with limited applicability of traditional QA methods.
arXiv Detail & Related papers (2024-10-05T16:42:23Z)
- The Solution for the ICCV 2023 Perception Test Challenge 2023 -- Task 6 -- Grounded videoQA [3.38659196496483]
Our research reveals that the fixed official baseline method for video question answering involves two main steps: visual grounding and object tracking.
A significant challenge emerges during the initial step, where selected frames may lack clearly identifiable target objects.
arXiv Detail & Related papers (2024-07-02T03:13:27Z)
- A Boosted Model Ensembling Approach to Ball Action Spotting in Videos: The Runner-Up Solution to CVPR'23 SoccerNet Challenge [13.784332796429556]
This report presents our solution to Ball Action Spotting in videos.
Our method reached second place in the CVPR'23 SoccerNet Challenge.
arXiv Detail & Related papers (2023-06-09T09:25:48Z)
- The Runner-up Solution for YouTube-VIS Long Video Challenge 2022 [72.13080661144761]
We adopt the previously proposed online video instance segmentation method IDOL for this challenge.
We use pseudo labels to further aid contrastive learning, so as to obtain more temporally consistent instance embeddings.
The proposed method obtains 40.2 AP on the YouTube-VIS 2022 long video dataset and was ranked second in this challenge.
arXiv Detail & Related papers (2022-11-18T01:40:59Z)
- ReLER@ZJU-Alibaba Submission to the Ego4D Natural Language Queries Challenge 2022 [61.81899056005645]
Given a video clip and a text query, the goal of this challenge is to locate a temporal moment of the video clip where the answer to the query can be obtained.
We propose a multi-scale cross-modal transformer and a video frame-level contrastive loss to fully uncover the correlation between language queries and video clips.
The experimental results demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2022-07-01T12:48:35Z)
- NTIRE 2020 Challenge on Video Quality Mapping: Methods and Results [131.05847851975236]
This paper reviews the NTIRE 2020 challenge on video quality mapping (VQM).
The challenge includes both a supervised track (track 1) and a weakly-supervised track (track 2) for two benchmark datasets.
For track 1, in total 7 teams competed in the final test phase, demonstrating novel and effective solutions to the problem.
For track 2, some existing methods are evaluated, showing promising solutions to the weakly-supervised video quality mapping problem.
arXiv Detail & Related papers (2020-05-05T15:45:16Z)
- AIM 2019 Challenge on Video Temporal Super-Resolution: Methods and Results [129.15554076593762]
This paper reviews the first AIM challenge on video temporal super-resolution (frame interpolation).
From low-frame-rate (15 fps) video sequences, challenge participants are asked to submit higher-frame-rate (60 fps) video sequences.
We employ the REDS VTSR dataset, derived from diverse videos captured with hand-held cameras, for training and evaluation.
arXiv Detail & Related papers (2020-05-04T01:51:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.