AVATAR submission to the Ego4D AV Transcription Challenge
- URL: http://arxiv.org/abs/2211.09966v1
- Date: Fri, 18 Nov 2022 01:03:30 GMT
- Title: AVATAR submission to the Ego4D AV Transcription Challenge
- Authors: Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid
- Abstract summary: Our pipeline is based on AVATAR, a state-of-the-art encoder-decoder model for AV-ASR that performs early fusion of spectrograms and RGB images.
Our final method achieves a WER of 68.40 on the challenge test set, outperforming the baseline by 43.7%, and winning the challenge.
- Score: 79.21857972093332
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this report, we describe our submission to the Ego4D AudioVisual (AV) Speech Transcription Challenge 2022. Our pipeline is based on AVATAR, a state-of-the-art encoder-decoder model for AV-ASR that performs early fusion of spectrograms and RGB images. We describe the datasets, experimental settings and ablations. Our final method achieves a WER of 68.40 on the challenge test set, outperforming the baseline by 43.7%, and winning the challenge.
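The "early fusion of spectrograms and RGB images" described in the abstract can be pictured as projecting patch tokens from both modalities into a shared width and concatenating them into a single sequence before any encoder layer runs. Below is a minimal numpy sketch of that idea; all shapes, patch sizes, and random projections are illustrative toys, not AVATAR's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs: an 80-bin spectrogram over 100 frames, and
# 4 RGB frames at 32x32 resolution (toy sizes for illustration only).
spectrogram = rng.standard_normal((100, 80))
rgb_frames = rng.standard_normal((4, 32, 32, 3))

d_model = 64  # shared embedding width of the joint encoder

def patchify_audio(spec, patch_t=10):
    """Split the spectrogram into non-overlapping 10-frame patches and flatten."""
    n = spec.shape[0] // patch_t
    return spec[: n * patch_t].reshape(n, -1)  # (10, 800)

def patchify_video(frames, patch=16):
    """Split each frame into 16x16 patches and flatten them."""
    f, h, w, c = frames.shape
    p = frames.reshape(f, h // patch, patch, w // patch, patch, c)
    return p.transpose(0, 1, 3, 2, 4, 5).reshape(-1, patch * patch * c)  # (16, 768)

audio_tokens = patchify_audio(spectrogram)
video_tokens = patchify_video(rgb_frames)

# Modality-specific linear projections into the shared width.
w_audio = rng.standard_normal((audio_tokens.shape[1], d_model)) * 0.02
w_video = rng.standard_normal((video_tokens.shape[1], d_model)) * 0.02

# Early fusion: concatenate the projected tokens of both modalities into
# one sequence before the encoder sees them.
fused = np.concatenate([audio_tokens @ w_audio, video_tokens @ w_video], axis=0)
print(fused.shape)  # one joint token sequence for the shared encoder
```

Because both token streams share one sequence from the start, every encoder layer can attend across modalities, which is what distinguishes early fusion from late (decision-level) fusion.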
Related papers
- AIM 2024 Challenge on Video Super-Resolution Quality Assessment: Methods and Results [76.64868221556145]
This paper presents the Video Super-Resolution (SR) Quality Assessment (QA) Challenge that was part of the Advances in Image Manipulation (AIM) workshop.
The task of this challenge was to develop an objective QA method for videos upscaled 2x and 4x by modern image- and video-SR algorithms.
The goal was to advance the state-of-the-art in SR QA, which had proven to be a challenging problem with limited applicability of traditional QA methods.
arXiv Detail & Related papers (2024-10-05T16:42:23Z)
- Technical Report for CVPR 2024 WeatherProof Dataset Challenge: Semantic Segmentation on Paired Real Data [9.128113804878959]
This challenge targets semantic segmentation of images degraded to varying degrees by weather conditions from around the world.
We introduced a pre-trained large-scale vision foundation model, InternImage, and trained it using images with different levels of noise.
As a result, we achieved 2nd place in the challenge with 45.1 mIoU and fewer submissions than the other winners.
arXiv Detail & Related papers (2024-06-09T17:08:07Z) - NTIRE 2024 Quality Assessment of AI-Generated Content Challenge [141.37864527005226]
The challenge is divided into the image track and the video track.
The winning methods in both tracks have demonstrated superior prediction performance on AIGC.
arXiv Detail & Related papers (2024-04-25T15:36:18Z) - NTIRE 2023 Quality Assessment of Video Enhancement Challenge [97.809937484099]
This paper reports on the NTIRE 2023 Quality Assessment of Video Enhancement Challenge.
The challenge addresses a major problem in the field of video processing, namely video quality assessment (VQA) for enhanced videos.
The challenge has a total of 167 registered participants.
arXiv Detail & Related papers (2023-07-18T06:48:39Z)
- OxfordVGG Submission to the EGO4D AV Transcription Challenge [81.13727731938582]
This report presents the technical details of our submission on the EGO4D Audio-Visual (AV) Automatic Speech Recognition Challenge 2023 from the OxfordVGG team.
We present WhisperX, a system for efficient speech transcription of long-form audio with word-level time alignment, along with two text normalisers which are publicly available.
Our final submission obtained a Word Error Rate (WER) of 56.2% on the challenge test set, ranking 1st on the leaderboard.
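WER, the metric quoted throughout these transcription entries, is the word-level edit distance between hypothesis and reference, normalized by the reference length (so values above 100% are possible when the hypothesis is much worse than the reference). A minimal self-contained sketch, not the challenge's official scorer:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```

Real evaluation pipelines additionally normalize text (casing, punctuation, number formats) before scoring, which is exactly what the text normalisers released by the OxfordVGG team address.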
arXiv Detail & Related papers (2023-07-18T06:48:39Z) - QuAVF: Quality-aware Audio-Visual Fusion for Ego4D Talking to Me
Challenge [35.08570071278399]
This report describes our submission to the Ego4D Talking to Me (TTM) Challenge 2023.
We propose to use two separate models to process the input videos and audio.
With the simple architecture design, our model achieves 67.4% mean average precision (mAP) on the test set.
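The 67.4% figure above is mean average precision: per-query average precision (the mean of precision-at-k over the ranks at which positives are retrieved), averaged over queries. A minimal sketch of the metric itself on toy data, unrelated to the QuAVF model:

```python
def average_precision(scores, labels):
    """AP: mean of precision@k over the ranks k at which a positive appears."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(1, sum(labels))

# Two toy "queries"; mAP is the mean of the per-query APs.
ap1 = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])  # positives at ranks 1, 3
ap2 = average_precision([0.9, 0.1], [0, 1])                  # positive at rank 2
print((ap1 + ap2) / 2)  # mAP over the two queries
```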
arXiv Detail & Related papers (2023-06-30T05:14:45Z)
- STHG: Spatial-Temporal Heterogeneous Graph Learning for Advanced Audio-Visual Diarization [3.9886149789339327]
This report introduces our novel method named STHG for the Audio-Visual Diarization task of the Ego4D Challenge 2023.
Our key innovation is that we model all the speakers in a video using a single, unified heterogeneous graph learning framework.
Our final method obtains 61.1% DER on the test set of Ego4D, which significantly outperforms all the baselines as well as last year's winner.
arXiv Detail & Related papers (2023-06-18T17:55:02Z)
- Intel Labs at Ego4D Challenge 2022: A Better Baseline for Audio-Visual Diarization [3.9886149789339327]
This report describes our approach for the Audio-Visual Diarization (AVD) task of the Ego4D Challenge 2022.
First, we improve the detection performance of the camera wearer's voice activity by modifying the training scheme of its model.
Second, we discover that an off-the-shelf voice activity detection model can effectively remove false positives when it is applied solely to the camera wearer's voice activities.
arXiv Detail & Related papers (2022-10-14T12:54:03Z)
- NTIRE 2020 Challenge on Real Image Denoising: Dataset, Methods and Results [181.2861509946241]
This paper reviews the NTIRE 2020 challenge on real image denoising with focus on the newly introduced dataset.
The challenge is a new version of the previous NTIRE 2019 challenge on real image denoising that was based on the SIDD benchmark.
arXiv Detail & Related papers (2020-05-08T15:46:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.