Intel Labs at Ego4D Challenge 2022: A Better Baseline for Audio-Visual Diarization
- URL: http://arxiv.org/abs/2210.07764v3
- Date: Sun, 29 Oct 2023 19:22:11 GMT
- Title: Intel Labs at Ego4D Challenge 2022: A Better Baseline for Audio-Visual Diarization
- Authors: Kyle Min
- Abstract summary: This report describes our approach for the Audio-Visual Diarization (AVD) task of the Ego4D Challenge 2022.
First, we improve the detection performance of the camera wearer's voice activity by modifying the training scheme of its model.
Second, we discover that an off-the-shelf voice activity detection model can effectively remove false positives when it is applied solely to the camera wearer's voice activities.
- Score: 3.9886149789339327
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This report describes our approach for the Audio-Visual Diarization (AVD)
task of the Ego4D Challenge 2022. Specifically, we present multiple technical
improvements over the official baselines. First, we improve the detection
performance of the camera wearer's voice activity by modifying the training
scheme of its model. Second, we discover that an off-the-shelf voice activity
detection model can effectively remove false positives when it is applied
solely to the camera wearer's voice activities. Lastly, we show that better
active speaker detection leads to a better AVD outcome. Our final method
obtains 65.9% DER on the test set of Ego4D, which significantly outperforms all
the baselines. Our submission achieved 1st place in the Ego4D Challenge 2022.
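The report itself contains no code, so the following is only a minimal sketch of the second idea above, assuming the camera wearer's voice-activity predictions and the off-the-shelf VAD output are both available as sorted, non-overlapping (start, end) segments in seconds; the function name and the example segments are illustrative, not taken from the paper.

```python
# Minimal sketch (not the authors' code): suppress false positives in the
# camera wearer's voice-activity predictions by keeping only the portions
# that overlap speech segments found by an off-the-shelf VAD.

def intersect_segments(predicted, vad_speech):
    """Both inputs: sorted, non-overlapping (start, end) pairs in seconds."""
    filtered = []
    i, j = 0, 0
    while i < len(predicted) and j < len(vad_speech):
        p_start, p_end = predicted[i]
        v_start, v_end = vad_speech[j]
        start, end = max(p_start, v_start), min(p_end, v_end)
        if start < end:              # overlapping portion survives
            filtered.append((start, end))
        if p_end < v_end:            # advance whichever segment ends first
            i += 1
        else:
            j += 1
    return filtered

# Example: the short predicted burst at 5.0-5.4 s has no VAD support and is removed.
wearer_pred = [(0.0, 2.0), (5.0, 5.4), (8.0, 10.0)]
vad_speech = [(0.5, 2.5), (7.5, 9.0)]
print(intersect_segments(wearer_pred, vad_speech))   # [(0.5, 2.0), (8.0, 9.0)]
```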
Related papers
- Speech Foundation Model Ensembles for the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024 [8.940008511570207]
The rapid advancement of generative AI models presents significant challenges for detecting AI-generated deepfake singing voices.
The Singing Voice Deepfake Detection (SVDD) Challenge 2024 aims to address this complex task.
This work details our approach to achieving a leading system with a 1.79% pooled equal error rate (EER).
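A minimal sketch of score-level ensembling with a pooled equal error rate, of the kind the summary above refers to; the `equal_error_rate` helper is a generic approximation and the model scores are made-up numbers, not the challenge system.

```python
# Sketch only: average the scores of several countermeasure models and
# compute an (approximate) equal error rate on the pooled scores.
# Labels: 1 = bona fide, 0 = deepfake; higher score = more likely bona fide.
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate EER: sweep thresholds, take the best max(FAR, FRR)."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    best = 1.0
    for thr in np.sort(scores):
        far = np.mean(scores[labels == 0] >= thr)   # deepfakes accepted
        frr = np.mean(scores[labels == 1] < thr)    # bona fide rejected
        best = min(best, max(far, frr))
    return best

model_scores = np.array([[0.9, 0.2, 0.7, 0.1],      # model A
                         [0.8, 0.3, 0.6, 0.2]])     # model B
labels = np.array([1, 0, 1, 0])
pooled = model_scores.mean(axis=0)                  # simple score-level ensemble
print(equal_error_rate(pooled, labels))
```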
arXiv Detail & Related papers (2024-09-03T21:28:45Z)
- QuAVF: Quality-aware Audio-Visual Fusion for Ego4D Talking to Me Challenge [35.08570071278399]
This report describes our submission to the Ego4D Talking to Me (TTM) Challenge 2023.
We propose to use two separate models to process the input videos and audio.
With this simple architecture design, our model achieves 67.4% mean average precision (mAP) on the test set.
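A minimal sketch of quality-aware fusion: weight each branch's per-frame score by a quality estimate. The interfaces, weighting rule, and numbers are assumptions for illustration, not the QuAVF implementation.

```python
# Sketch only: fuse visual and audio scores with weights derived from
# per-frame quality estimates (e.g., face visibility for the visual branch).
import numpy as np

def quality_aware_fusion(visual_scores, visual_quality, audio_scores, audio_quality):
    """Weight each modality's score by its normalized quality estimate."""
    v_q = np.clip(visual_quality, 1e-6, None)
    a_q = np.clip(audio_quality, 1e-6, None)
    w = v_q / (v_q + a_q)
    return w * visual_scores + (1.0 - w) * audio_scores

# Frames where the face is poorly visible lean on the audio branch instead.
fused = quality_aware_fusion(
    visual_scores=np.array([0.9, 0.2, 0.6]),
    visual_quality=np.array([0.95, 0.10, 0.80]),
    audio_scores=np.array([0.8, 0.7, 0.5]),
    audio_quality=np.array([0.50, 0.90, 0.40]),
)
print(fused.round(3))
```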
arXiv Detail & Related papers (2023-06-30T05:14:45Z)
- STHG: Spatial-Temporal Heterogeneous Graph Learning for Advanced Audio-Visual Diarization [3.9886149789339327]
This report introduces our novel method named STHG for the Audio-Visual Diarization task of the Ego4D Challenge 2023.
Our key innovation is that we model all the speakers in a video using a single, unified heterogeneous graph learning framework.
Our final method obtains 61.1% DER on the test set of Ego4D, which significantly outperforms all the baselines as well as last year's winner.
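A schematic sketch of what a single heterogeneous graph over all speakers and the clip's audio might look like; the node and edge types and the builder function below are illustrative assumptions, not the STHG code.

```python
# Schematic sketch only: collect every candidate speaker's face-track frames
# and the clip's audio segments into one heterogeneous graph with typed edges.
# A graph neural network would then pass messages over this structure.

def build_heterogeneous_graph(face_tracks, audio_segments):
    """face_tracks: {speaker_id: [frame ids]}; audio_segments: [segment ids]."""
    nodes = {"visual": [], "audio": list(audio_segments)}
    edges = {"temporal": [], "cross_speaker": [], "audio_visual": []}

    for speaker, frames in face_tracks.items():
        nodes["visual"].extend((speaker, f) for f in frames)
        # temporal edges link consecutive frames of the same speaker
        edges["temporal"] += [((speaker, a), (speaker, b)) for a, b in zip(frames, frames[1:])]

    speakers = list(face_tracks)
    # cross-speaker edges connect different speakers appearing in the same clip
    edges["cross_speaker"] = [(s, t) for s in speakers for t in speakers if s != t]
    # audio-visual edges connect every audio segment to every visual node
    edges["audio_visual"] = [(seg, node) for seg in audio_segments for node in nodes["visual"]]
    return nodes, edges

nodes, edges = build_heterogeneous_graph({"A": [0, 1, 2], "B": [1, 2]}, ["seg0", "seg1"])
print(len(nodes["visual"]), {k: len(v) for k, v in edges.items()})
```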
arXiv Detail & Related papers (2023-06-18T17:55:02Z)
- AVATAR submission to the Ego4D AV Transcription Challenge [79.21857972093332]
Our pipeline is based on AVATAR, a state-of-the-art encoder-decoder model for AV-ASR that performs early fusion of spectrograms and RGB images.
Our final method achieves a WER of 68.40 on the challenge test set, outperforming the baseline by 43.7%, and winning the challenge.
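For reference, the WER figure quoted above is the standard word-level edit-distance metric; this is a generic implementation with a toy example, not AVATAR's evaluation code.

```python
# WER = (substitutions + insertions + deletions) / number of reference words,
# computed here with a standard dynamic-programming edit distance.

def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[-1][-1] / max(len(ref), 1)

print(word_error_rate("take the camera off the desk", "take camera of the desk"))  # ~0.333
```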
arXiv Detail & Related papers (2022-11-18T01:03:30Z)
- InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges [66.62885923201543]
We present our champion solutions to five tracks of the Ego4D challenge.
We leverage InternVideo, a video foundation model we developed, for these five Ego4D tasks.
InternVideo-Ego4D is an effective paradigm for adapting a strong foundation model to downstream egocentric video understanding tasks.
arXiv Detail & Related papers (2022-11-17T13:45:06Z)
- Contrastive Audio-Visual Masked Autoencoder [85.53776628515561]
We propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), which combines contrastive learning with masked data modeling.
Our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound.
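A minimal sketch of a CAV-MAE-style objective, i.e. a contrastive audio-visual alignment term plus a masked-reconstruction term; the shapes, the weighting factor `lam`, and the function itself are assumptions for illustration, not the released CAV-MAE code.

```python
# Sketch only: contrastive loss between pooled audio and visual embeddings
# (same-clip pairs are positives) plus an MAE-style reconstruction loss.
import torch
import torch.nn.functional as F

def cav_mae_style_loss(audio_emb, video_emb, decoded_patches, target_patches,
                       temperature=0.07, lam=1.0):
    a = F.normalize(audio_emb, dim=-1)               # (batch, dim)
    v = F.normalize(video_emb, dim=-1)               # (batch, dim)
    logits = a @ v.t() / temperature                 # (batch, batch) similarity
    targets = torch.arange(a.size(0))
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets))
    reconstruction = F.mse_loss(decoded_patches, target_patches)
    return contrastive + lam * reconstruction

batch, dim, n_patches = 4, 256, 196
loss = cav_mae_style_loss(torch.randn(batch, dim), torch.randn(batch, dim),
                          torch.randn(batch, n_patches, dim),
                          torch.randn(batch, n_patches, dim))
print(loss.item())
```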
arXiv Detail & Related papers (2022-10-02T07:29:57Z)
- UniCon+: ICTCAS-UCAS Submission to the AVA-ActiveSpeaker Task at ActivityNet Challenge 2022 [69.67841335302576]
This report presents a brief description of our winning solution to the AVA Active Speaker Detection (ASD) task at ActivityNet Challenge 2022.
Our underlying model UniCon+ continues to build on our previous work, the Unified Context Network (UniCon) and Extended UniCon.
We augment the architecture with a simple GRU-based module that allows information of recurring identities to flow across scenes.
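An illustrative sketch of a GRU-based module that carries a per-identity state across scenes, in the spirit of the description above; the class, shapes, and names are assumptions, not the UniCon+ implementation.

```python
# Sketch only: a GRU cell whose hidden state is kept per identity, so
# evidence about a recurring face track can influence later scenes.
import torch
import torch.nn as nn

class IdentityMemory(nn.Module):
    def __init__(self, feat_dim=128, hidden_dim=128):
        super().__init__()
        self.gru = nn.GRUCell(feat_dim, hidden_dim)

    def forward(self, scene_features, hidden=None):
        """scene_features: list of (num_identities, feat_dim) tensors, one per scene."""
        outputs = []
        for feats in scene_features:
            if hidden is None:
                hidden = torch.zeros(feats.size(0), self.gru.hidden_size)
            hidden = self.gru(feats, hidden)    # update each identity's memory
            outputs.append(hidden)
        return outputs, hidden

memory = IdentityMemory()
scenes = [torch.randn(3, 128) for _ in range(4)]    # 3 recurring identities, 4 scenes
per_scene_states, final_state = memory(scenes)
print(final_state.shape)                            # torch.Size([3, 128])
```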
arXiv Detail & Related papers (2022-06-22T06:11:07Z) - Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT [37.343431783936126]
This paper investigates self-supervised pre-training for audio-visual speaker representation learning.
A visual stream showing the speaker's mouth area is used alongside the speech audio as input.
We conducted extensive experiments probing the effectiveness of pre-training and visual modality.
arXiv Detail & Related papers (2022-05-15T04:48:41Z) - MERLOT Reserve: Neural Script Knowledge through Vision and Language and
Sound [90.1857707251566]
We introduce MERLOT Reserve, a model that jointly represents videos over time through audio, subtitles, and video frames.
We replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet.
Our objective learns faster than alternatives, and performs well at scale.
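A minimal sketch of the "choose the correct masked-out snippet" objective described above; the scoring rule and shapes are assumptions for illustration, not the MERLOT Reserve code.

```python
# Sketch only: score the representation at a MASK position against candidate
# snippet embeddings and train to pick the true (masked-out) snippet.
import torch
import torch.nn.functional as F

def masked_snippet_loss(mask_repr, candidate_embs, true_index, temperature=0.05):
    """mask_repr: (dim,); candidate_embs: (num_candidates, dim)."""
    mask_repr = F.normalize(mask_repr, dim=-1)
    candidates = F.normalize(candidate_embs, dim=-1)
    logits = candidates @ mask_repr / temperature    # (num_candidates,)
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([true_index]))

loss = masked_snippet_loss(torch.randn(512), torch.randn(16, 512), true_index=3)
print(loss.item())
```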
arXiv Detail & Related papers (2022-01-07T19:00:21Z) - Robust Self-Supervised Audio-Visual Speech Recognition [29.526786921769613]
We present a self-supervised audio-visual speech recognition framework built upon Audio-Visual HuBERT (AV-HuBERT).
On LRS3, the largest available AVSR benchmark dataset, our approach outperforms the prior state of the art by 50% (28.0% vs. 14.1% WER) using less than 10% of the labeled data.
Our approach reduces the WER of an audio-based model by over 75% (25.8% vs. 5.8%) on average.
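The relative improvements quoted above are percentage reductions in WER; a quick arithmetic check:

```python
# Percentage reduction in WER, matching the figures quoted in the summary.
def relative_reduction(baseline_wer, new_wer):
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

print(round(relative_reduction(28.0, 14.1), 1))  # 49.6 -> "by 50%"
print(round(relative_reduction(25.8, 5.8), 1))   # 77.5 -> "over 75%"
```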
arXiv Detail & Related papers (2022-01-05T18:50:50Z) - Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio
Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT is capable of achieving competitive performance with those huge models in the downstream tasks.
In probing experiments, we find that the intermediate latent representations encode richer phoneme and speaker information than the last layer.
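A minimal probing sketch consistent with the kind of experiment described above, but not the authors' code: take the frozen hidden states of one layer and fit a small linear classifier to see how much speaker (or phoneme) information that layer carries. The features below are synthetic stand-ins.

```python
# Sketch only: linear probe on frozen layer representations.
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_layer(hidden_states, labels):
    """hidden_states: (num_frames, dim) from one layer; labels: (num_frames,)."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(hidden_states, labels)
    return clf.score(hidden_states, labels)    # probe accuracy on the same data

rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=200)                    # 4 synthetic speakers
latent = rng.normal(size=(200, 64)) + labels[:, None]    # layer where speaker info is present
last = rng.normal(size=(200, 64))                        # layer where it has been washed out
print(probe_layer(latent, labels), probe_layer(last, labels))
```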
arXiv Detail & Related papers (2020-05-18T10:42:44Z)