Getting More for Less: Using Weak Labels and AV-Mixup for Robust Audio-Visual Speaker Verification
- URL: http://arxiv.org/abs/2309.07115v2
- Date: Thu, 13 Jun 2024 13:08:24 GMT
- Title: Getting More for Less: Using Weak Labels and AV-Mixup for Robust Audio-Visual Speaker Verification
- Authors: Anith Selvakumar, Homa Fashandi
- Abstract summary: We show that an auxiliary task with even weak labels can increase the quality of the learned speaker representation.
We also extend the Generalized End-to-End Loss (GE2E) to multimodal inputs and demonstrate that it can achieve competitive performance in an audio-visual space.
Our network achieves state-of-the-art performance for speaker verification, reporting 0.244%, 0.252%, and 0.441% Equal Error Rate (EER) on the VoxCeleb1-O/E/H test sets.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distance Metric Learning (DML) has typically dominated the audio-visual speaker verification problem space, owing to strong performance in new and unseen classes. In our work, we explore multitask learning techniques to further enhance DML, and show that an auxiliary task with even weak labels can increase the quality of the learned speaker representation without increasing model complexity during inference. We also extend the Generalized End-to-End Loss (GE2E) to multimodal inputs and demonstrate that it can achieve competitive performance in an audio-visual space. Finally, we introduce AV-Mixup, a multimodal augmentation technique during training time that has been shown to reduce speaker overfitting. Our network achieves state-of-the-art performance for speaker verification, reporting 0.244%, 0.252%, and 0.441% Equal Error Rate (EER) on the VoxCeleb1-O/E/H test sets, which are, to our knowledge, the best published results on VoxCeleb1-E and VoxCeleb1-H.
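The abstract does not spell out the AV-Mixup recipe. A minimal sketch, assuming it follows standard mixup with a single Beta-sampled coefficient applied to both modalities of a paired example (all names here are illustrative, not the authors' code):

```python
import torch

def av_mixup(audio, faces, alpha=0.4):
    """Mixup applied jointly to paired audio/visual inputs (hypothetical
    sketch; the paper's exact formulation, e.g. how targets are handled
    under a metric-learning loss, may differ).

    audio: (B, ...) batch of audio features
    faces: (B, ...) batch of face features, paired with `audio` by index
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(audio.size(0))
    mixed_audio = lam * audio + (1 - lam) * audio[perm]
    mixed_faces = lam * faces + (1 - lam) * faces[perm]
    return mixed_audio, mixed_faces, perm, lam
```

Since the paper trains with GE2E rather than classification targets, whether mixing happens within or across speakers is a detail this sketch leaves open; `perm` and `lam` are returned so the caller can construct whatever target scheme the loss requires.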
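GE2E itself is precisely specified in Wan et al. (2018); below is a sketch of one plausible multimodal extension, assuming the per-utterance audio and visual embeddings are fused by simple concatenation before the loss is applied (the fusion strategy is an assumption, not the paper's stated method):

```python
import torch
import torch.nn.functional as F

class MultimodalGE2E(torch.nn.Module):
    """GE2E softmax loss over fused audio-visual embeddings.
    Inputs: (N speakers, M utterances, D) per modality."""

    def __init__(self):
        super().__init__()
        self.w = torch.nn.Parameter(torch.tensor(10.0))  # learned scale (kept > 0)
        self.b = torch.nn.Parameter(torch.tensor(-5.0))  # learned bias

    def forward(self, audio_emb, visual_emb):
        # Concatenation fusion is an assumption, not the paper's method.
        e = F.normalize(torch.cat([audio_emb, visual_emb], dim=-1), dim=-1)
        N, M, _ = e.shape
        centroids = F.normalize(e.mean(dim=1), dim=-1)      # (N, D')
        sim = torch.einsum('nmd,kd->nmk', e, centroids)     # cosine sim to all centroids
        # Per GE2E, exclude each embedding from its own speaker's centroid.
        own = (e.sum(dim=1, keepdim=True) - e) / (M - 1)    # leave-one-out centroids
        idx = torch.arange(N)
        sim[idx, :, idx] = F.cosine_similarity(e, own, dim=-1)
        logits = self.w.clamp(min=1e-6) * sim + self.b
        labels = idx.repeat_interleave(M)                   # true speaker per row
        return F.cross_entropy(logits.reshape(N * M, N), labels)
```

The batch layout (N speakers x M utterances) and the softmax variant of the loss follow the original GE2E paper; only the concatenation fusion is a guess.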
Related papers
- EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning [36.012107899738524]
We introduce EquiAV, a novel framework that leverages equivariance for audio-visual contrastive learning.
Our approach begins with extending equivariance to audio-visual learning, facilitated by a shared attention-based transformation predictor.
It enables the aggregation of features from diverse augmentations into a representative embedding, providing robust supervision.
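A minimal attention-pooling sketch of that aggregation step (EquiAV's transformation predictor is additionally conditioned on the augmentation parameters, which this sketch omits):

```python
import torch
import torch.nn.functional as F

class AttentionAggregator(torch.nn.Module):
    """Attention pooling over embeddings of K augmented views (illustrative)."""

    def __init__(self, dim):
        super().__init__()
        self.score = torch.nn.Linear(dim, 1)  # scalar relevance per view

    def forward(self, views):                 # views: (B, K, D)
        attn = F.softmax(self.score(views).squeeze(-1), dim=-1)  # (B, K)
        return (attn.unsqueeze(-1) * views).sum(dim=1)           # (B, D)
```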
arXiv Detail & Related papers (2024-03-14T15:44:19Z)
- DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification [55.306583814017046]
We present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification.
DASA generates diversified training samples in speaker embedding space with negligible extra computing cost.
The best result achieves a 14.6% relative reduction in EER on the CN-Celeb evaluation set.
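DASA's augmentation lives in embedding space. A heavily simplified sketch, assuming diagonal per-speaker covariance and a fixed strength (the actual method estimates semantic directions more carefully and scales the perturbation by sample difficulty):

```python
import torch

def semantic_augment(emb, labels, class_var, strength=0.2):
    """Perturb speaker embeddings along per-class variance directions.

    emb:       (B, D) speaker embeddings
    labels:    (B,)   integer speaker ids
    class_var: (C, D) running per-class feature variance (diagonal approx.)
    """
    noise = torch.randn_like(emb) * class_var[labels].sqrt()
    return emb + strength * noise  # new virtual sample, negligible extra compute
```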
arXiv Detail & Related papers (2023-10-18T17:07:05Z)
- MAViL: Masked Audio-Video Learners [68.61844803682145]
We present Masked Audio-Video learners (MAViL) to train audio-visual representations.
Pre-training with MAViL enables the model to perform well in audio-visual classification and retrieval tasks.
For the first time, a self-supervised audio-visual model outperforms models that use external supervision on these benchmarks.
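A generic sketch of the masking step common to masked audio-video learners (ratios, patching, and reconstruction targets are MAViL-specific details not given here):

```python
import torch

def random_mask(tokens, mask_ratio=0.75):
    """Keep a random subset of patch tokens; return kept tokens and the
    indices of masked ones (generic masked-autoencoding sketch).

    tokens: (B, N, D) audio or video patch embeddings
    """
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    order = torch.rand(B, N).argsort(dim=1)   # random permutation per example
    keep, masked = order[:, :n_keep], order[:, n_keep:]
    visible = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, masked
```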
arXiv Detail & Related papers (2022-12-15T18:59:59Z)
- SLICER: Learning universal audio representations using low-resource self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z)
- Late Audio-Visual Fusion for In-The-Wild Speaker Diarization [33.0046568984949]
We propose an audio-visual diarization model which combines audio-only and visual-centric sub-systems via late fusion.
For audio, we show that an attractor-based end-to-end system (EEND-EDA) performs remarkably well when trained with our proposed recipe of a simulated proxy dataset.
We also propose an improved version, EEND-EDA++, that uses attention in decoding and a speaker recognition loss during training to better handle the larger number of speakers.
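The summary only says the two sub-systems are combined via late fusion. One simple way to realize that, assuming both produce frame-level speaker posteriors over the same frames (the paper's fusion is likely more elaborate):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def late_fuse(audio_post, visual_post, w=0.5):
    """Weighted late fusion of (T, S) speaker posteriors from an
    audio-only and a visual-centric diarization sub-system (sketch).

    Speaker indices of the two systems are aligned first, by maximizing
    total co-activation (Hungarian assignment), then averaged."""
    overlap = audio_post.T @ visual_post       # (S, S) co-activation matrix
    _, cols = linear_sum_assignment(-overlap)  # negate to maximize overlap
    return w * audio_post + (1 - w) * visual_post[:, cols]
```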
arXiv Detail & Related papers (2022-11-02T17:20:42Z)
- Label-Efficient Self-Supervised Speaker Verification With Information Maximization and Contrastive Learning [0.0]
We explore self-supervised learning for speaker verification by learning representations directly from raw audio.
Our approach is based on recent information-maximization learning frameworks and an intensive data pre-processing step.
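Given the title, the information-maximization framework is plausibly of the VICReg family; a sketch of such a loss on two augmented views of the same utterance (an assumption about the paper's objective, not a statement of it):

```python
import torch
import torch.nn.functional as F

def info_max_loss(z1, z2, sim_w=25.0, var_w=25.0, cov_w=1.0):
    """VICReg-style objective: pull views together, keep per-dimension
    variance above a margin, decorrelate dimensions.

    z1, z2: (B, D) embeddings of two augmented views."""
    B, D = z1.shape
    inv = F.mse_loss(z1, z2)                       # invariance term

    def var_term(z):                               # variance term (anti-collapse)
        std = (z.var(dim=0) + 1e-4).sqrt()
        return F.relu(1.0 - std).mean()

    def cov_term(z):                               # covariance term (decorrelation)
        zc = z - z.mean(dim=0)
        cov = (zc.T @ zc) / (B - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return (off_diag ** 2).sum() / D

    return (sim_w * inv + var_w * (var_term(z1) + var_term(z2))
            + cov_w * (cov_term(z1) + cov_term(z2)))
```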
arXiv Detail & Related papers (2022-07-12T13:01:55Z)
- Best of Both Worlds: Multi-task Audio-Visual Automatic Speech Recognition and Active Speaker Detection [9.914246432182873]
In noisy conditions, automatic speech recognition can benefit from the addition of visual signals coming from a video of the speaker's face.
Active speaker detection involves selecting at each moment in time which of the visible faces corresponds to the audio.
Recent work has shown that we can solve both problems simultaneously by employing an attention mechanism over the competing video tracks of the speakers' faces.
This work closes the accuracy gap between such joint models and dedicated active speaker detection systems, presenting a single model that can be jointly trained with a multi-task loss.
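A minimal sketch of the attention mechanism described above: the audio queries the competing face tracks, and the attention weights double as an active-speaker prediction (dimensions and projections are illustrative):

```python
import torch
import torch.nn.functional as F

class FaceTrackAttention(torch.nn.Module):
    """Audio-queried soft attention over K competing face tracks."""

    def __init__(self, audio_dim, face_dim, dim=256):
        super().__init__()
        self.q = torch.nn.Linear(audio_dim, dim)   # audio -> query
        self.k = torch.nn.Linear(face_dim, dim)    # each face track -> key

    def forward(self, audio, faces):   # audio: (B, Da), faces: (B, K, Df)
        q = self.q(audio).unsqueeze(1)                                  # (B, 1, dim)
        k = self.k(faces)                                               # (B, K, dim)
        attn = F.softmax((q * k).sum(-1) / k.size(-1) ** 0.5, dim=-1)   # (B, K)
        visual = (attn.unsqueeze(-1) * faces).sum(dim=1)                # (B, Df)
        return visual, attn  # fused visual feature + per-track speaker weights
```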
arXiv Detail & Related papers (2022-05-10T23:03:19Z)
- MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound [90.1857707251566]
We introduce MERLOT Reserve, a model that jointly represents videos over time through audio, subtitles, and video frames.
We replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet.
Our objective learns faster than alternatives, and performs well at scale.
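The "choose the correct masked-out snippet" objective is contrastive in nature; a sketch, assuming each MASK position is scored against a pool of K candidate snippets (MERLOT Reserve's actual candidate pool and temperature are details from the paper not reproduced here):

```python
import torch
import torch.nn.functional as F

def masked_snippet_loss(pred, candidates, target_idx, temp=0.05):
    """Contrastive selection of the true masked-out snippet.

    pred:       (B, D)    model output at each MASK position
    candidates: (B, K, D) encodings of K candidate snippets
    target_idx: (B,)      index of the true snippet among the K"""
    pred = F.normalize(pred, dim=-1)
    cand = F.normalize(candidates, dim=-1)
    logits = torch.einsum('bd,bkd->bk', pred, cand) / temp
    return F.cross_entropy(logits, target_idx)
```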
arXiv Detail & Related papers (2022-01-07T19:00:21Z)
- UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training [72.004873454347]
Two methods are introduced for enhancing unsupervised speaker information extraction.
Experiment results on SUPERB benchmark show that the proposed system achieves state-of-the-art performance.
We scale up the training dataset to 94 thousand hours of public audio data and achieve further performance improvements.
arXiv Detail & Related papers (2021-10-12T05:43:30Z)
- A Multi-View Approach To Audio-Visual Speaker Verification [38.9710777250597]
In this study, we explore audio-visual approaches to speaker verification.
We report the lowest AV equal error rate (EER) of 0.7% on the VoxCeleb1 dataset.
This new approach achieves 28% EER on VoxCeleb1 in the challenging testing condition of cross-modal verification.
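EER, the metric quoted throughout this list, is the operating point where the false-accept and false-reject rates meet; a standard computation from trial scores (not paper-specific):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER from verification scores (higher = more likely same speaker)
    and binary labels (1 = same-speaker trial)."""
    order = np.argsort(scores)[::-1]                  # sweep threshold downward
    labels = np.asarray(labels, dtype=float)[order]
    fa = np.cumsum(1 - labels) / (1 - labels).sum()   # false-accept rate
    fr = 1.0 - np.cumsum(labels) / labels.sum()       # false-reject rate
    i = np.argmin(np.abs(fa - fr))
    return (fa[i] + fr[i]) / 2
```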
arXiv Detail & Related papers (2021-02-11T22:29:25Z)
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT achieves performance competitive with much larger models on downstream tasks.
In probing experiments, we find that intermediate latent representations encode richer phoneme and speaker information than the last layer.
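What makes the model "lite" is, as in ALBERT, sharing one set of transformer-layer parameters across all layers; a minimal sketch of that idea (hyperparameters are illustrative, not the paper's):

```python
import torch

class SharedLayerEncoder(torch.nn.Module):
    """One transformer layer reused N times (ALBERT-style parameter sharing)."""

    def __init__(self, dim=768, heads=12, n_layers=12):
        super().__init__()
        self.layer = torch.nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.n_layers = n_layers

    def forward(self, x):                # x: (B, T, dim)
        for _ in range(self.n_layers):   # same weights at every depth
            x = self.layer(x)
        return x
```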
arXiv Detail & Related papers (2020-05-18T10:42:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.