Label-Efficient Self-Supervised Speaker Verification With Information
Maximization and Contrastive Learning
- URL: http://arxiv.org/abs/2207.05506v1
- Date: Tue, 12 Jul 2022 13:01:55 GMT
- Title: Label-Efficient Self-Supervised Speaker Verification With Information
Maximization and Contrastive Learning
- Authors: Théo Lepage and Réda Dehak
- Abstract summary: We explore self-supervised learning for speaker verification by learning representations directly from raw audio.
Our approach is based on recent information maximization learning frameworks and an intensive data augmentation pre-processing step.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art speaker verification systems are inherently dependent on
some kind of human supervision as they are trained on massive amounts of
labeled data. However, manually annotating utterances is slow, expensive and
not scalable to the amount of data available today. In this study, we explore
self-supervised learning for speaker verification by learning representations
directly from raw audio. The objective is to produce robust speaker embeddings
that have small intra-speaker and large inter-speaker variance. Our approach is
based on recent information maximization learning frameworks and an intensive
data augmentation pre-processing step. We evaluate the ability of these methods
to work without contrastive samples before showing that they achieve better
performance when combined with a contrastive loss. Furthermore, we conduct
experiments showing that our method reaches results competitive with existing
techniques and outperforms a supervised baseline when fine-tuned with a small
portion of labeled data.
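As a rough illustration of the approach described in the abstract, the sketch below combines a VICReg-style information maximization loss (invariance, variance, and covariance terms) with a contrastive NT-Xent term. The loss weights, embedding dimensions, and exact formulation are assumptions for illustration, not the paper's reported configuration.

```python
# Minimal sketch: VICReg-style information maximization combined with a
# contrastive (NT-Xent) term. Coefficients and shapes are illustrative
# assumptions, not the paper's exact configuration.
import torch
import torch.nn.functional as F

def vicreg_loss(za, zb, sim_w=25.0, var_w=25.0, cov_w=1.0):
    n, d = za.shape
    # Invariance: embeddings of two augmented views of an utterance should match.
    sim = F.mse_loss(za, zb)
    # Variance: keep each dimension's std above 1 to avoid collapse.
    std_a = torch.sqrt(za.var(dim=0) + 1e-4)
    std_b = torch.sqrt(zb.var(dim=0) + 1e-4)
    var = torch.mean(F.relu(1.0 - std_a)) + torch.mean(F.relu(1.0 - std_b))
    # Covariance: decorrelate embedding dimensions (off-diagonal terms -> 0).
    za_c, zb_c = za - za.mean(0), zb - zb.mean(0)
    cov_a = (za_c.T @ za_c) / (n - 1)
    cov_b = (zb_c.T @ zb_c) / (n - 1)
    off_diag = lambda m: m - torch.diag(torch.diag(m))
    cov = off_diag(cov_a).pow(2).sum() / d + off_diag(cov_b).pow(2).sum() / d
    return sim_w * sim + var_w * var + cov_w * cov

def nt_xent_loss(za, zb, temperature=0.07):
    # Contrastive term: other utterances in the batch act as negatives.
    z = F.normalize(torch.cat([za, zb]), dim=1)
    logits = z @ z.T / temperature
    logits.fill_diagonal_(float("-inf"))  # exclude self-similarity
    n = za.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(logits, targets)

def combined_loss(za, zb, contrastive_w=1.0):
    return vicreg_loss(za, zb) + contrastive_w * nt_xent_loss(za, zb)
```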
Related papers
- Sequential Contrastive Audio-Visual Learning [12.848371604063168]
We propose sequential contrastive audio-visual learning (SCAV), which contrasts examples based on their non-aggregated representation space using sequential distances.
Retrieval experiments with the VGGSound and Music datasets demonstrate the effectiveness of SCAV.
We also show that models trained with SCAV exhibit a high degree of flexibility regarding the metric employed for retrieval, allowing them to operate on a spectrum of efficiency-accuracy trade-offs.
arXiv Detail & Related papers (2024-07-08T09:45:20Z)
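A minimal sketch of what a sequential distance over non-aggregated representations could look like; the mean framewise Euclidean distance used here is an assumed instantiation, not necessarily the one used in SCAV.

```python
# Sketch of a sequential distance for contrastive learning over
# non-aggregated (per-frame) representations. Mean framewise Euclidean
# distance is one plausible instantiation, assumed for illustration.
import torch

def sequential_distance(x, y):
    # x, y: (batch, frames, dim) sequences of frame-level embeddings.
    # Contrast sequences directly instead of mean-pooling them first.
    return (x - y).pow(2).sum(dim=-1).sqrt().mean(dim=-1)  # (batch,)
```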
- Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models [52.04189118767758]
Generalization is a main issue for current audio deepfake detectors.
In this paper we study the potential of large-scale pre-trained models for audio deepfake detection.
arXiv Detail & Related papers (2024-05-03T15:27:11Z)
- EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning [36.012107899738524]
We introduce EquiAV, a novel framework that leverages equivariance for audio-visual contrastive learning.
Our approach begins with extending equivariance to audio-visual learning, facilitated by a shared attention-based transformation predictor.
It enables the aggregation of features from diverse augmentations into a representative embedding, providing robust supervision.
arXiv Detail & Related papers (2024-03-14T15:44:19Z)
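A toy sketch of an equivariance objective in the spirit of the EquiAV summary: a predictor conditioned on the augmentation parameters maps a clean embedding to its augmented counterpart. The MLP predictor and cosine loss are assumptions standing in for the paper's shared attention-based transformation predictor.

```python
# Toy equivariance objective: a predictor conditioned on augmentation
# parameters maps a clean embedding to its augmented counterpart. An MLP
# stands in for the paper's shared attention-based transformation predictor.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformationPredictor(nn.Module):
    def __init__(self, emb_dim=512, aug_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim + aug_dim, emb_dim), nn.ReLU(),
            nn.Linear(emb_dim, emb_dim),
        )

    def forward(self, z_clean, aug_params):
        # Predict the augmented embedding from the clean embedding and an
        # encoding of the applied augmentation.
        return self.net(torch.cat([z_clean, aug_params], dim=-1))

def equivariance_loss(predictor, z_clean, z_aug, aug_params):
    z_pred = predictor(z_clean, aug_params)
    return 1 - F.cosine_similarity(z_pred, z_aug, dim=-1).mean()
```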
- One-Shot Learning as Instruction Data Prospector for Large Language Models [108.81681547472138]
Nuggets uses one-shot learning to select high-quality instruction data from extensive datasets.
We show that instruction tuning with the top 1% of examples curated by Nuggets substantially outperforms conventional methods employing the entire dataset.
arXiv Detail & Related papers (2023-12-16T03:33:12Z)
- DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification [55.306583814017046]
We present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification.
DASA generates diversified training samples in speaker embedding space with negligible extra computing cost.
The best result achieves a 14.6% relative reduction in EER on the CN-Celeb evaluation set.
arXiv Detail & Related papers (2023-10-18T17:07:05Z)
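A sketch of semantic augmentation in speaker-embedding space: perturb an embedding along estimated within-class covariance directions, scaled by a difficulty coefficient. The difficulty weighting below is a placeholder, not DASA's exact formulation.

```python
# Sketch of difficulty-aware semantic augmentation in embedding space.
# The difficulty weighting is a placeholder assumption.
import torch

def semantic_augment(emb, class_cov, difficulty, scale=0.5):
    # emb: (batch, dim); class_cov: (batch, dim, dim) per-speaker covariance;
    # difficulty: (batch,) in [0, 1], e.g. derived from the training loss.
    noise = torch.randn_like(emb).unsqueeze(-1)   # (batch, dim, 1)
    directed = (class_cov @ noise).squeeze(-1)    # follow per-class statistics
    return emb + scale * difficulty.unsqueeze(-1) * directed
```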
- Self-supervised Speaker Diarization [19.111219197011355]
This study proposes an entirely unsupervised deep-learning model for speaker diarization.
Speaker embeddings are represented by an encoder trained in a self-supervised fashion using pairs of adjacent segments assumed to be of the same speaker.
arXiv Detail & Related papers (2022-04-08T16:27:14Z)
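A minimal sketch of the pairing heuristic described above: adjacent segments of a recording are assumed to come from the same speaker and used as positive pairs. Segment length and hop are illustrative values.

```python
# Sketch: adjacent segments are assumed to share a speaker and serve as
# positive pairs for self-supervised training. Lengths are illustrative.
import torch

def adjacent_segment_pairs(waveform, seg_len=24000, hop=24000):
    # waveform: (samples,) mono audio. Yields (segment_i, segment_{i+1}) pairs.
    segments = waveform.unfold(0, seg_len, hop)   # (num_segments, seg_len)
    return list(zip(segments[:-1], segments[1:]))
```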
- UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training [72.004873454347]
Two methods are introduced to enhance unsupervised speaker information extraction.
Experiment results on SUPERB benchmark show that the proposed system achieves state-of-the-art performance.
We scale up the training dataset to 94 thousand hours of public audio data and achieve further performance improvements.
arXiv Detail & Related papers (2021-10-12T05:43:30Z)
- Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z)
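A sketch of the standard momentum-contrastive machinery the entry above builds on: the key encoder is an exponential moving average of the query encoder, and a queue of past keys supplies negatives. The momentum value and queue size are typical defaults, assumed here.

```python
# Sketch of momentum contrastive learning (MoCo-style): an EMA key encoder
# plus a FIFO queue of negatives. Hyperparameters are typical assumed values.
import torch

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.999):
    # Key-encoder parameters trail the query encoder smoothly.
    for q, k in zip(query_encoder.parameters(), key_encoder.parameters()):
        k.data.mul_(m).add_(q.data, alpha=1 - m)

@torch.no_grad()
def enqueue(queue, keys, max_size=65536):
    # Maintain a FIFO queue of key embeddings used as negatives.
    queue = torch.cat([queue, keys])
    return queue[-max_size:]
```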
- Recognizing More Emotions with Less Data Using Self-supervised Transfer Learning [0.0]
We propose a novel transfer learning method for speech emotion recognition.
With as low as 125 examples per emotion class, we were able to reach a higher accuracy than a strong baseline trained on 8 times more data.
arXiv Detail & Related papers (2020-11-11T06:18:31Z)
- Augmentation adversarial training for self-supervised speaker recognition [49.47756927090593]
We train robust speaker recognition models without speaker labels.
Experiments on VoxCeleb and VOiCES datasets show significant improvements over previous works using self-supervision.
arXiv Detail & Related papers (2020-07-23T15:49:52Z)
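One plausible reading of "augmentation adversarial training" is a gradient reversal layer in front of an auxiliary augmentation classifier, pushing the encoder to discard channel and augmentation information; the sketch below illustrates that mechanism under this assumption.

```python
# Sketch of augmentation-adversarial training via gradient reversal: an
# auxiliary head classifies which augmentation was applied, while reversed
# gradients push the encoder to discard that information. This is one
# plausible reading of the technique, assumed for illustration.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the gradient's sign so the encoder is trained adversarially.
        return -ctx.lam * grad_output, None

def adversarial_aug_logits(aug_classifier, embeddings, lam=1.0):
    return aug_classifier(GradReverse.apply(embeddings, lam))
```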
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.