Self-Supervised Speaker Verification with Simple Siamese Network and Self-Supervised Regularization
- URL: http://arxiv.org/abs/2112.04459v1
- Date: Wed, 8 Dec 2021 18:41:19 GMT
- Title: Self-Supervised Speaker Verification with Simple Siamese Network and Self-Supervised Regularization
- Authors: Mufan Sang, Haoqi Li, Fang Liu, Andrew O. Arnold, Li Wan
- Abstract summary: We propose an effective self-supervised learning framework and a novel regularization strategy to facilitate self-supervised speaker representation learning.
With our strong online data augmentation strategy, the proposed SSReg shows the potential of self-supervised learning without using negative pairs.
Comprehensive experiments on the VoxCeleb datasets demonstrate that our proposed self-supervised approach obtains a 23.4% relative improvement.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training speaker-discriminative and robust speaker verification systems
without speaker labels is still challenging and worthwhile to explore. In this
study, we propose an effective self-supervised learning framework and a novel
regularization strategy to facilitate self-supervised speaker representation
learning. Different from contrastive learning-based self-supervised learning
methods, the proposed self-supervised regularization (SSReg) focuses
exclusively on the similarity between the latent representations of positive
data pairs. We also explore the effectiveness of alternative online data
augmentation strategies in both the time and frequency domains. With our
strong online data augmentation strategy, the proposed SSReg shows the
potential of self-supervised learning without using negative pairs and it can
significantly improve the performance of self-supervised speaker representation
learning with a simple Siamese network architecture. Comprehensive experiments
on the VoxCeleb datasets demonstrate that our proposed self-supervised approach
obtains a 23.4% relative improvement by adding the effective self-supervised
regularization and outperforms previous works.
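The abstract's core idea, training on the similarity of positive pairs only with a simple Siamese network and strong online augmentation, can be illustrated with a minimal sketch. This is an assumption-laden illustration rather than the authors' implementation: the encoder, the predictor sizes, and the toy augmentation (additive noise plus a frequency-band mask) are placeholders standing in for the paper's actual backbone and augmentation pipeline.

```python
# Minimal sketch (not the authors' code) of a SimSiam-style positive-pair
# objective for speaker embeddings. All module sizes and the augmentation
# recipe are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseSpeakerModel(nn.Module):
    def __init__(self, feat_dim=80, emb_dim=256):
        super().__init__()
        # Placeholder encoder: frame-level features -> utterance embedding.
        self.encoder = nn.Sequential(nn.Linear(feat_dim, emb_dim), nn.ReLU(),
                                     nn.Linear(emb_dim, emb_dim))
        # Prediction head applied to one branch only.
        self.predictor = nn.Sequential(nn.Linear(emb_dim, emb_dim // 2), nn.ReLU(),
                                       nn.Linear(emb_dim // 2, emb_dim))

    def embed(self, x):                      # x: (batch, time, feat_dim)
        return self.encoder(x).mean(dim=1)   # temporal average pooling

def positive_pair_loss(model, view1, view2):
    """Symmetric negative cosine similarity between two augmented views of the
    same utterance; no negative pairs, stop-gradient on the target branch."""
    z1, z2 = model.embed(view1), model.embed(view2)
    p1, p2 = model.predictor(z1), model.predictor(z2)
    return -0.5 * (F.cosine_similarity(p1, z2.detach(), dim=-1).mean()
                   + F.cosine_similarity(p2, z1.detach(), dim=-1).mean())

def augment(x):
    """Toy online augmentation: additive noise plus a random frequency-band
    mask, standing in for the paper's stronger pipeline."""
    x = x + 0.05 * torch.randn_like(x)
    f0 = torch.randint(0, x.size(-1) - 10, (1,)).item()
    x[..., f0:f0 + 10] = 0.0
    return x

if __name__ == "__main__":
    model = SiameseSpeakerModel()
    utterance = torch.randn(8, 200, 80)   # fake log-mel feature batch
    loss = positive_pair_loss(model, augment(utterance), augment(utterance))
    loss.backward()
    print(float(loss))
```

The stop-gradient (detach) on the target branch is the standard SimSiam-style mechanism that lets a positive-pair-only objective avoid collapsing to a constant embedding.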
Related papers
- A Probabilistic Model Behind Self-Supervised Learning [53.64989127914936]
In self-supervised learning (SSL), representations are learned via an auxiliary task without annotated labels.
We present a generative latent variable model for self-supervised learning.
We show that several families of discriminative SSL, including contrastive methods, induce a comparable distribution over representations.
arXiv Detail & Related papers (2024-02-02T13:31:17Z)
- Semi-supervised learning made simple with self-supervised clustering [65.98152950607707]
Self-supervised learning models have been shown to learn rich visual representations without requiring human annotations.
We propose a conceptually simple yet empirically powerful approach to turn clustering-based self-supervised methods into semi-supervised learners.
arXiv Detail & Related papers (2023-06-13T01:09:18Z)
- Bootstrap Equilibrium and Probabilistic Speaker Representation Learning for Self-supervised Speaker Verification [15.652180150706002]
We propose self-supervised speaker representation learning strategies.
In the front-end, we learn the speaker representations via the bootstrap training scheme with the uniformity regularization term.
In the back-end, the probabilistic speaker embeddings are estimated by maximizing the mutual likelihood score between the speech samples belonging to the same speaker.
arXiv Detail & Related papers (2021-12-16T14:55:44Z)
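The uniformity regularization term named in the entry above is commonly written as the log of the mean pairwise Gaussian potential over normalized embeddings. The sketch below shows that generic form, not the cited paper's code; the temperature and the source of the embeddings are illustrative assumptions.

```python
# Minimal sketch of a generic uniformity regularization term for embeddings
# (log-mean-Gaussian-potential form); hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def uniformity_loss(z: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    """Encourages L2-normalized embeddings to spread uniformly on the
    hypersphere: log of the mean pairwise Gaussian potential."""
    z = F.normalize(z, dim=1)
    sq_dists = torch.pdist(z, p=2).pow(2)   # pairwise squared distances
    return sq_dists.mul(-t).exp().mean().log()

emb = torch.randn(32, 256, requires_grad=True)   # fake speaker embeddings
print(uniformity_loss(emb))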
- Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z)
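For the momentum contrastive learning named in the entry above, the sketch below shows the generic recipe (a momentum-updated key encoder, a queue of negative keys, and an InfoNCE loss) rather than the cited paper's implementation; the linear encoder, queue size, temperature, and momentum value are illustrative assumptions.

```python
# Minimal sketch (not the cited paper's code) of momentum contrastive learning
# for speaker embeddings: EMA key encoder + negative queue + InfoNCE loss.
import copy
import torch
import torch.nn.functional as F

emb_dim, queue_size, momentum, tau = 128, 1024, 0.999, 0.07

encoder_q = torch.nn.Sequential(torch.nn.Linear(80, emb_dim))   # query branch
encoder_k = copy.deepcopy(encoder_q)                             # key branch
for p in encoder_k.parameters():
    p.requires_grad = False                                      # updated by EMA only

queue = F.normalize(torch.randn(queue_size, emb_dim), dim=1)     # negative keys

def momentum_update():
    """Exponential moving average of query-encoder weights into key encoder."""
    with torch.no_grad():
        for pq, pk in zip(encoder_q.parameters(), encoder_k.parameters()):
            pk.mul_(momentum).add_(pq, alpha=1.0 - momentum)

def info_nce_step(view_q, view_k):
    """One InfoNCE step: the positive is the other augmented view of the same
    utterance, the negatives come from the queue."""
    global queue
    q = F.normalize(encoder_q(view_q).mean(dim=1), dim=1)         # (B, emb)
    with torch.no_grad():
        momentum_update()
        k = F.normalize(encoder_k(view_k).mean(dim=1), dim=1)     # (B, emb)
    l_pos = (q * k).sum(dim=1, keepdim=True)                      # (B, 1)
    l_neg = q @ queue.t()                                         # (B, queue_size)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long)             # positive at index 0
    loss = F.cross_entropy(logits, labels)
    queue = torch.cat([k, queue], dim=0)[:queue_size]             # enqueue newest keys
    return loss

views = torch.randn(2, 16, 200, 80)   # two augmented views of a fake batch
loss = info_nce_step(views[0], views[1])
loss.backward()
```

Gradients flow only through the query branch; the slowly moving key encoder and the queue provide a large, consistent pool of negatives at little extra compute.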
- Self-supervised Co-training for Video Representation Learning [103.69904379356413]
We investigate the benefit of adding semantic-class positives to instance-based Info Noise Contrastive Estimation training.
We propose a novel self-supervised co-training scheme to improve the popular infoNCE loss.
We evaluate the quality of the learnt representation on two different downstream tasks: action recognition and video retrieval.
arXiv Detail & Related papers (2020-10-19T17:59:01Z)
- CLAR: Contrastive Learning of Auditory Representations [6.1424670675582576]
We introduce various data augmentations suitable for auditory data and evaluate their impact on predictive performance.
We show that training with time-frequency audio features substantially improves the quality of the learned representations.
We illustrate that by combining all these methods and with substantially less labeled data, our framework (CLAR) achieves significant improvement on prediction performance.
arXiv Detail & Related papers (2020-10-19T14:15:31Z)
- Augmentation adversarial training for self-supervised speaker recognition [49.47756927090593]
We train robust speaker recognition models without speaker labels.
Experiments on VoxCeleb and VOiCES datasets show significant improvements over previous works using self-supervision.
arXiv Detail & Related papers (2020-07-23T15:49:52Z)
- Improving out-of-distribution generalization via multi-task self-supervised pretraining [48.29123326140466]
We show that features obtained using self-supervised learning are comparable to, or better than, supervised learning for domain generalization in computer vision.
We introduce a new self-supervised pretext task of predicting responses to Gabor filter banks.
arXiv Detail & Related papers (2020-03-30T14:55:53Z)