Self-Supervised Frameworks for Speaker Verification via Bootstrapped Positive Sampling
- URL: http://arxiv.org/abs/2501.17772v1
- Date: Wed, 29 Jan 2025 17:08:01 GMT
- Title: Self-Supervised Frameworks for Speaker Verification via Bootstrapped Positive Sampling
- Authors: Theo Lepage, Reda Dehak,
- Abstract summary: This article introduces Self-Supervised Positive Sampling (SSPS), a bootstrapped technique for sampling appropriate and diverse positives in Self-Supervised Learning (SSL) frameworks for Speaker Verification (SV)
SSPS achieves consistent improvements in SV performance on VoxCeleb benchmarks when implemented in major SSL frameworks, such as SimCLR, SwAV, VICReg, and DINO.
- Score: 0.0
- License:
- Abstract: Recent developments in Self-Supervised Learning (SSL) have demonstrated significant potential for Speaker Verification (SV), but closing the performance gap with supervised systems remains an ongoing challenge. Standard SSL frameworks rely on anchor-positive pairs extracted from the same audio utterances. Hence, positives have channel characteristics similar to those of their corresponding anchors, even with extensive data-augmentation. Therefore, this positive sampling strategy is a fundamental limitation as it encodes too much information regarding the recording source in the learned representations. This article introduces Self-Supervised Positive Sampling (SSPS), a bootstrapped technique for sampling appropriate and diverse positives in SSL frameworks for SV. SSPS samples positives close to their anchor in the representation space, as we assume that these pseudo-positives belong to the same speaker identity but correspond to different recording conditions. This method demonstrates consistent improvements in SV performance on VoxCeleb benchmarks when implemented in major SSL frameworks, such as SimCLR, SwAV, VICReg, and DINO. Using SSPS, SimCLR, and DINO achieve 2.57% and 2.53% EER on VoxCeleb1-O. SimCLR yields a 58% relative reduction in EER, getting comparable performance to DINO with a simpler training framework. Furthermore, SSPS lowers intra-class variance and reduces channel information in speaker representations while exhibiting greater robustness without data-augmentation.
Related papers
- Clustering Properties of Self-Supervised Learning [14.756786256090704]
Self-supervised learning (SSL) methods have proven remarkably effective at capturing semantically rich representations with strong clustering properties.
We propose a novel positive-feedback SSL method, termed Representation Soft Assignment (ReSA), which leverages the model's clustering properties to promote learning in a self-guided manner.
arXiv Detail & Related papers (2025-01-30T16:05:35Z) - Contrastive Learning with Synthetic Positives [11.932323457691945]
Contrastive learning with the nearest neighbor has proved to be one of the most efficient self-supervised learning (SSL) techniques.
In this paper, we introduce a novel approach called Contrastive Learning with Synthetic Positives (NCLP)
NCLP utilizes synthetic images, generated by an unconditional diffusion model, as the additional positives to help the model learn from diverse positives.
arXiv Detail & Related papers (2024-08-30T01:47:43Z) - Towards Supervised Performance on Speaker Verification with Self-Supervised Learning by Leveraging Large-Scale ASR Models [0.0]
Speech representations from large-scale ASR models contain valuable speaker information.
We propose a framework to learn speaker representations in an SSL context by fine-tuning a pre-trained WavLM with a supervised loss.
Our method achieves 0.99% EER on VoxCeleb1-O, establishing the new state-of-the-art on self-supervised SV.
arXiv Detail & Related papers (2024-06-04T12:58:19Z) - Additive Margin in Contrastive Self-Supervised Frameworks to Learn Discriminative Speaker Representations [0.0]
We discuss the importance of Additive Margin (AM) in SimCLR and MoCo SSL methods to further separate positive from negative pairs.
Implementing these two modifications to SimCLR improves performance and results in 7.85% EER on VoxCeleb1-O, outperforming other equivalent methods.
arXiv Detail & Related papers (2024-04-23T10:56:58Z) - Improving Self-Supervised Learning by Characterizing Idealized
Representations [155.1457170539049]
We prove necessary and sufficient conditions for any task invariant to given data augmentations.
For contrastive learning, our framework prescribes simple but significant improvements to previous methods.
For non-contrastive learning, we use our framework to derive a simple and novel objective.
arXiv Detail & Related papers (2022-09-13T18:01:03Z) - Learning Self-Supervised Low-Rank Network for Single-Stage Weakly and
Semi-Supervised Semantic Segmentation [119.009033745244]
This paper presents a Self-supervised Low-Rank Network ( SLRNet) for single-stage weakly supervised semantic segmentation (WSSS) and semi-supervised semantic segmentation (SSSS)
SLRNet uses cross-view self-supervision, that is, it simultaneously predicts several attentive LR representations from different views of an image to learn precise pseudo-labels.
Experiments on the Pascal VOC 2012, COCO, and L2ID datasets demonstrate that our SLRNet outperforms both state-of-the-art WSSS and SSSS methods with a variety of different settings.
arXiv Detail & Related papers (2022-03-19T09:19:55Z) - Trash to Treasure: Harvesting OOD Data with Cross-Modal Matching for
Open-Set Semi-Supervised Learning [101.28281124670647]
Open-set semi-supervised learning (open-set SSL) investigates a challenging but practical scenario where out-of-distribution (OOD) samples are contained in the unlabeled data.
We propose a novel training mechanism that could effectively exploit the presence of OOD data for enhanced feature learning.
Our approach substantially lifts the performance on open-set SSL and outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2021-08-12T09:14:44Z) - Inter-class Discrepancy Alignment for Face Recognition [55.578063356210144]
We propose a unified framework calledInter-class DiscrepancyAlignment(IDA)
IDA-DAO is used to align the similarity scores considering the discrepancy between the images and its neighbors.
IDA-SSE can provide convincing inter-class neighbors by introducing virtual candidate images generated with GAN.
arXiv Detail & Related papers (2021-03-02T08:20:08Z) - Towards Overcoming False Positives in Visual Relationship Detection [95.15011997876606]
We investigate the cause of the high false positive rate in Visual Relationship Detection (VRD)
This paper presents Spatially-Aware Balanced negative pRoposal sAmpling (SABRA) as a robust VRD framework that alleviates the influence of false positives.
arXiv Detail & Related papers (2020-12-23T06:28:00Z) - Improving Stability of LS-GANs for Audio and Speech Signals [70.15099665710336]
We show that encoding departure from normality computed in this vector space into the generator optimization formulation helps to craft more comprehensive spectrograms.
We demonstrate the effectiveness of binding this metric for enhancing stability in training with less mode collapse compared to baseline GANs.
arXiv Detail & Related papers (2020-08-12T17:41:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.