SSPS: Self-Supervised Positive Sampling for Robust Self-Supervised Speaker Verification
- URL: http://arxiv.org/abs/2505.14561v2
- Date: Tue, 24 Jun 2025 09:06:50 GMT
- Title: SSPS: Self-Supervised Positive Sampling for Robust Self-Supervised Speaker Verification
- Authors: Theo Lepage, Reda Dehak
- Abstract summary: Self-Supervised Positive Sampling (SSPS) is a new positive sampling technique for Speaker Verification. SSPS improves SV performance for both SimCLR and DINO, reaching 2.57% and 2.53% EER. In particular, SimCLR-SSPS achieves a 58% EER reduction by lowering intra-speaker variance, providing comparable performance to DINO-SSPS.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-Supervised Learning (SSL) has led to considerable progress in Speaker Verification (SV). The standard framework uses same-utterance positive sampling and data-augmentation to generate anchor-positive pairs of the same speaker. This is a major limitation, as this strategy primarily encodes channel information from the recording condition, shared by the anchor and positive. We propose a new positive sampling technique to address this bottleneck: Self-Supervised Positive Sampling (SSPS). For a given anchor, SSPS aims to find an appropriate positive, i.e., of the same speaker identity but a different recording condition, in the latent space using clustering assignments and a memory queue of positive embeddings. SSPS improves SV performance for both SimCLR and DINO, reaching 2.57% and 2.53% EER, outperforming SOTA SSL methods on VoxCeleb1-O. In particular, SimCLR-SSPS achieves a 58% EER reduction by lowering intra-speaker variance, providing comparable performance to DINO-SSPS.
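To make the sampling procedure concrete, below is a minimal PyTorch-style sketch of the idea described in the abstract: positives are drawn from a memory of past embeddings that share the anchor's cluster assignment (a pseudo speaker label) but come from a different utterance. All names here (`SSPSQueue`, `sample_positive`, the queue size) are illustrative assumptions, not the authors' implementation.

```python
import torch


class SSPSQueue:
    """Illustrative memory of past training embeddings, grouped by
    cluster assignment (clusters assumed to act as pseudo speaker labels)."""

    def __init__(self, max_per_cluster: int = 64):
        self.max_per_cluster = max_per_cluster
        self.bank: dict[int, list[tuple[int, torch.Tensor]]] = {}

    def push(self, cluster: int, utt_id: int, emb: torch.Tensor) -> None:
        items = self.bank.setdefault(cluster, [])
        items.append((utt_id, emb.detach()))
        if len(items) > self.max_per_cluster:
            items.pop(0)  # FIFO eviction, as in a fixed-size memory queue

    def sample_positive(self, cluster: int, anchor_utt: int):
        # Prefer an embedding with the same cluster assignment (assumed
        # same speaker) but a *different* utterance id, i.e. a different
        # recording condition than the anchor.
        candidates = [e for (u, e) in self.bank.get(cluster, []) if u != anchor_utt]
        if not candidates:
            return None  # caller falls back to same-utterance augmentation
        idx = torch.randint(len(candidates), (1,)).item()
        return candidates[idx]
```

The key property is that the sampled positive no longer shares the anchor's recording channel, which is what the abstract credits for the reduced intra-speaker variance.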
Related papers
- Self-Supervised Frameworks for Speaker Verification via Bootstrapped Positive Sampling
Self-Supervised Positive Sampling (SSPS) is a bootstrapped technique for sampling appropriate and diverse positives in SSL frameworks for Speaker Verification (SV). SSPS achieves consistent improvements in SV performance on VoxCeleb benchmarks when implemented in major SSL frameworks, such as SimCLR, SwAV, VICReg, and DINO. SSPS lowers intra-class variance and reduces channel information in speaker representations while exhibiting greater robustness without data-augmentation.
arXiv Detail & Related papers (2025-01-29T17:08:01Z)
- CA-SSLR: Condition-Aware Self-Supervised Learning Representation for Generalized Speech Processing
We introduce Condition-Aware Self-Supervised Learning Representation (CA-SSLR). CA-SSLR improves the model's capabilities and demonstrates its generality on unseen tasks. Experiments show that CA-SSLR reduces the number of trainable parameters, mitigates overfitting, and excels in under-resourced and unseen tasks.
arXiv Detail & Related papers (2024-12-05T18:51:10Z)
- Contrastive Learning with Synthetic Positives
Contrastive Learning with Synthetic Positives (PNCL) is a novel approach to nearest-neighbor-based contrastive learning. We use synthetic images, generated by an unconditional diffusion model, as additional positives to help the model learn from diverse positives. These images are considered "hard" positives for the anchor image, and when included as supplementary positives in the contrastive loss, they contribute to performance improvements of over 2% and 1% in linear evaluation.
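As a rough illustration of how a generated sample can enter the loss, the sketch below adds a second attraction term for a synthetic positive alongside the standard augmented view. The function name, temperature, and equal weighting are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def info_nce_with_synthetic(anchor, aug_pos, syn_pos, negatives, tau=0.1):
    """InfoNCE-style loss in which a diffusion-generated embedding
    (syn_pos) supplements the usual augmented view (aug_pos).
    anchor, aug_pos, syn_pos: (B, D) L2-normalised; negatives: (N, D)."""
    def loss_for(pos):
        l_pos = (anchor * pos).sum(dim=1, keepdim=True)  # (B, 1)
        l_neg = anchor @ negatives.t()                   # (B, N)
        logits = torch.cat([l_pos, l_neg], dim=1) / tau
        target = torch.zeros(anchor.size(0), dtype=torch.long,
                             device=anchor.device)       # positive at index 0
        return F.cross_entropy(logits, target)

    # The "hard" synthetic positive contributes its own attraction term.
    return 0.5 * (loss_for(aug_pos) + loss_for(syn_pos))
```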
arXiv Detail & Related papers (2024-08-30T01:47:43Z)
- Additive Margin in Contrastive Self-Supervised Frameworks to Learn Discriminative Speaker Representations
We discuss the importance of Additive Margin (AM) in SimCLR and MoCo SSL methods to further separate positive from negative pairs.
Implementing these modifications in SimCLR improves performance and yields 7.85% EER on VoxCeleb1-O, outperforming other equivalent methods.
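For intuition, here is a minimal sketch of an NT-Xent loss with an additive margin on the positive pair; the margin and temperature values are illustrative and the batch construction is simplified relative to the paper.

```python
import torch
import torch.nn.functional as F


def am_nt_xent(z_i, z_j, margin=0.1, tau=0.05):
    """Contrastive loss where the positive (diagonal) cosine similarity
    is reduced by an additive margin before the softmax, forcing
    positives to beat negatives by at least `margin`.
    z_i, z_j: L2-normalised embeddings of two views, shape (B, D)."""
    b = z_i.size(0)
    cos = z_i @ z_j.t()                                   # (B, B) similarities
    cos = cos - margin * torch.eye(b, device=cos.device)  # penalise positives only
    target = torch.arange(b, device=cos.device)
    return F.cross_entropy(cos / tau, target)
```

Because the margin shrinks only the positive similarity, the model must push anchor-positive pairs well past the hardest negative to drive the loss down, which is what sharpens the decision boundary between speakers.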
arXiv Detail & Related papers (2024-04-23T10:56:58Z)
- Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
We explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z)
- Learning Self-Supervised Low-Rank Network for Single-Stage Weakly and Semi-Supervised Semantic Segmentation
This paper presents a Self-supervised Low-Rank Network (SLRNet) for single-stage weakly supervised semantic segmentation (WSSS) and semi-supervised semantic segmentation (SSSS).
SLRNet uses cross-view self-supervision, that is, it simultaneously predicts several attentive LR representations from different views of an image to learn precise pseudo-labels.
Experiments on the Pascal VOC 2012, COCO, and L2ID datasets demonstrate that our SLRNet outperforms both state-of-the-art WSSS and SSSS methods with a variety of different settings.
arXiv Detail & Related papers (2022-03-19T09:19:55Z)
- Trash to Treasure: Harvesting OOD Data with Cross-Modal Matching for Open-Set Semi-Supervised Learning
Open-set semi-supervised learning (open-set SSL) investigates a challenging but practical scenario where out-of-distribution (OOD) samples are contained in the unlabeled data.
We propose a novel training mechanism that could effectively exploit the presence of OOD data for enhanced feature learning.
Our approach substantially lifts the performance on open-set SSL and outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2021-08-12T09:14:44Z)
- PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech Recognition
Prune-Adjust-Re-Prune (PARP) discovers and finetunes subnetworks for much better ASR performance.
Experiments on low-resource English and multi-lingual ASR show that sparse subnetworks exist in pre-trained speech SSL models.
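A schematic of the prune-adjust-re-prune loop follows, with magnitude pruning and the training step elided; everything here is a sketch of the recipe named in the title rather than the released implementation.

```python
import torch


def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the `sparsity` fraction of smallest-magnitude weights."""
    k = max(1, int(weight.numel() * sparsity))
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()


def parp(model, num_steps, train_step, sparsity=0.5, reprune_every=1000):
    """Prune-Adjust-Re-Prune: prune once, then finetune normally so that
    pruned weights may recover ("adjust"), re-pruning every few steps.
    `train_step` is a placeholder for one forward/backward/optimizer step."""
    for step in range(num_steps):
        if step % reprune_every == 0:
            with torch.no_grad():
                for _, param in model.named_parameters():
                    param.mul_(magnitude_mask(param, sparsity))  # (re-)prune
        # Gradients flow to *all* weights between re-prunes, so a weight
        # zeroed at the last pruning round can grow back if the target
        # task needs it -- this is the "adjust" phase.
        train_step(model)
```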
arXiv Detail & Related papers (2021-06-10T17:32:25Z)
- Towards Overcoming False Positives in Visual Relationship Detection
We investigate the cause of the high false positive rate in Visual Relationship Detection (VRD).
This paper presents Spatially-Aware Balanced negative pRoposal sAmpling (SABRA) as a robust VRD framework that alleviates the influence of false positives.
arXiv Detail & Related papers (2020-12-23T06:28:00Z)
- Improving Stability of LS-GANs for Audio and Speech Signals
We show that encoding departure from normality, computed in a learned latent vector space, into the generator optimization formulation helps to craft more comprehensive spectrograms.
We demonstrate that incorporating this metric enhances training stability, with less mode collapse compared to baseline GANs.
arXiv Detail & Related papers (2020-08-12T17:41:25Z)