Related papers: Eta-WavLM: Efficient Speaker Identity Removal in Self-Supervised Speech Representations Using a Simple Linear Equation

URL: http://arxiv.org/abs/2505.19273v1
Date: Sun, 25 May 2025 19:05:26 GMT
Title: Eta-WavLM: Efficient Speaker Identity Removal in Self-Supervised Speech Representations Using a Simple Linear Equation
Authors: Giuseppe Ruggiero, Matteo Testa, Jurgen Van de Walle, Luigi Di Caro
Abstract summary: Self-supervised learning (SSL) has reduced the reliance on expensive labeling in speech technologies by learning meaningful representations from unannotated data. We propose a novel disentanglement method that linearly decomposes SSL representations into speaker-specific and speaker-independent components.
Score: 1.3874486202578669
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Self-supervised learning (SSL) has reduced the reliance on expensive labeling in speech technologies by learning meaningful representations from unannotated data. Since most SSL-based downstream tasks prioritize content information in speech, ideal representations should disentangle content from unwanted variations like speaker characteristics in the SSL representations. However, removing speaker information often degrades other speech components, and existing methods either fail to fully disentangle speaker identity or require resource-intensive models. In this paper, we propose a novel disentanglement method that linearly decomposes SSL representations into speaker-specific and speaker-independent components, effectively generating speaker disentangled representations. Comprehensive experiments show that our approach achieves speaker independence and as such, when applied to content-driven tasks such as voice conversion, our representations yield significant improvements over state-of-the-art methods.
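Illustration: the abstract does not state the equation it uses, so the following is only a rough sketch of what a linear decomposition of this kind could look like, not the authors' method. It assumes each frame-level representation is modeled as a linear function of a speaker embedding plus a bias, fitted by ordinary least squares, with the residual taken as the speaker-independent part. The function name, array shapes, and synthetic data are all hypothetical.

```python
import numpy as np

def linear_speaker_decomposition(reps, spk_emb):
    """Split reps (N x Q) into a part linearly predictable from
    spk_emb (N x P) and a residual. Assumed model, not the paper's."""
    # Append a column of ones so the fit includes a bias term.
    design = np.hstack([spk_emb, np.ones((spk_emb.shape[0], 1))])
    # Ordinary least squares: design @ weights approximates reps.
    weights, *_ = np.linalg.lstsq(design, reps, rcond=None)
    speaker_part = design @ weights
    # Whatever the speaker embedding cannot explain linearly.
    residual = reps - speaker_part
    return speaker_part, residual

# Synthetic stand-ins for real features (hypothetical sizes).
rng = np.random.default_rng(0)
spk_emb = rng.normal(size=(200, 8))    # speaker embeddings
content = rng.normal(size=(200, 16))   # speaker-independent signal
mixing = rng.normal(size=(8, 16))      # linear speaker-to-feature map
reps = spk_emb @ mixing + content      # synthetic "SSL" features

speaker_part, residual = linear_speaker_decomposition(reps, spk_emb)
```

By construction the residual is uncorrelated with every dimension of the speaker embedding, which is the sense in which a linear fit of this kind removes speaker information; any nonlinear dependence would remain.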

Related papers

Self-Supervised Syllable Discovery Based on Speaker-Disentangled HuBERT [10.18337180909434]
Self-supervised speech representation learning has become essential for extracting meaningful features from untranscribed audio. We propose a speech-only self-supervised fine-tuning approach that separates syllabic units from speaker information.
arXiv Detail & Related papers (2024-09-16T09:07:08Z)
What Do Self-Supervised Speech and Speaker Models Learn? New Findings From a Cross Model Layer-Wise Analysis [44.93152068353389]
Self-supervised learning (SSL) has attracted increased attention for learning meaningful speech representations. Speaker SSL models adopt utterance-level training objectives primarily for speaker representation.
arXiv Detail & Related papers (2024-01-31T07:23:22Z)
Self-Supervised Disentangled Representation Learning for Robust Target Speech Extraction [17.05599594354308]
Speech signals are inherently complex as they encompass both global acoustic characteristics and local semantic information. In the task of target speech extraction, certain elements of global and local semantic information in the reference speech can lead to speaker confusion. We propose a self-supervised disentangled representation learning method to overcome this challenge.
arXiv Detail & Related papers (2023-12-16T03:48:24Z)
Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech. It is validated with experiments conducted on the VoxCeleb and SITW datasets with 9.56% and 8.24% average reductions in EER and minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems. We introduce spoken language understanding modules to extract speaker-related semantic information. We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method. Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
Adversarial Speaker Disentanglement Using Unannotated External Data for Self-supervised Representation Based Voice Conversion [35.23123094710891]
We propose a high-similarity any-to-one voice conversion method with the input of SSL representations. Experimental results show that our proposed method achieves comparable similarity and higher naturalness than the supervised method.
arXiv Detail & Related papers (2023-05-16T04:52:29Z)
Improving Self-Supervised Speech Representations by Disentangling Speakers [56.486084431528695]
Self-supervised learning in speech involves training a speech representation network on a large-scale unannotated speech corpus. Disentangling speakers is very challenging, because removing the speaker information could easily result in a loss of content as well. We propose a new SSL method that can achieve speaker disentanglement without severe loss of content.
arXiv Detail & Related papers (2022-04-20T04:56:14Z)
UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training [72.004873454347]
Two methods are introduced for enhancing unsupervised speaker information extraction. Experimental results on the SUPERB benchmark show that the proposed system achieves state-of-the-art performance. We scale up the training dataset to 94 thousand hours of public audio data and achieve further performance improvement.
arXiv Detail & Related papers (2021-10-12T05:43:30Z)
VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement. We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training. Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)

This list is automatically generated from the titles and abstracts of the papers on this site.