Improving Self-Supervised Speech Representations by Disentangling
Speakers
- URL: http://arxiv.org/abs/2204.09224v1
- Date: Wed, 20 Apr 2022 04:56:14 GMT
- Title: Improving Self-Supervised Speech Representations by Disentangling
Speakers
- Authors: Kaizhi Qian, Yang Zhang, Heting Gao, Junrui Ni, Cheng-I Lai, David
Cox, Mark Hasegawa-Johnson, Shiyu Chang
- Abstract summary: Self-supervised learning in speech involves training a speech representation network on a large-scale unannotated speech corpus.
Disentangling speakers is very challenging, because removing the speaker information could easily result in a loss of content as well.
We propose a new SSL method that can achieve speaker disentanglement without severe loss of content.
- Score: 56.486084431528695
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning in speech involves training a speech representation
network on a large-scale unannotated speech corpus, and then applying the
learned representations to downstream tasks. Since the majority of the
downstream tasks of SSL in speech largely focus on the content
information in speech, the most desirable speech representations should be able
to disentangle unwanted variations, such as speaker variations, from the
content. However, disentangling speakers is very challenging, because removing
the speaker information could easily result in a loss of content as well, and
the damage from the latter usually far outweighs the benefit of the former. In
this paper, we propose a new SSL method that can achieve speaker
disentanglement without severe loss of content. Our approach is adapted from
the HuBERT framework, and incorporates disentangling mechanisms to regularize
both the teacher labels and the learned representations. We evaluate the
benefit of speaker disentanglement on a set of content-related downstream
tasks, and observe a consistent and notable performance advantage of our
speaker-disentangled representations.
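The abstract does not spell out the exact disentangling mechanisms applied to the teacher labels and the learned representations, so the following is only a minimal sketch of one common way to regularize SSL features against speaker information: an adversarial speaker classifier attached through a gradient-reversal layer. The class names, feature dimension, and loss weighting are illustrative assumptions, not the paper's implementation.
```python
# Minimal sketch (assumed, not the paper's method): adversarial speaker
# regularization on frame-level SSL features via gradient reversal.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class SpeakerAdversary(nn.Module):
    """Predicts speaker identity from content features; trained adversarially."""

    def __init__(self, feat_dim=768, num_speakers=1000, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, num_speakers)
        )

    def forward(self, frame_features, speaker_ids):
        # frame_features: (batch, time, feat_dim); speaker_ids: (batch,)
        pooled = frame_features.mean(dim=1)  # utterance-level pooling
        reversed_feat = GradReverse.apply(pooled, self.lambd)
        logits = self.classifier(reversed_feat)
        # The classifier learns to identify speakers, while the reversed gradient
        # pushes the upstream encoder to discard speaker-discriminative cues.
        return nn.functional.cross_entropy(logits, speaker_ids)
```
In a HuBERT-style setup, such a term would typically be added to the masked-prediction objective with a small weight, so the encoder keeps content-predictive features while the reversed gradient discourages speaker-specific ones.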
Related papers
- Self-Powered LLM Modality Expansion for Large Speech-Text Models [62.27700381806554]
Large language models (LLMs) exhibit remarkable performance across diverse tasks.
This study aims to refine the use of speech datasets for LSM training by addressing the limitations of vanilla instruction tuning.
We introduce a self-powered LSM that leverages augmented automatic speech recognition data generated by the model itself for more effective instruction tuning.
arXiv Detail & Related papers (2024-10-04T04:34:24Z)
- Self-Supervised Syllable Discovery Based on Speaker-Disentangled HuBERT [10.18337180909434]
Self-supervised speech representation learning has become essential for extracting meaningful features from untranscribed audio.
We propose a speech-only self-supervised fine-tuning approach that separates syllabic units from speaker information.
arXiv Detail & Related papers (2024-09-16T09:07:08Z)
- What Do Self-Supervised Speech and Speaker Models Learn? New Findings From a Cross Model Layer-Wise Analysis [44.93152068353389]
Self-supervised learning (SSL) has attracted increased attention for learning meaningful speech representations.
Speaker SSL models adopt utterance-level training objectives primarily for speaker representation.
arXiv Detail & Related papers (2024-01-31T07:23:22Z)
- Self-Supervised Disentangled Representation Learning for Robust Target Speech Extraction [17.05599594354308]
Speech signals are inherently complex as they encompass both global acoustic characteristics and local semantic information.
In the task of target speech extraction, certain elements of global and local semantic information in the reference speech can lead to speaker confusion.
We propose a self-supervised disentangled representation learning method to overcome this challenge.
arXiv Detail & Related papers (2023-12-16T03:48:24Z)
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
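The VQMIVC entry above relies on a vector-quantization bottleneck for content encoding. Below is a minimal, generic sketch of such a straight-through VQ layer (VQ-VAE style); the codebook size, feature dimension, and commitment weight are assumptions for illustration, and VQMIVC's mutual-information term is not reproduced here.
```python
# Generic straight-through vector-quantization bottleneck for content encoding
# (illustrative sketch; not VQMIVC's exact configuration).
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, code_dim=64, commitment_weight=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.commitment_weight = commitment_weight

    def forward(self, z):
        # z: (batch, time, code_dim) continuous content features
        flat = z.reshape(-1, z.size(-1))
        # Nearest codebook entry for every frame
        dists = torch.cdist(flat, self.codebook.weight)
        indices = dists.argmin(dim=-1)
        quantized = self.codebook(indices).view_as(z)
        # Codebook and commitment losses (VQ-VAE style)
        codebook_loss = (quantized - z.detach()).pow(2).mean()
        commit_loss = (quantized.detach() - z).pow(2).mean()
        loss = codebook_loss + self.commitment_weight * commit_loss
        # Straight-through estimator: values come from the codes,
        # gradients flow back to the continuous features z.
        quantized = z + (quantized - z).detach()
        return quantized, indices.view(z.shape[:-1]), loss
```
The discrete bottleneck forces the content branch to keep only information expressible by a small codebook, which is one way speaker detail gets squeezed out of the content representation.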
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.