Self-Supervised Learning for Personalized Speech Enhancement
- URL: http://arxiv.org/abs/2104.02017v1
- Date: Mon, 5 Apr 2021 17:12:51 GMT
- Title: Self-Supervised Learning for Personalized Speech Enhancement
- Authors: Aswin Sivaraman, Minje Kim
- Abstract summary: Speech enhancement systems can show improved performance by adapting the model towards a single test-time speaker.
The test-time user might only provide a small amount of noise-free speech data, which is likely insufficient for traditional fully-supervised learning.
We propose self-supervised methods that are designed specifically to learn personalized and discriminative features from abundant in-the-wild noisy, but still personal speech recordings.
- Score: 25.05285328404576
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech enhancement systems can show improved performance by adapting the
model towards a single test-time speaker. In this personalization context, the
test-time user might only provide a small amount of noise-free speech data,
likely insufficient for traditional fully-supervised learning. One way to
overcome the lack of personal data is to transfer the model parameters from a
speaker-agnostic model to initialize the personalized model, and then to
finetune the model using the small amount of personal speech data. However,
this baseline adapts only marginally to the scarce clean speech data. Alternatively, we
propose self-supervised methods that are designed specifically to learn
personalized and discriminative features from abundant in-the-wild noisy, but
still personal speech recordings. Our experiment shows that the proposed
self-supervised learning methods initialize personalized speech enhancement
models better than the baseline fully-supervised methods, yielding superior
speech enhancement performance. The proposed methods also yield a more robust
feature set under real-world constraints: compressed model sizes and scarcity
of labeled data.
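To make the recipe concrete, below is a minimal PyTorch sketch of the two-stage pipeline the abstract describes: self-supervised pretraining of a feature encoder on the user's abundant noisy recordings, followed by supervised finetuning of an enhancement head on the scarce clean data. The masked-frame reconstruction pretext task and all architecture details here are illustrative assumptions; the paper's specific self-supervised objectives are not given in this summary.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Small speaker-adapted feature encoder, shared across both stages."""
    def __init__(self, n_freq=257, n_hidden=128):
        super().__init__()
        self.rnn = nn.GRU(n_freq, n_hidden, num_layers=2, batch_first=True)

    def forward(self, spec):              # spec: (batch, frames, n_freq)
        h, _ = self.rnn(spec)
        return h                          # (batch, frames, n_hidden)

encoder = Encoder()
recon_head = nn.Linear(128, 257)   # pretext head, discarded after pretraining
mask_head = nn.Linear(128, 257)    # enhancement head, trained in finetuning

def pretrain_step(noisy_spec, mask_ratio=0.2):
    """Self-supervised step on in-the-wild *noisy* personal recordings:
    zero out random frames and reconstruct them from context."""
    mask = torch.rand(noisy_spec.shape[:2]) < mask_ratio        # (B, T)
    corrupted = noisy_spec.masked_fill(mask.unsqueeze(-1), 0.0)
    recon = recon_head(encoder(corrupted))                      # (B, T, F)
    return ((recon - noisy_spec) ** 2)[mask].mean()

def finetune_step(clean_spec, noise_spec):
    """Supervised step on the small personal *clean* set: synthesize a
    mixture and learn a magnitude mask, reusing the pretrained encoder."""
    mixture = clean_spec + noise_spec    # crude additive mixing of magnitudes
    est_mask = torch.sigmoid(mask_head(encoder(mixture)))
    return ((est_mask * mixture - clean_spec) ** 2).mean()
```

Only the encoder weights carry over between the two stages; the reconstruction head exists solely to drive the pretext task.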
Related papers
- Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs).
We present a simple yet effective automatic process for creating speech-text pair data.
Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z)
- Speechworthy Instruction-tuned Language Models [71.8586707840169]
We show that both prompting and preference learning increase the speech-suitability of popular instruction-tuned LLMs.
We share lexical, syntactical, and qualitative analyses to showcase how each method contributes to improving the speech-suitability of generated responses.
arXiv Detail & Related papers (2024-09-23T02:34:42Z)
- Natural language guidance of high-fidelity text-to-speech with synthetic annotations [13.642358232817342]
We propose a scalable method for labeling various aspects of speaker identity, style, and recording conditions.
We then apply this method to a 45k hour dataset, which we use to train a speech language model.
Our results demonstrate high-fidelity speech generation in a diverse range of accents, prosodic styles, channel conditions, and acoustic conditions.
arXiv Detail & Related papers (2024-02-02T21:29:34Z)
- SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation [56.913182262166316]
Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation.
SpeechGPT-Gen is efficient in semantic and perceptual information modeling.
It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
arXiv Detail & Related papers (2024-01-24T15:25:01Z)
- Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study [68.88536866933038]
Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies.
Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations.
Applying methods such as de-duplication and subword modeling can further compress the speech sequence length (see the de-duplication sketch after this list).
arXiv Detail & Related papers (2023-09-27T17:21:13Z)
- Feature Normalization for Fine-tuning Self-Supervised Models in Speech Enhancement [19.632358491434697]
Large, pre-trained representation models trained using self-supervised learning have gained popularity in various fields of machine learning.
In this paper, we investigate the feasibility of using pre-trained speech representation models for a downstream speech enhancement task.
Our proposed method enables significant improvements in speech quality compared to baselines when combined with various types of pre-trained speech models.
arXiv Detail & Related papers (2023-06-14T10:03:33Z)
- Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
- A Single Self-Supervised Model for Many Speech Modalities Enables Zero-Shot Modality Transfer [31.028408352051684]
We present u-HuBERT, a self-supervised pre-training framework that can leverage both multimodal and unimodal speech.
Our single model yields 1.2%/1.4%/27.2% speech recognition word error rate on LRS3 with audio-visual/audio/visual input.
arXiv Detail & Related papers (2022-07-14T16:21:33Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Test-Time Adaptation Toward Personalized Speech Enhancement: Zero-Shot Learning with Knowledge Distillation [26.39206098000297]
We propose a novel personalized speech enhancement method to adapt a compact denoising model to the test-time specificity.
Our goal in this test-time adaptation is to utilize no clean speech target of the test speaker.
Instead of the missing clean utterance target, we distill the more advanced denoising results from an overly large teacher model (see the distillation sketch after this list).
arXiv Detail & Related papers (2021-05-08T00:42:03Z)
- Personalized Speech Enhancement through Self-Supervised Data Augmentation and Purification [24.596224536399326]
We train an SNR predictor model to estimate the frame-by-frame SNR of the pseudo-sources.
We empirically show that the proposed data purification step improves the usability of the speaker-specific noisy data (see the purification sketch after this list).
arXiv Detail & Related papers (2021-04-05T17:17:55Z)
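The comparative study of discrete speech units above mentions de-duplication as one way to shorten unit sequences before subword modeling. A minimal sketch of that step, assuming unit IDs are plain integers:

```python
from itertools import groupby

def deduplicate(units: list[int]) -> list[int]:
    """Collapse runs of identical discrete units into a single occurrence."""
    return [unit for unit, _ in groupby(units)]

# Consecutive repeats vanish, but later reappearances are kept.
assert deduplicate([5, 5, 5, 9, 9, 5]) == [5, 9, 5]
```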
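The test-time adaptation entry above replaces the missing clean targets with a large teacher's outputs. A minimal sketch of one such distillation step, assuming generic PyTorch teacher and student denoisers that map noisy input to an enhanced estimate; the loss and optimizer choices are illustrative, not the paper's exact setup:

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, noisy_batch, optimizer):
    """One adaptation step on the test speaker's noisy recordings: the
    frozen teacher's denoised output stands in for the clean target."""
    with torch.no_grad():
        pseudo_clean = teacher(noisy_batch)   # large model's estimate
    estimate = student(noisy_batch)           # compact personalized model
    loss = F.mse_loss(estimate, pseudo_clean)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```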
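The data augmentation and purification entry above trains a predictor to score the frame-by-frame SNR of pseudo-sources. A minimal sketch of purification by thresholding predicted SNR; the predictor architecture and the 0 dB cutoff are assumptions for illustration:

```python
import torch
import torch.nn as nn

class FrameSNRPredictor(nn.Module):
    """Scores each spectrogram frame with a predicted SNR in dB."""
    def __init__(self, n_freq=257, n_hidden=64):
        super().__init__()
        self.rnn = nn.GRU(n_freq, n_hidden, batch_first=True)
        self.out = nn.Linear(n_hidden, 1)

    def forward(self, spec):                  # spec: (batch, frames, n_freq)
        h, _ = self.rnn(spec)
        return self.out(h).squeeze(-1)        # (batch, frames)

def purify(spec, predictor, snr_threshold_db=0.0):
    """Keep only frames the predictor judges cleaner than the threshold."""
    with torch.no_grad():
        snr = predictor(spec)                 # (batch, frames)
    keep = snr > snr_threshold_db
    return spec[keep]                         # (n_kept_frames, n_freq)
```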
This list is automatically generated from the titles and abstracts of the papers in this site.