A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech
Enhancement
- URL: http://arxiv.org/abs/2403.01369v1
- Date: Sun, 3 Mar 2024 02:05:17 GMT
- Authors: Ravi Shankar, Ke Tan, Buye Xu, Anurag Kumar
- Abstract summary: Self-supervised learned models have been found to be very effective for certain speech tasks.
In this paper, we investigate the uses of SSL representations for single-channel speech enhancement in challenging conditions.
- Score: 16.900731393703648
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised learned models have been found to be very effective for
certain speech tasks such as automatic speech recognition, speaker
identification, keyword spotting and others. While the features are undeniably
useful in speech recognition and associated tasks, their utility in speech
enhancement systems is yet to be firmly established, and perhaps not properly
understood. In this paper, we investigate the uses of SSL representations for
single-channel speech enhancement in challenging conditions and find that they
add very little value for the enhancement task. Our constraints are designed
around on-device real-time speech enhancement: the model is causal and the compute
footprint is small. Additionally, we focus on low SNR conditions where such
models struggle to provide good enhancement. In order to systematically examine
how SSL representations impact performance of such enhancement models, we
propose a variety of techniques to utilize these embeddings which include
different forms of knowledge-distillation and pre-training.
Related papers
- Exploring the Benefits of Tokenization of Discrete Acoustic Units [4.591279524925446]
Tokenization algorithms merge the units of a base vocabulary into larger, variable-rate units.
We demonstrate that tokenization yields significant improvements in terms of performance, as well as training and inference speed.
arXiv Detail & Related papers (2024-06-08T18:34:28Z)
- Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations [30.293081541301746]
Self-supervised learning (SSL) speech models such as wav2vec and HuBERT have demonstrated state-of-the-art performance on automatic speech recognition.
We argue that the problem is caused by the lack of disentangled representations and an utterance-level learning objective.
Our models outperform the current best model, WavLM, on all utterance-level non-semantic tasks on the SUPERB benchmark with only 20% of labeled data.
arXiv Detail & Related papers (2023-05-14T08:26:24Z)
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information, at the same time performing lightweight domain adaptation.
We show that these can be trained on a small amount of weakly labelled video data with minimum additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
- SPADE: Self-supervised Pretraining for Acoustic DisEntanglement [2.294014185517203]
We introduce a self-supervised approach to disentangle room acoustics from speech.
Our results demonstrate that our proposed approach significantly improves performance over a baseline when labeled training data is scarce.
arXiv Detail & Related papers (2023-02-03T01:36:38Z)
- Towards Disentangled Speech Representations [65.7834494783044]
We construct a representation learning task based on joint modeling of ASR and TTS.
We seek to learn a representation of audio that disentangles that part of the speech signal that is relevant to transcription from that part which is not.
We show that enforcing these properties during training improves WER by 24.5% relative on average for our joint modeling task.
arXiv Detail & Related papers (2022-08-28T10:03:55Z)
- An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech Processing Tasks [112.1942546460814]
We report the first exploration of the prompt tuning paradigm for speech processing tasks based on Generative Spoken Language Model (GSLM).
Experiment results show that the prompt tuning technique achieves competitive performance in speech classification tasks with fewer trainable parameters than fine-tuning specialized downstream models.
arXiv Detail & Related papers (2022-03-31T03:26:55Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- Personalized Speech Enhancement: New Models and Comprehensive Evaluation [27.572537325449158]
We propose two neural networks for personalized speech enhancement (PSE) models that achieve superior performance to the previously proposed VoiceFilter.
We also create test sets that capture a variety of scenarios that users can encounter during video conferencing.
Our results show that the proposed models can yield better speech recognition accuracy, speech intelligibility, and perceptual quality than the baseline models.
arXiv Detail & Related papers (2021-10-18T21:21:23Z)
- UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training [72.004873454347]
Two methods are introduced for enhancing the unsupervised speaker information extraction.
Experiment results on SUPERB benchmark show that the proposed system achieves state-of-the-art performance.
We scale up the training dataset to 94 thousand hours of public audio data and achieve further performance improvement.
arXiv Detail & Related papers (2021-10-12T05:43:30Z)
- Knowing What to Listen to: Early Attention for Deep Speech Representation Learning [25.71206255965502]
We propose the novel Fine-grained Early Frequency Attention (FEFA) for speech signals.
This model is capable of focusing on information items as small as frequency bins.
We evaluate the proposed model on two popular tasks of speaker recognition and speech emotion recognition.
arXiv Detail & Related papers (2020-09-03T17:40:27Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.