A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech Enhancement
- URL: http://arxiv.org/abs/2403.01369v1
- Date: Sun, 3 Mar 2024 02:05:17 GMT
- Title: A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech Enhancement
- Authors: Ravi Shankar, Ke Tan, Buye Xu, Anurag Kumar
- Abstract summary: Self-supervised learned models have been found to be very effective for certain speech tasks.
In this paper, we investigate the uses of SSL representations for single-channel speech enhancement in challenging conditions.
- Score: 16.900731393703648
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised learned models have been found to be very effective for
certain speech tasks such as automatic speech recognition, speaker
identification, keyword spotting and others. While the features are undeniably
useful in speech recognition and associated tasks, their utility in speech
enhancement systems is yet to be firmly established, and perhaps not properly
understood. In this paper, we investigate the uses of SSL representations for
single-channel speech enhancement in challenging conditions and find that they
add very little value for the enhancement task. Our constraints are designed
around on-device real-time speech enhancement: the model is causal and the compute
footprint is small. Additionally, we focus on low SNR conditions where such
models struggle to provide good enhancement. In order to systematically examine
how SSL representations impact performance of such enhancement models, we
propose a variety of techniques to utilize these embeddings which include
different forms of knowledge-distillation and pre-training.
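One generic form of knowledge distillation from SSL embeddings can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: the function names, the linear projection `proj`, the mean-squared-error objective, and the weight `alpha` are all hypothetical, and the teacher embeddings are assumed to be precomputed and already frame-aligned with the student's features.

```python
import numpy as np


def distillation_loss(student_feats, teacher_embs, proj):
    """MSE between linearly projected student features and teacher embeddings.

    student_feats: (frames, d_student) intermediate features of the small model
    teacher_embs:  (frames, d_teacher) frozen SSL embeddings (assumed precomputed)
    proj:          (d_student, d_teacher) projection into the teacher's space
    """
    projected = student_feats @ proj
    return float(np.mean((projected - teacher_embs) ** 2))


def total_loss(enhanced, clean, student_feats, teacher_embs, proj, alpha=0.1):
    """Enhancement MSE plus a weighted distillation term (alpha is hypothetical)."""
    enhancement = float(np.mean((enhanced - clean) ** 2))
    return enhancement + alpha * distillation_loss(student_feats, teacher_embs, proj)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames, d_student, d_teacher = 50, 64, 768  # illustrative sizes only
    student = rng.standard_normal((frames, d_student))
    teacher = rng.standard_normal((frames, d_teacher))
    proj = rng.standard_normal((d_student, d_teacher)) * 0.01
    clean = rng.standard_normal(16000)
    enhanced = clean + 0.1 * rng.standard_normal(16000)
    print(total_loss(enhanced, clean, student, teacher, proj))
```

In a setup like this the teacher is used only during training, so the deployed student keeps its small, causal footprint.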
Related papers
- Incorporating Talker Identity Aids With Improving Speech Recognition in Adversarial Environments [0.2916558661202724]
We develop a transformer-based model that jointly performs speech recognition and speaker identification.
We show that the joint model performs comparably to Whisper under clean conditions.
Our results suggest that integrating voice representations with speech recognition can lead to more robust models under adversarial conditions.
(2024-10-07T18:39:59Z)
- Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs).
We present a simple yet effective automatic process for creating speech-text pair data.
Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
(2024-09-30T07:01:21Z)
- SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks [94.10497337235083]
We are the first to explore the potential of prompting speech LMs in the domain of speech processing.
We reformulate speech processing tasks into speech-to-unit generation tasks.
We show that the prompting method can achieve competitive performance compared to the strong fine-tuning method.
(2024-08-23T13:00:10Z)
- Exploring the Benefits of Tokenization of Discrete Acoustic Units [4.591279524925446]
Tokenization algorithms merge the units of a base vocabulary into larger, variable-rate units.
We demonstrate that tokenization yields significant improvements in terms of performance, as well as training and inference speed.
(2024-06-08T18:34:28Z)
- Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations [30.293081541301746]
Self-supervised learning (SSL) speech models such as wav2vec and HuBERT have demonstrated state-of-the-art performance on automatic speech recognition.
We argue that the problem is caused by the lack of disentangled representations and an utterance-level learning objective.
Our models outperform the current best model, WavLM, on all utterance-level non-semantic tasks on the SUPERB benchmark with only 20% of labeled data.
(2023-05-14T08:26:24Z)
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information, at the same time performing lightweight domain adaptation.
We show that these can be trained on a small amount of weakly labelled video data with minimum additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
(2023-03-29T07:24:28Z)
- Towards Disentangled Speech Representations [65.7834494783044]
We construct a representation learning task based on joint modeling of ASR and TTS.
We seek to learn a representation of audio that disentangles that part of the speech signal that is relevant to transcription from that part which is not.
We show that enforcing these properties during training improves WER by 24.5% relative on average for our joint modeling task.
(2022-08-28T10:03:55Z)
- An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech Processing Tasks [112.1942546460814]
We report the first exploration of the prompt tuning paradigm for speech processing tasks based on the Generative Spoken Language Model (GSLM).
Experiment results show that the prompt tuning technique achieves competitive performance in speech classification tasks with fewer trainable parameters than fine-tuning specialized downstream models.
(2022-03-31T03:26:55Z)
- Personalized Speech Enhancement: New Models and Comprehensive Evaluation [27.572537325449158]
We propose two neural networks for personalized speech enhancement (PSE) models that achieve superior performance to the previously proposed VoiceFilter.
We also create test sets that capture a variety of scenarios that users can encounter during video conferencing.
Our results show that the proposed models can yield better speech recognition accuracy, speech intelligibility, and perceptual quality than the baseline models.
(2021-10-18T21:21:23Z)
- Knowing What to Listen to: Early Attention for Deep Speech Representation Learning [25.71206255965502]
We propose the novel Fine-grained Early Frequency Attention (FEFA) for speech signals.
This model is capable of focusing on information items as small as frequency bins.
We evaluate the proposed model on two popular tasks of speaker recognition and speech emotion recognition.
(2020-09-03T17:40:27Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
(2020-02-14T05:05:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.