A Systematic Comparison of Phonetic Aware Techniques for Speech
Enhancement
- URL: http://arxiv.org/abs/2206.11000v1
- Date: Wed, 22 Jun 2022 12:00:50 GMT
- Title: A Systematic Comparison of Phonetic Aware Techniques for Speech
Enhancement
- Authors: Or Tal, Moshe Mandel, Felix Kreuk, Yossi Adi
- Abstract summary: We compare different methods of incorporating phonetic information in a speech enhancement model.
We observe the influence of different phonetic content models as well as various feature-injection techniques on enhancement performance.
- Score: 20.329872147913584
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech enhancement has seen great improvement in recent years using
end-to-end neural networks. However, most models are agnostic to the spoken
phonetic content. Recently, several studies suggested phonetic-aware speech
enhancement, mostly using perceptual supervision. Yet, injecting phonetic
features during model optimization can take additional forms (e.g., model
conditioning). In this paper, we conduct a systematic comparison between
different methods of incorporating phonetic information in a speech enhancement
model. By conducting a series of controlled experiments, we observe the
influence of different phonetic content models as well as various
feature-injection techniques on enhancement performance, considering both
causal and non-causal models. Specifically, we evaluate three settings for
injecting phonetic information, namely: i) feature conditioning; ii) perceptual
supervision; and iii) regularization. Phonetic features are obtained from an
intermediate layer of either a supervised pre-trained Automatic Speech
Recognition (ASR) model or a pre-trained Self-Supervised Learning (SSL) model.
We further observe the effect of choosing different embedding layers on
performance, considering both manual and learned configurations. Results
suggest that using an SSL model to extract phonetic features outperforms the
ASR-based one in most cases. Interestingly, the conditioning setting performs
best among the evaluated configurations.
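To make the three injection settings concrete, the following is a minimal sketch (not the authors' implementation), assuming a toy PyTorch encoder-decoder enhancer and a frozen HuBERT-Base model from torchaudio as the SSL phonetic content model. The layer index, loss weights, toy architecture, and the exact form of the regularization term are illustrative assumptions rather than the paper's configuration.
```python
# Minimal sketch (not the authors' implementation) of the three phonetic
# injection settings, assuming a toy encoder-decoder enhancer and a frozen
# HuBERT-Base model from torchaudio as the SSL phonetic content model.
# Layer index, loss weights, the toy architecture, and the exact form of the
# regularization term are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio

bundle = torchaudio.pipelines.HUBERT_BASE        # expects 16 kHz input
ssl_model = bundle.get_model().eval()            # frozen phonetic content model
for p in ssl_model.parameters():
    p.requires_grad_(False)

SSL_LAYER = 6   # assumed intermediate transformer layer
SSL_DIM = 768   # HuBERT-Base hidden size


def phonetic_features(wav):
    """Features of one intermediate SSL layer for a (batch, time) waveform."""
    feats, _ = ssl_model.extract_features(wav, num_layers=SSL_LAYER)
    return feats[-1]                             # (batch, frames, SSL_DIM)


class ToyEnhancer(nn.Module):
    """Stand-in enhancer; the paper uses far stronger causal/non-causal models."""

    def __init__(self, hidden=64, conditioned=False):
        super().__init__()
        self.conditioned = conditioned
        self.encoder = nn.Conv1d(1, hidden, kernel_size=8, stride=4)
        in_ch = hidden + (SSL_DIM if conditioned else 0)
        self.decoder = nn.ConvTranspose1d(in_ch, 1, kernel_size=8, stride=4)
        self.to_phonetic = nn.Linear(hidden, SSL_DIM)  # only for regularization

    def forward(self, noisy, cond=None):
        z = F.relu(self.encoder(noisy.unsqueeze(1)))   # latent: (B, hidden, T')
        h = z
        if self.conditioned and cond is not None:
            # i) feature conditioning: inject (resampled) phonetic features
            cond = F.interpolate(cond.transpose(1, 2), size=z.shape[-1])
            h = torch.cat([z, cond], dim=1)
        return self.decoder(h).squeeze(1), z


def training_loss(model, noisy, clean, setting="conditioning"):
    """Waveform L1 loss plus the phonetic term for one injection setting."""
    cond = phonetic_features(noisy).detach() if setting == "conditioning" else None
    enhanced, latent = model(noisy, cond)
    clean = clean[..., : enhanced.shape[-1]]       # align lengths
    loss = F.l1_loss(enhanced, clean)
    if setting == "perceptual":
        # ii) perceptual supervision: match SSL features of enhanced and clean speech
        loss = loss + F.l1_loss(phonetic_features(enhanced), phonetic_features(clean))
    elif setting == "regularization":
        # iii) regularization (assumed form): pull the enhancer's latent toward the
        # clean signal's SSL features through a learned projection
        target = phonetic_features(clean)
        proj = model.to_phonetic(latent.transpose(1, 2)).transpose(1, 2)
        proj = F.interpolate(proj, size=target.shape[1]).transpose(1, 2)
        loss = loss + F.mse_loss(proj, target)
    return enhanced, loss


# Example: one gradient step per setting on a dummy 1-second batch at 16 kHz.
if __name__ == "__main__":
    noisy, clean = torch.randn(2, 16000), torch.randn(2, 16000)
    for setting in ("conditioning", "perceptual", "regularization"):
        model = ToyEnhancer(conditioned=(setting == "conditioning"))
        _, loss = training_loss(model, noisy, clean, setting)
        loss.backward()
        print(setting, float(loss))
```
In the study itself, the same settings are also evaluated with ASR-derived phonetic features, with both causal and non-causal enhancement backbones, and with manually chosen as well as learned embedding layers; a learned configuration would replace the fixed SSL_LAYER above with, for example, a trainable weighting over layers.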
Related papers
- Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback [50.84142264245052]
This work introduces the Align-SLM framework to enhance the semantic understanding of textless Spoken Language Models (SLMs).
Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO).
We evaluate the framework using ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT4-o score and human evaluation.
arXiv Detail & Related papers (2024-11-04T06:07:53Z)
- Developing Acoustic Models for Automatic Speech Recognition in Swedish [6.5458610824731664]
This paper is concerned with automatic continuous speech recognition using trainable systems.
The aim of this work is to build acoustic models for spoken Swedish.
arXiv Detail & Related papers (2024-04-25T12:03:14Z)
- Ensemble knowledge distillation of self-supervised speech models [84.69577440755457]
Distilled self-supervised models have shown competitive performance and efficiency in recent years.
We performed Ensemble Knowledge Distillation (EKD) on various self-supervised speech models such as HuBERT, RobustHuBERT, and WavLM.
Our method improves the performance of the distilled models on four downstream speech processing tasks.
arXiv Detail & Related papers (2023-02-24T17:15:39Z)
- Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition [66.94463981654216]
We propose prompt tuning methods of Deep Neural Networks (DNNs) for speaker-adaptive Visual Speech Recognition (VSR).
We finetune prompts on adaptation data of target speakers instead of modifying the pre-trained model parameters.
The effectiveness of the proposed method is evaluated on both word- and sentence-level VSR databases.
arXiv Detail & Related papers (2023-02-16T06:01:31Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in the human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- An Investigation of End-to-End Models for Robust Speech Recognition [20.998349142078805]
We present a comparison of speech enhancement-based techniques and three different model-based adaptation techniques for robust automatic speech recognition.
While adversarial learning is the best-performing technique on certain noise types, it comes at the cost of degrading clean speech WER.
On other relatively stationary noise types, a new speech enhancement technique outperformed all the model-based adaptation techniques.
arXiv Detail & Related papers (2021-02-11T19:47:13Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze an input acoustic signal, understand its linguistic content, and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
- Knowing What to Listen to: Early Attention for Deep Speech Representation Learning [25.71206255965502]
We propose the novel Fine-grained Early Attention (FEFA) mechanism for speech signals.
This model is capable of focusing on information items as small as frequency bins.
We evaluate the proposed model on two popular tasks of speaker recognition and speech emotion recognition.
arXiv Detail & Related papers (2020-09-03T17:40:27Z)
- Phoneme Boundary Detection using Learnable Segmental Features [31.203969460341817]
Phoneme boundary detection is an essential first step for a variety of speech processing applications.
We propose a neural architecture coupled with a parameterized structured loss function to learn segmental representations for the task of phoneme boundary detection.
arXiv Detail & Related papers (2020-02-11T14:03:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.