A Training Framework for Stereo-Aware Speech Enhancement using Deep
Neural Networks
- URL: http://arxiv.org/abs/2112.04939v1
- Date: Thu, 9 Dec 2021 14:13:41 GMT
- Title: A Training Framework for Stereo-Aware Speech Enhancement using Deep
Neural Networks
- Authors: Bahareh Tolooshams and Kazuhito Koishida
- Abstract summary: We propose a novel stereo-aware framework for speech enhancement.
The proposed framework is model-independent, so it can be applied to any deep learning-based architecture.
We show that regularizing with an image-preservation loss improves overall performance and better preserves the stereo aspect of the speech.
- Score: 34.012007729454815
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning-based speech enhancement has shown unprecedented performance in
recent years. The most popular mono speech enhancement frameworks are
end-to-end networks mapping the noisy mixture into an estimate of the clean
speech. With growing computational power and the availability of multichannel
microphone recordings, prior works have aimed to incorporate spatial statistics
along with spectral information to boost performance. Despite improvements in the
enhancement performance of the mono output, spatial image preservation and
subjective evaluation have received little attention in the literature. This
paper proposes a novel stereo-aware framework for speech enhancement, i.e., a
training loss for deep learning-based speech enhancement that preserves the
spatial image while enhancing the stereo mixture. The proposed framework is
model-independent, so it can be applied to any deep learning-based
architecture. We provide an extensive objective and subjective evaluation of
the trained models through a listening test. We show that regularizing with
an image-preservation loss improves overall performance and better preserves
the stereo aspect of the speech.
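
The abstract describes the framework only at a high level: a model-independent training objective that enhances both channels while regularizing for spatial-image preservation. The sketch below illustrates one way such a loss could be assembled in PyTorch; the specific interchannel level- and phase-difference (ILD/IPD) terms, the STFT parameters, the helper names `stft` and `stereo_aware_loss`, and the weight `image_weight` are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a stereo-aware training loss, not the authors' released code.
# Assumption: the image-preservation term compares interchannel level and phase
# differences (ILD/IPD) of the enhanced and clean stereo signals in the STFT domain.
import torch
import torch.nn.functional as F


def stft(x, n_fft=512, hop=128):
    """Per-channel STFT of a (batch, 2, time) stereo signal -> (batch, 2, freq, frames)."""
    b, c, t = x.shape
    spec = torch.stft(
        x.reshape(b * c, t),
        n_fft=n_fft,
        hop_length=hop,
        window=torch.hann_window(n_fft, device=x.device),
        return_complex=True,
    )
    return spec.reshape(b, c, *spec.shape[-2:])


def stereo_aware_loss(enhanced, clean, image_weight=0.1, eps=1e-8):
    """Per-channel enhancement loss plus a spatial-image preservation regularizer."""
    # Waveform enhancement loss on each channel (L1 here; any mono loss could be used).
    enhancement = F.l1_loss(enhanced, clean)

    E, C = stft(enhanced), stft(clean)
    # Interchannel level difference (in dB) of the enhanced and clean stereo pairs.
    ild_e = 20 * torch.log10((E[:, 0].abs() + eps) / (E[:, 1].abs() + eps))
    ild_c = 20 * torch.log10((C[:, 0].abs() + eps) / (C[:, 1].abs() + eps))
    # Interchannel phase difference, compared via a circular distance to avoid wrapping.
    ipd_e = torch.angle(E[:, 0] * E[:, 1].conj())
    ipd_c = torch.angle(C[:, 0] * C[:, 1].conj())
    image = F.l1_loss(ild_e, ild_c) + (1 - torch.cos(ipd_e - ipd_c)).mean()

    return enhancement + image_weight * image
```

Formulated this way, the regularizer can be added on top of any per-channel enhancement loss without touching the network, which is what would make such a framework model-independent: only the training objective changes.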
Related papers
- FINALLY: fast and universal speech enhancement with studio-like quality [7.207284147264852]
We address the challenge of speech enhancement in real-world recordings, which often contain various forms of distortion.
We study various feature extractors for the perceptual loss to facilitate the stability of adversarial training.
We integrate a WavLM-based perceptual loss into the MS-STFT adversarial training pipeline, creating an effective and stable training procedure for the speech enhancement model.
arXiv Detail & Related papers (2024-10-08T11:16:03Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
- Lip-Listening: Mixing Senses to Understand Lips using Cross Modality Knowledge Distillation for Word-Based Models [0.03499870393443267]
This work builds on recent state-of-the-art word-based lipreading models by integrating sequence-level and frame-level Knowledge Distillation (KD) into their systems.
We propose a technique to transfer speech recognition capabilities from audio speech recognition systems to visual speech recognizers, where our goal is to utilize audio data during lipreading model training.
arXiv Detail & Related papers (2022-06-05T15:47:54Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- Learning Audio-Visual Dereverberation [87.52880019747435]
Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition.
Our idea is to learn to dereverberate speech from audio-visual observations.
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene.
arXiv Detail & Related papers (2021-06-14T20:01:24Z)
- High Fidelity Speech Regeneration with Application to Speech Enhancement [96.34618212590301]
We propose a wav-to-wav generative model for speech that can generate 24 kHz speech in real time.
Inspired by voice conversion methods, we train to augment the speech characteristics while preserving the identity of the source.
arXiv Detail & Related papers (2021-01-31T10:54:27Z)
- Knowing What to Listen to: Early Attention for Deep Speech Representation Learning [25.71206255965502]
We propose Fine-grained Early Attention (FEFA), a novel attention mechanism for speech signals.
This model is capable of focusing on information items as small as frequency bins.
We evaluate the proposed model on two popular tasks of speaker recognition and speech emotion recognition.
arXiv Detail & Related papers (2020-09-03T17:40:27Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
Deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided (including all listed content) and is not responsible for any consequences of its use.