Improving Perceptual Quality by Phone-Fortified Perceptual Loss using
Wasserstein Distance for Speech Enhancement
- URL: http://arxiv.org/abs/2010.15174v3
- Date: Tue, 27 Apr 2021 08:14:56 GMT
- Title: Improving Perceptual Quality by Phone-Fortified Perceptual Loss using
Wasserstein Distance for Speech Enhancement
- Authors: Tsun-An Hsieh, Cheng Yu, Szu-Wei Fu, Xugang Lu, and Yu Tsao
- Abstract summary: We propose a phone-fortified perceptual loss (PFPL) that takes phonetic information into account for training SE models.
To effectively incorporate the phonetic information, the PFPL is computed based on latent representations of the wav2vec model.
Our experimental results first reveal that the PFPL is more correlated with the perceptual evaluation metrics, as compared to signal-level losses.
- Score: 23.933935913913043
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech enhancement (SE) aims to improve speech quality and intelligibility,
which are both related to a smooth transition in speech segments that may carry
linguistic information, e.g. phones and syllables. In this study, we propose a
novel phone-fortified perceptual loss (PFPL) that takes phonetic information
into account for training SE models. To effectively incorporate the phonetic
information, the PFPL is computed based on latent representations of the
wav2vec model, a powerful self-supervised encoder that renders rich phonetic
information. To more accurately measure the distribution distances of the
latent representations, the PFPL adopts the Wasserstein distance as the
distance measure. Our experimental results first reveal that the PFPL is more
correlated with the perceptual evaluation metrics, as compared to signal-level
losses. Moreover, the results showed that the PFPL can enable a deep complex
U-Net SE model to achieve highly competitive performance in terms of
standardized quality and intelligibility evaluations on the Voice Bank-DEMAND
dataset.
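As described above, the PFPL compares wav2vec latent representations of enhanced and clean speech under the Wasserstein distance. The sketch below is a minimal, hypothetical illustration of that idea — the function names, the per-dimension 1-D Wasserstein reduction, and the added time-domain L1 term are assumptions for illustration, not the authors' exact formulation:

```python
import numpy as np

def wasserstein_1d(u: np.ndarray, v: np.ndarray) -> float:
    """Empirical 1-Wasserstein distance between two equal-sized 1-D samples.

    For equal-sized samples this reduces to the mean absolute difference
    of the sorted values.
    """
    return float(np.mean(np.abs(np.sort(u) - np.sort(v))))

def pfpl_sketch(latents_enhanced: np.ndarray, latents_clean: np.ndarray,
                wave_enhanced: np.ndarray, wave_clean: np.ndarray,
                alpha: float = 1.0) -> float:
    """Hypothetical PFPL-style loss: a time-domain L1 term plus the mean
    per-dimension Wasserstein distance between latent feature
    distributions (arrays of shape [frames, dims])."""
    l1 = float(np.mean(np.abs(wave_enhanced - wave_clean)))
    wd = float(np.mean([
        wasserstein_1d(latents_enhanced[:, d], latents_clean[:, d])
        for d in range(latents_clean.shape[1])
    ]))
    return l1 + alpha * wd
```

In practice the latents would come from a pretrained wav2vec encoder and the loss would be computed on differentiable tensors inside the training loop; NumPy is used here only to keep the sketch self-contained.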
Related papers
- Psychoacoustic Challenges Of Speech Enhancement On VoIP Platforms [19.122454483635615]
The study draws upon the Deep Noise Suppression (DNS) 2020 dataset, ensuring a structured examination tailored to various denoising settings and receiver interfaces.
A methodological novelty is introduced via Blinder-Oaxaca decomposition, traditionally an econometric tool, repurposed herein to analyze acoustic-phonetic perturbations within VoIP systems.
In addition to the primary findings, a multitude of metrics are reported, extending the research purview.
arXiv Detail & Related papers (2023-10-11T03:19:22Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Private Language Model Adaptation for Speech Recognition [15.726921748859393]
Speech model adaptation is crucial to handle the discrepancy between server-side proxy training data and actual data received on users' local devices.
We introduce an efficient approach on continuously adapting neural network language models (NNLMs) on private devices with applications on automatic speech recognition.
arXiv Detail & Related papers (2021-09-28T00:15:43Z)
- Pre-training for Spoken Language Understanding with Joint Textual and Phonetic Representation Learning [4.327558819000435]
We propose a novel joint textual-phonetic pre-training approach for learning spoken language representations.
Experimental results on spoken language understanding benchmarks, Fluent Speech Commands and SNIPS, show that the proposed approach significantly outperforms strong baseline models.
arXiv Detail & Related papers (2021-04-21T05:19:13Z)
- Bridging the Gap Between Clean Data Training and Real-World Inference for Spoken Language Understanding [76.89426311082927]
Existing models are trained on clean data, which causes a gap between clean-data training and real-world inference.
We propose a method from the perspective of domain adaptation, by which both high- and low-quality samples are embedded into a similar vector space.
Experiments on the widely used Snips dataset and a large-scale in-house dataset (10 million training examples) demonstrate that this method not only outperforms the baseline models on a real-world (noisy) corpus but also enhances robustness, i.e., it produces high-quality results in a noisy environment.
arXiv Detail & Related papers (2021-04-13T17:54:33Z)
- Improving Speech Enhancement Performance by Leveraging Contextual Broad Phonetic Class Information [33.79855927394387]
We explore the contextual information of articulatory attributes as additional information to further benefit speech enhancement.
We propose to improve the SE performance by leveraging losses from an end-to-end automatic speech recognition model.
Experimental results on speech denoising, speech dereverberation, and impaired speech enhancement tasks confirmed that contextual broad phonetic class (BPC) information improves SE performance.
arXiv Detail & Related papers (2020-11-15T03:56:37Z)
- Speaker Representation Learning using Global Context Guided Channel and Time-Frequency Transformations [67.18006078950337]
We use the global context information to enhance important channels and recalibrate salient time-frequency locations.
The proposed modules, together with a popular ResNet-based model, are evaluated on the VoxCeleb1 dataset.
arXiv Detail & Related papers (2020-09-02T01:07:29Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improve the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reduce system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.