PAAPLoss: A Phonetic-Aligned Acoustic Parameter Loss for Speech
Enhancement
- URL: http://arxiv.org/abs/2302.08095v1
- Date: Thu, 16 Feb 2023 05:17:06 GMT
- Title: PAAPLoss: A Phonetic-Aligned Acoustic Parameter Loss for Speech
Enhancement
- Authors: Muqiao Yang, Joseph Konan, David Bick, Yunyang Zeng, Shuo Han, Anurag
Kumar, Shinji Watanabe, Bhiksha Raj
- Abstract summary: We propose a learning objective that formalizes differences in perceptual quality.
We identify temporal acoustic parameters that are non-differentiable.
We develop a neural network estimator that can accurately predict their time-series values.
- Score: 41.872384434583466
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite rapid advancement in recent years, current speech enhancement models
often produce speech that differs in perceptual quality from real clean speech.
We propose a learning objective that formalizes differences in perceptual
quality, by using domain knowledge of acoustic-phonetics. We identify temporal
acoustic parameters -- such as spectral tilt, spectral flux, shimmer, etc. --
that are non-differentiable, and we develop a neural network estimator that can
accurately predict their time-series values across an utterance. We also model
phoneme-specific weights for each feature, as the acoustic parameters are known
to show different behavior in different phonemes. We can add this criterion as
an auxiliary loss to any model that produces speech, to optimize speech outputs
to match the values of clean speech in these features. Experimentally we show
that it improves speech enhancement workflows in both time-domain and
time-frequency domain, as measured by standard evaluation metrics. We also
provide an analysis of phoneme-dependent improvement on acoustic parameters,
demonstrating the additional interpretability that our method provides. This
analysis can suggest which features are currently the bottleneck for
improvement.
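The abstract describes two computable pieces: a neural estimator that predicts per-frame values of the (otherwise non-differentiable) acoustic parameters, and a phoneme-weighted auxiliary loss that matches those values between enhanced and clean speech. The PyTorch sketch below illustrates how such a criterion could be wired up; the class and function names, network shape, parameter count, and STFT settings are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


def log_spectrogram(wave, n_fft=512, hop=128):
    """(B, samples) waveform -> (B, n_fft // 2 + 1, frames) log-magnitude STFT."""
    window = torch.hann_window(n_fft, device=wave.device)
    spec = torch.stft(wave, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)
    return spec.abs().clamp_min(1e-5).log()


class AcousticParamEstimator(nn.Module):
    """Illustrative frame-level estimator of temporal acoustic parameters.

    Maps a log-magnitude spectrogram to per-frame values of n_params acoustic
    parameters (e.g. spectral tilt, spectral flux, shimmer). In practice it
    would be trained to regress targets produced by a conventional,
    non-differentiable extractor, then frozen and reused as a differentiable
    proxy inside the enhancement training loop.
    """

    def __init__(self, n_freq=257, n_params=25):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_freq, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, n_params, kernel_size=1),
        )

    def forward(self, log_spec):                     # (B, n_freq, T)
        return self.net(log_spec).transpose(1, 2)    # (B, T, n_params)


def paap_loss(enhanced, clean, phonemes, estimator, phoneme_weights):
    """Phoneme-weighted acoustic-parameter matching loss (sketch).

    enhanced, clean : (B, samples) waveforms
    phonemes        : (B, T) integer phoneme labels aligned to STFT frames
    phoneme_weights : (n_phonemes, n_params) per-phoneme, per-parameter weights
    """
    params_enh = estimator(log_spectrogram(enhanced))      # (B, T, P)
    with torch.no_grad():                                  # clean targets carry no gradient
        params_cln = estimator(log_spectrogram(clean))     # (B, T, P)

    err = (params_enh - params_cln).abs()                  # per-frame, per-parameter error
    weights = phoneme_weights[phonemes]                    # (B, T, P) via phoneme lookup
    return (weights * err).mean()
```

Used as an auxiliary term, the criterion would simply be added to whatever objective the enhancement model already optimizes, e.g. loss = base_loss(enhanced, clean) + lambda_paap * paap_loss(enhanced, clean, phonemes, estimator, phoneme_weights), with the estimator kept frozen so gradients flow only into the enhancement model.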
Related papers
- Analysing the Impact of Audio Quality on the Use of Naturalistic
Long-Form Recordings for Infant-Directed Speech Research [62.997667081978825]
Modelling of early language acquisition aims to understand how infants bootstrap their language skills.
Recent developments have enabled the use of more naturalistic training data for computational models.
It is currently unclear how the sound quality could affect analyses and modelling experiments conducted on such data.
arXiv Detail & Related papers (2023-05-03T08:25:37Z)
- TAPLoss: A Temporal Acoustic Parameter Loss for Speech Enhancement [41.872384434583466]
We provide a differentiable estimator for four categories of low-level acoustic descriptors involving: frequency-related parameters, energy or amplitude-related parameters, spectral balance parameters, and temporal features.
We show that adding TAP as an auxiliary objective in speech enhancement produces speech with improved perceptual quality and intelligibility.
arXiv Detail & Related papers (2023-02-16T04:57:11Z)
- Improving Speech Enhancement through Fine-Grained Speech Characteristics [42.49874064240742]
We propose a novel approach to speech enhancement aimed at improving perceptual quality and naturalness of enhanced signals.
We first identify key acoustic parameters that have been found to correlate well with voice quality.
We then propose objective functions which are aimed at reducing the difference between clean speech and enhanced speech with respect to these features.
arXiv Detail & Related papers (2022-07-01T07:04:28Z)
- MOSRA: Joint Mean Opinion Score and Room Acoustics Speech Quality Assessment [12.144133923535714]
This paper presents MOSRA: a non-intrusive multi-dimensional speech quality metric.
It can predict room acoustics parameters alongside the overall mean opinion score (MOS) for speech quality.
We also show that this joint training method enhances the blind estimation of room acoustics.
arXiv Detail & Related papers (2022-04-04T09:38:15Z)
- Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis [68.76620947298595]
Text does not fully specify the spoken form, so text-to-speech models must be able to learn from speech data that vary in ways not explained by the corresponding text.
We propose a model that generates speech explicitly conditioned on the three primary acoustic correlates of prosody.
arXiv Detail & Related papers (2021-06-15T18:03:48Z)
- Learning robust speech representation with an articulatory-regularized variational autoencoder [13.541055956177937]
We develop an articulatory model able to associate articulatory parameters describing the jaw, tongue, lips and velum configurations with vocal tract shapes and spectral features.
We show that this articulatory constraint improves model training by decreasing time to convergence and reconstruction loss at convergence, and yields better performance in a speech denoising task.
arXiv Detail & Related papers (2021-04-07T15:47:04Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
- High Fidelity Speech Regeneration with Application to Speech Enhancement [96.34618212590301]
We propose a wav-to-wav generative model for speech that can generate 24 kHz speech in real time.
Inspired by voice conversion methods, we train to augment the speech characteristics while preserving the identity of the source.
arXiv Detail & Related papers (2021-01-31T10:54:27Z)
- Joint Blind Room Acoustic Characterization From Speech And Music Signals Using Convolutional Recurrent Neural Networks [13.12834490248018]
Reverberation time, clarity, and direct-to-reverberant ratio are acoustic parameters that have been defined to describe reverberant environments.
Recent work combining audio signal processing with machine learning suggests that one could estimate those parameters blindly using speech or music signals.
We propose a robust end-to-end method to achieve blind joint acoustic parameter estimation using speech and/or music signals.
arXiv Detail & Related papers (2020-10-21T17:41:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information it provides and is not responsible for any consequences of its use.