MOSRA: Joint Mean Opinion Score and Room Acoustics Speech Quality
Assessment
- URL: http://arxiv.org/abs/2204.01345v1
- Date: Mon, 4 Apr 2022 09:38:15 GMT
- Title: MOSRA: Joint Mean Opinion Score and Room Acoustics Speech Quality
Assessment
- Authors: Karl El Hajal, Milos Cernak, Pablo Mainar
- Abstract summary: This paper presents MOSRA: a non-intrusive multi-dimensional speech quality metric.
It can predict room acoustics parameters alongside the overall mean opinion score (MOS) for speech quality.
We also show that this joint training method enhances the blind estimation of room acoustics.
- Score: 12.144133923535714
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The acoustic environment can degrade speech quality during communication
(e.g., video call, remote presentation, outside voice recording), and its
impact is often unknown. Objective metrics for speech quality have proven
challenging to develop given the multi-dimensionality of factors that affect
speech quality and the difficulty of collecting labeled data. Hypothesizing the
impact of acoustics on speech quality, this paper presents MOSRA: a
non-intrusive multi-dimensional speech quality metric that can predict room
acoustics parameters (SNR, STI, T60, DRR, and C50) alongside the overall mean
opinion score (MOS) for speech quality. By explicitly optimizing the model to
learn these room acoustics parameters, we can extract more informative features
and improve the generalization for the MOS task when the training data is
limited. Furthermore, we also show that this joint training method enhances the
blind estimation of room acoustics, improving the performance of current
state-of-the-art models. An additional side-effect of this joint prediction is
the improvement in the explainability of the predictions, which is a valuable
feature for many applications.
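The joint objective described in the abstract can be illustrated with a minimal sketch. The abstract does not state the loss function, so the weighted sum of mean squared errors below, the weight `aux_weight`, and the names `mse` and `joint_loss` are assumptions for illustration only, not the authors' implementation.

```python
from typing import Dict, Sequence

# Room acoustics parameters named in the abstract.
ROOM_PARAMS = ("SNR", "STI", "T60", "DRR", "C50")


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error over a batch of scalar predictions."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(
    pred: Dict[str, Sequence[float]],
    target: Dict[str, Sequence[float]],
    aux_weight: float = 0.5,  # hypothetical weight, not taken from the paper
) -> float:
    """Assumed joint loss: MOS error plus a weighted mean of the
    per-parameter room acoustics errors."""
    mos_term = mse(pred["MOS"], target["MOS"])
    aux_term = sum(mse(pred[k], target[k]) for k in ROOM_PARAMS) / len(ROOM_PARAMS)
    return mos_term + aux_weight * aux_term
```

Because the parameters live on very different scales (for example, T60 in seconds and SNR in dB), a practical version would likely normalize each target before summing; that step is omitted here.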
Related papers
- Towards Robust Transcription: Exploring Noise Injection Strategies for Training Data Augmentation [55.752737615873464]
This study investigates the impact of white noise at various Signal-to-Noise Ratio (SNR) levels on state-of-the-art automatic piano transcription (APT) models.
We hope this research provides valuable insights as preliminary work toward developing transcription models that maintain consistent performance across a range of acoustic conditions.
arXiv Detail & Related papers (2024-10-18T02:31:36Z)
- Explaining Deep Learning Embeddings for Speech Emotion Recognition by Predicting Interpretable Acoustic Features [5.678610585849838]
Pre-trained deep learning embeddings have consistently shown superior performance over handcrafted acoustic features in speech emotion recognition.
Unlike acoustic features with clear physical meaning, these embeddings lack clear interpretability.
This paper proposes a modified probing approach to explain deep learning embeddings in the speech emotion space.
arXiv Detail & Related papers (2024-09-14T19:18:56Z)
- Assessing the Generalization Gap of Learning-Based Speech Enhancement Systems in Noisy and Reverberant Environments [0.7366405857677227]
Generalization to unseen conditions is typically assessed by testing the system with a new speech, noise or room impulse response database.
The present study introduces a generalization assessment framework that uses a reference model trained on the test condition.
The proposed framework is applied to evaluate the generalization potential of a feedforward neural network (FFNN), ConvTasNet, DCCRN and MANNER.
arXiv Detail & Related papers (2023-09-12T12:51:12Z)
- Analysing the Impact of Audio Quality on the Use of Naturalistic Long-Form Recordings for Infant-Directed Speech Research [62.997667081978825]
Modelling of early language acquisition aims to understand how infants bootstrap their language skills.
Recent developments have enabled the use of more naturalistic training data for computational models.
It is currently unclear how the sound quality could affect analyses and modelling experiments conducted on such data.
arXiv Detail & Related papers (2023-05-03T08:25:37Z)
- PAAPLoss: A Phonetic-Aligned Acoustic Parameter Loss for Speech Enhancement [41.872384434583466]
We propose a learning objective that formalizes differences in perceptual quality.
We identify temporal acoustic parameters that are non-differentiable.
We develop a neural network estimator that can accurately predict their time-series values.
arXiv Detail & Related papers (2023-02-16T05:17:06Z)
- TAPLoss: A Temporal Acoustic Parameter Loss for Speech Enhancement [41.872384434583466]
We provide a differentiable estimator for four categories of low-level acoustic descriptors involving: frequency-related parameters, energy or amplitude-related parameters, spectral balance parameters, and temporal features.
We show that adding TAP as an auxiliary objective in speech enhancement produces speech with improved perceptual quality and intelligibility.
arXiv Detail & Related papers (2023-02-16T04:57:11Z)
- Inference and Denoise: Causal Inference-based Neural Speech Enhancement [83.4641575757706]
This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the noise presence as an intervention.
The proposed causal inference-based speech enhancement (CISE) separates clean and noisy frames in an intervened noisy speech using a noise detector and assigns both sets of frames to two mask-based enhancement modules (EMs) to perform noise-conditional SE.
arXiv Detail & Related papers (2022-11-02T15:03:50Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
We show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment Model with Cross-Domain Features [30.57631206882462]
The MOSA-Net is designed to estimate speech quality, intelligibility, and distortion assessment scores based on a test speech signal as input.
We show that the MOSA-Net can precisely predict perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and speech distortion index (SDI) scores when tested on both noisy and enhanced speech utterances.
arXiv Detail & Related papers (2021-11-03T17:30:43Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)