Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment
Model with Cross-Domain Features
- URL: http://arxiv.org/abs/2111.02363v1
- Date: Wed, 3 Nov 2021 17:30:43 GMT
- Title: Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment
Model with Cross-Domain Features
- Authors: Ryandhimas E. Zezario, Szu-Wei Fu, Fei Chen, Chiou-Shann Fuh, Hsin-Min
Wang, Yu Tsao
- Abstract summary: The MOSA-Net is designed to estimate speech quality, intelligibility, and distortion assessment scores based on a test speech signal as input.
We show that the MOSA-Net can precisely predict perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and speech distortion index (BLS) scores when tested on both noisy and enhanced speech utterances.
- Score: 30.57631206882462
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this study, we propose a cross-domain multi-objective speech assessment
model, i.e., the MOSA-Net, which can estimate multiple speech assessment
metrics simultaneously. More specifically, the MOSA-Net is designed to estimate
speech quality, intelligibility, and distortion assessment scores based on a
test speech signal as input. It comprises a convolutional neural network and
bidirectional long short-term memory (CNN-BLSTM) architecture for
representation extraction, as well as a multiplicative attention layer and a
fully-connected layer for each assessment metric. In addition, cross-domain
features (spectral and time-domain features) and latent representations from
self-supervised learned models are used as inputs to combine rich acoustic
information from different speech representations to obtain more accurate
assessments. Experimental results reveal that the MOSA-Net can precisely
predict perceptual evaluation of speech quality (PESQ), short-time objective
intelligibility (STOI), and speech distortion index (SDI) scores when tested on
both noisy and enhanced speech utterances under either seen test conditions
(where the test speakers and noise types are involved in the training set) or
unseen test conditions (where the test speakers and noise types are not
involved in the training set). In light of the confirmed prediction capability,
we further adopt the latent representations of the MOSA-Net to guide the speech
enhancement (SE) process and derive a quality-intelligibility (QI)-aware SE
(QIA-SE) approach accordingly. Experimental results show that QIA-SE provides
superior enhancement performance compared with the baseline SE system in terms
of objective evaluation metrics and qualitative evaluation test.
Related papers
- Automated Speaking Assessment of Conversation Tests with Novel Graph-based Modeling on Spoken Response Coherence [11.217656140423207]
ASAC aims to evaluate the overall speaking proficiency of an L2 speaker in a setting where an interlocutor interacts with one or more candidates.
We propose a hierarchical graph model that aptly incorporates both broad inter-response interactions and nuanced semantic information.
Extensive experimental results on the NICT-JLE benchmark dataset suggest that our proposed modeling approach can yield considerable improvements in prediction accuracy.
arXiv Detail & Related papers (2024-09-11T07:24:07Z) - Analysing the Impact of Audio Quality on the Use of Naturalistic
Long-Form Recordings for Infant-Directed Speech Research [62.997667081978825]
Modelling of early language acquisition aims to understand how infants bootstrap their language skills.
Recent developments have enabled the use of more naturalistic training data for computational models.
It is currently unclear how the sound quality could affect analyses and modelling experiments conducted on such data.
arXiv Detail & Related papers (2023-05-03T08:25:37Z) - MBI-Net: A Non-Intrusive Multi-Branched Speech Intelligibility
Prediction Model for Hearing Aids [22.736703635666164]
We propose a multi-branched speech intelligibility prediction model (MBI-Net) for predicting subjective intelligibility scores of hearing aid (HA) users.
The outputs of the two branches are fused through a linear layer to obtain predicted speech intelligibility scores.
arXiv Detail & Related papers (2022-04-07T09:13:44Z) - HASA-net: A non-intrusive hearing-aid speech assessment network [52.83357278948373]
We propose a DNN-based hearing aid speech assessment network (HASA-Net) to predict speech quality and intelligibility scores simultaneously.
To the best of our knowledge, HASA-Net is the first work to incorporate quality and intelligibility assessments utilizing a unified DNN-based non-intrusive model for hearing aids.
Experimental results show that the predicted speech quality and intelligibility scores of HASA-Net are highly correlated to two well-known intrusive hearing-aid evaluation metrics.
arXiv Detail & Related papers (2021-11-10T14:10:13Z) - InQSS: a speech intelligibility assessment model using a multi-task
learning network [21.037410575414995]
In this study, we propose InQSS, a speech intelligibility assessment model that uses both spectrogram and scattering coefficients as input features.
The resulting model can predict not only the intelligibility scores but also the quality scores of a speech.
arXiv Detail & Related papers (2021-11-04T02:01:27Z) - LDNet: Unified Listener Dependent Modeling in MOS Prediction for
Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z) - An Exploration of Self-Supervised Pretrained Representations for
End-to-End Speech Recognition [98.70304981174748]
We focus on the general applications of pretrained speech representations, on advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z) - Score-informed Networks for Music Performance Assessment [64.12728872707446]
Deep neural network-based methods incorporating score information into MPA models have not yet been investigated.
We introduce three different models capable of score-informed performance assessment.
arXiv Detail & Related papers (2020-08-01T07:46:24Z) - Deep Speaker Embeddings for Far-Field Speaker Recognition on Short
Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed to achieve two goals: a) improve the quality of far-field speaker verification systems in the presence of environmental noise, reverberation and b) reduce the system qualitydegradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.