Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment
Model with Cross-Domain Features
- URL: http://arxiv.org/abs/2111.02363v1
- Date: Wed, 3 Nov 2021 17:30:43 GMT
- Title: Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment
Model with Cross-Domain Features
- Authors: Ryandhimas E. Zezario, Szu-Wei Fu, Fei Chen, Chiou-Shann Fuh, Hsin-Min
Wang, Yu Tsao
- Abstract summary: The MOSA-Net is designed to estimate speech quality, intelligibility, and distortion assessment scores based on a test speech signal as input.
We show that the MOSA-Net can precisely predict perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and speech distortion index (SDI) scores when tested on both noisy and enhanced speech utterances.
- Score: 30.57631206882462
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this study, we propose a cross-domain multi-objective speech assessment
model, i.e., the MOSA-Net, which can estimate multiple speech assessment
metrics simultaneously. More specifically, the MOSA-Net is designed to estimate
speech quality, intelligibility, and distortion assessment scores based on a
test speech signal as input. It comprises a convolutional neural network and
bidirectional long short-term memory (CNN-BLSTM) architecture for
representation extraction, as well as a multiplicative attention layer and a
fully-connected layer for each assessment metric. In addition, cross-domain
features (spectral and time-domain features) and latent representations from
self-supervised learned models are used as inputs to combine rich acoustic
information from different speech representations to obtain more accurate
assessments. Experimental results reveal that the MOSA-Net can precisely
predict perceptual evaluation of speech quality (PESQ), short-time objective
intelligibility (STOI), and speech distortion index (SDI) scores when tested on
both noisy and enhanced speech utterances under either seen test conditions
(where the test speakers and noise types are involved in the training set) or
unseen test conditions (where the test speakers and noise types are not
involved in the training set). In light of the confirmed prediction capability,
we further adopt the latent representations of the MOSA-Net to guide the speech
enhancement (SE) process and derive a quality-intelligibility (QI)-aware SE
(QIA-SE) approach accordingly. Experimental results show that QIA-SE provides
superior enhancement performance compared with the baseline SE system in terms
of objective evaluation metrics and qualitative evaluation tests.
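The per-metric output stage described in the abstract (a multiplicative attention layer followed by a fully-connected layer for each assessment metric) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation. Several things are assumptions made for the example: the random frame vectors stand in for CNN-BLSTM outputs, the attention is a self-attention of the form softmax(H W H^T) H, the frame-level scores are averaged into one utterance-level score, and every name (`multiplicative_attention`, `metric_head`, `build_heads`, `assess`) is hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def multiplicative_attention(frames, weight):
    """Self-attention with scores[t][s] = frames[t] @ weight @ frames[s] (assumed form)."""
    scores = matmul(matmul(frames, weight), transpose(frames))
    attn = [softmax(row) for row in scores]  # each row sums to 1
    return matmul(attn, frames)              # one context vector per frame


def metric_head(frames, attn_weight, fc_weight, fc_bias):
    """Attention, then a linear layer per frame, then an average over frames (assumed pooling)."""
    context = multiplicative_attention(frames, attn_weight)
    frame_scores = [sum(w * x for w, x in zip(fc_weight, vec)) + fc_bias for vec in context]
    return sum(frame_scores) / len(frame_scores)


def random_matrix(rng, rows, cols, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def build_heads(rng, dim, metrics=("PESQ", "STOI", "SDI")):
    """One independent attention + fully-connected head per assessment metric."""
    return {
        name: {"attn": random_matrix(rng, dim, dim),
               "fc": random_matrix(rng, 1, dim)[0],
               "bias": 0.0}
        for name in metrics
    }


def assess(frames, heads):
    """Feed the same shared representation to every head; return one score per metric."""
    return {name: metric_head(frames, p["attn"], p["fc"], p["bias"])
            for name, p in heads.items()}


rng = random.Random(0)
DIM, NUM_FRAMES = 4, 6
# Placeholder for the shared CNN-BLSTM representation (untrained, random values).
shared_frames = random_matrix(rng, NUM_FRAMES, DIM, scale=1.0)
heads = build_heads(rng, DIM)
scores = assess(shared_frames, heads)
print(scores)
```

The weights here are random, so the printed numbers carry no meaning. The point of the sketch is only the structure: a single shared representation feeds three separate heads, which is what allows several metrics to be estimated simultaneously.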
Related papers
- A Probabilistic Perspective on Unlearning and Alignment for Large Language Models [48.96686419141881]
We introduce the first formal probabilistic evaluation framework for Large Language Models (LLMs).
Namely, we propose novel metrics with high probability guarantees concerning the output distribution of a model.
Our metrics are application-independent and allow practitioners to make more reliable estimates about model capabilities before deployment.
arXiv Detail & Related papers (2024-10-04T15:44:23Z)
- Feature Denoising Diffusion Model for Blind Image Quality Assessment [58.5808754919597]
Blind Image Quality Assessment (BIQA) aims to evaluate image quality in line with human perception, without reference benchmarks.
Deep learning BIQA methods typically depend on using features from high-level tasks for transfer learning.
In this paper, we take an initial step towards exploring the diffusion model for feature denoising in BIQA.
arXiv Detail & Related papers (2024-01-22T13:38:24Z)
- HAAQI-Net: A Non-intrusive Neural Music Audio Quality Assessment Model for Hearing Aids [30.305000305766193]
This paper introduces HAAQI-Net, a non-intrusive deep learning-based music audio quality assessment model for hearing aid users.
It can predict HAAQI scores directly from music audio clips and hearing loss patterns.
arXiv Detail & Related papers (2024-01-02T10:55:01Z)
- Multi-Task Pseudo-Label Learning for Non-Intrusive Speech Quality Assessment Model [28.32514067707762]
This study proposes a multi-task pseudo-label learning (MPL)-based non-intrusive speech quality assessment model called MTQ-Net.
MPL consists of two stages: obtaining pseudo-label scores from a pretrained model and performing multi-task learning.
The MTQ-Net with the MPL approach exhibits higher overall predictive power compared to other SSL-based speech assessment models.
arXiv Detail & Related papers (2023-08-18T02:36:21Z)
- Evaluation of Speech Representations for MOS prediction [0.7329200485567826]
In this paper, we evaluate feature extraction models for predicting speech quality.
We also propose a model architecture to compare embeddings of supervised learning and self-supervised learning models with embeddings of speaker verification models.
arXiv Detail & Related papers (2023-06-16T17:21:42Z)
- How to Estimate Model Transferability of Pre-Trained Speech Models? [84.11085139766108]
We introduce a "score-based assessment" framework for estimating the transferability of pre-trained speech models (PSMs).
We leverage two representation theories, Bayesian likelihood estimation and optimal transport, to generate rank scores for the PSM candidates.
Our framework efficiently computes transferability scores without actual fine-tuning of candidate models or layers.
arXiv Detail & Related papers (2023-06-01T04:52:26Z)
- CCATMos: Convolutional Context-aware Transformer Network for Non-intrusive Speech Quality Assessment [12.497279501767606]
We propose a novel end-to-end model structure called Convolutional Context-Aware Transformer (CCAT) network to predict the mean opinion score (MOS) of human raters.
We evaluate our model on three MOS-annotated datasets spanning multiple languages and distortion types and submit our results to the ConferencingSpeech 2022 Challenge.
arXiv Detail & Related papers (2022-11-04T16:46:11Z)
- CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms.
Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner.
Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
arXiv Detail & Related papers (2022-06-29T15:22:01Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to large parameter capacity, but also lead to huge computation cost.
We explore accelerating large-model inference through conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
- STOI-Net: A Deep Learning based Non-Intrusive Speech Intelligibility Assessment Model [24.965732699885262]
We propose a deep learning-based non-intrusive speech intelligibility assessment model, namely STOI-Net.
The model is formed by the combination of a convolutional neural network and bidirectional long short-term memory (CNN-BLSTM) architecture with a multiplicative attention mechanism.
Experimental results show that the STOI score estimated by STOI-Net has a good correlation with the actual STOI score when tested with noisy and enhanced speech utterances.
arXiv Detail & Related papers (2020-11-09T09:57:10Z)