Related papers: Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment Model with Cross-Domain Features

URL: http://arxiv.org/abs/2111.02363v5
Date: Thu, 19 Dec 2024 09:18:11 GMT
Title: Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment Model with Cross-Domain Features
Authors: Ryandhimas E. Zezario, Szu-Wei Fu, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao
Abstract summary: We propose a cross-domain multi-objective speech assessment model called MOSA-Net, which can estimate multiple speech assessment metrics simultaneously. Experimental results show that MOSA-Net can improve the linear correlation coefficient (LCC) by 0.026 (0.990 vs 0.964 in seen noise environments) and 0.012 (0.969 vs 0.957 in unseen noise environments) in perceptual evaluation of speech quality (PESQ) prediction.
Score: 31.59528815233441
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this study, we propose a cross-domain multi-objective speech assessment model called MOSA-Net, which can estimate multiple speech assessment metrics simultaneously. Experimental results show that MOSA-Net can improve the linear correlation coefficient (LCC) by 0.026 (0.990 vs 0.964 in seen noise environments) and 0.012 (0.969 vs 0.957 in unseen noise environments) in perceptual evaluation of speech quality (PESQ) prediction, compared to Quality-Net, an existing single-task model for PESQ prediction, and improve LCC by 0.021 (0.985 vs 0.964 in seen noise environments) and 0.047 (0.836 vs 0.789 in unseen noise environments) in short-time objective intelligibility (STOI) prediction, compared to STOI-Net (based on CRNN), an existing single-task model for STOI prediction. Moreover, MOSA-Net, originally trained to assess objective scores, can be used as a pre-trained model to be effectively adapted to an assessment model for predicting subjective quality and intelligibility scores with a limited amount of training data. Experimental results show that MOSA-Net can improve LCC by 0.018 (0.805 vs 0.787) in mean opinion score (MOS) prediction, compared to MOS-SSL, a strong single-task model for MOS prediction. In light of the confirmed prediction capability, we further adopt the latent representations of MOSA-Net to guide the speech enhancement (SE) process and derive a quality-intelligibility (QI)-aware SE (QIA-SE) approach accordingly. Experimental results show that QIA-SE provides superior enhancement performance compared with the baseline SE system in terms of objective evaluation metrics and qualitative evaluation test. For example, QIA-SE can improve PESQ by 0.301 (2.953 vs 2.652 in seen noise environments) and 0.18 (2.658 vs 2.478 in unseen noise environments) over a CNN-based baseline SE model.
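The abstract reports its gains as differences in the linear correlation coefficient (LCC) between predicted and reference scores. As a minimal sketch, assuming LCC here means the standard Pearson correlation, the snippet below computes it for a pair of score lists and re-derives the reported gains from the quoted values. The function name `lcc` and the sample PESQ scores are hypothetical and chosen only for illustration; they are not taken from the paper or its code.

```python
import math


def lcc(predicted, target):
    """Pearson linear correlation between two equal-length score sequences."""
    if len(predicted) != len(target) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_t = sum(target) / n
    # Covariance and variances are left unnormalised; the 1/n factors cancel.
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, target))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        raise ValueError("LCC is undefined when either sequence is constant")
    return cov / math.sqrt(var_p * var_t)


# Hypothetical scores, invented for illustration (not data from the paper).
true_pesq = [1.2, 1.9, 2.4, 3.1, 3.8]
pred_pesq = [1.3, 1.8, 2.6, 3.0, 3.7]
print("example LCC:", round(lcc(pred_pesq, true_pesq), 3))

# (MOSA-Net LCC, baseline LCC) pairs as quoted in the abstract.
reported = {
    "PESQ, seen noise": (0.990, 0.964),
    "PESQ, unseen noise": (0.969, 0.957),
    "STOI, seen noise": (0.985, 0.964),
    "STOI, unseen noise": (0.836, 0.789),
    "MOS": (0.805, 0.787),
}
for name, (mosa_net, baseline) in reported.items():
    print(f"{name}: gain = {round(mosa_net - baseline, 3)}")
```

Rounded to three decimals, each computed difference should agree with the corresponding improvement stated in the abstract.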

Related papers

Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation [103.66549325018741]
We introduce two key metrics that show differences in current benchmarks. We demonstrate that benchmarks with a better signal-to-noise ratio are more reliable when making decisions at small scale. We conclude by recommending that those creating new benchmarks, or selecting which existing benchmarks to use, aim for high signal and low noise.
arXiv Detail & Related papers (2025-08-18T17:56:04Z)
A Probabilistic Perspective on Unlearning and Alignment for Large Language Models [48.96686419141881]
We introduce the first formal probabilistic evaluation framework for Large Language Models (LLMs). Namely, we propose novel metrics with high probability guarantees concerning the output distribution of a model. Our metrics are application-independent and allow practitioners to make more reliable estimates about model capabilities before deployment.
arXiv Detail & Related papers (2024-10-04T15:44:23Z)
Automated Speaking Assessment of Conversation Tests with Novel Graph-based Modeling on Spoken Response Coherence [11.217656140423207]
ASAC aims to evaluate the overall speaking proficiency of an L2 speaker in a setting where an interlocutor interacts with one or more candidates. We propose a hierarchical graph model that aptly incorporates both broad inter-response interactions and nuanced semantic information. Extensive experimental results on the NICT-JLE benchmark dataset suggest that our proposed modeling approach can yield considerable improvements in prediction accuracy.
arXiv Detail & Related papers (2024-09-11T07:24:07Z)
A Mixture of Experts (MoE) model to improve AI-based computational pathology prediction performance under variable levels of histopathology image blur [0.0]
We introduce a mixture of experts (MoE) strategy that combines predictions from multiple expert models trained on data with varying blur levels. Our results show that baseline models' performance consistently decreased with increasing blur. MoE-CNN_CLAM outperformed the baseline CNN_CLAM under moderate and mixed blur conditions.
arXiv Detail & Related papers (2024-05-15T12:40:41Z)
Feature Denoising Diffusion Model for Blind Image Quality Assessment [58.5808754919597]
Blind Image Quality Assessment (BIQA) aims to evaluate image quality in line with human perception, without reference benchmarks. Deep learning BIQA methods typically depend on using features from high-level tasks for transfer learning. In this paper, we take an initial step towards exploring the diffusion model for feature denoising in BIQA.
arXiv Detail & Related papers (2024-01-22T13:38:24Z)
HAAQI-Net: A Non-intrusive Neural Music Audio Quality Assessment Model for Hearing Aids [30.305000305766193]
This paper introduces HAAQI-Net, a non-intrusive deep learning-based music audio quality assessment model for hearing aid users. It can predict HAAQI scores directly from music audio clips and hearing loss patterns.
arXiv Detail & Related papers (2024-01-02T10:55:01Z)
Multi-Task Pseudo-Label Learning for Non-Intrusive Speech Quality Assessment Model [28.32514067707762]
This study proposes a multi-task pseudo-label learning (MPL)-based non-intrusive speech quality assessment model called MTQ-Net. MPL consists of two stages: obtaining pseudo-label scores from a pretrained model and performing multi-task learning. The MTQ-Net with the MPL approach exhibits higher overall predictive power compared to other SSL-based speech assessment models.
arXiv Detail & Related papers (2023-08-18T02:36:21Z)
Evaluation of Speech Representations for MOS prediction [0.7329200485567826]
In this paper, we evaluate feature extraction models for predicting speech quality. We also propose a model architecture to compare embeddings of supervised learning and self-supervised learning models with embeddings of speaker verification models.
arXiv Detail & Related papers (2023-06-16T17:21:42Z)
How to Estimate Model Transferability of Pre-Trained Speech Models? [84.11085139766108]
We introduce a "score-based assessment" framework for estimating the transferability of pre-trained speech models (PSMs). We leverage two representation theories, Bayesian likelihood estimation and optimal transport, to generate rank scores for the PSM candidates. Our framework efficiently computes transferability scores without actual fine-tuning of candidate models or layers.
arXiv Detail & Related papers (2023-06-01T04:52:26Z)
Analysing the Impact of Audio Quality on the Use of Naturalistic Long-Form Recordings for Infant-Directed Speech Research [62.997667081978825]
Modelling of early language acquisition aims to understand how infants bootstrap their language skills. Recent developments have enabled the use of more naturalistic training data for computational models. It is currently unclear how the sound quality could affect analyses and modelling experiments conducted on such data.
arXiv Detail & Related papers (2023-05-03T08:25:37Z)
CCATMos: Convolutional Context-aware Transformer Network for Non-intrusive Speech Quality Assessment [12.497279501767606]
We propose a novel end-to-end model structure called Convolutional Context-Aware Transformer (CCAT) network to predict the mean opinion score (MOS) of human raters. We evaluate our model on three MOS-annotated datasets spanning multiple languages and distortion types and submit our results to the ConferencingSpeech 2022 Challenge.
arXiv Detail & Related papers (2022-11-04T16:46:11Z)
CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms. Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner. Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
arXiv Detail & Related papers (2022-06-29T15:22:01Z)
MBI-Net: A Non-Intrusive Multi-Branched Speech Intelligibility Prediction Model for Hearing Aids [22.736703635666164]
We propose a multi-branched speech intelligibility prediction model (MBI-Net) for predicting subjective intelligibility scores of hearing aid (HA) users. The outputs of the two branches are fused through a linear layer to obtain predicted speech intelligibility scores.
arXiv Detail & Related papers (2022-04-07T09:13:44Z)
HASA-net: A non-intrusive hearing-aid speech assessment network [52.83357278948373]
We propose a DNN-based hearing aid speech assessment network (HASA-Net) to predict speech quality and intelligibility scores simultaneously. To the best of our knowledge, HASA-Net is the first work to incorporate quality and intelligibility assessments utilizing a unified DNN-based non-intrusive model for hearing aids. Experimental results show that the predicted speech quality and intelligibility scores of HASA-Net are highly correlated to two well-known intrusive hearing-aid evaluation metrics.
arXiv Detail & Related papers (2021-11-10T14:10:13Z)
InQSS: a speech intelligibility assessment model using a multi-task learning network [21.037410575414995]
In this study, we propose InQSS, a speech intelligibility assessment model that uses both spectrogram and scattering coefficients as input features. The resulting model can predict not only the intelligibility scores but also the quality scores of a speech.
arXiv Detail & Related papers (2021-11-04T02:01:27Z)
LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction. We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models. We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z)
MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to large parameter capacity, but also lead to huge computation cost. We explore accelerating large-model inference by conditional computation based on the sparse activation phenomenon. We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
STOI-Net: A Deep Learning based Non-Intrusive Speech Intelligibility Assessment Model [24.965732699885262]
We propose a deep learning-based non-intrusive speech intelligibility assessment model, namely STOI-Net. The model is formed by the combination of a convolutional neural network and bidirectional long short-term memory (CNN-BLSTM) architecture with a multiplicative attention mechanism. Experimental results show that the STOI score estimated by STOI-Net has a good correlation with the actual STOI score when tested with noisy and enhanced speech utterances.
arXiv Detail & Related papers (2020-11-09T09:57:10Z)
Score-informed Networks for Music Performance Assessment [64.12728872707446]
Deep neural network-based methods incorporating score information into MPA models have not yet been investigated. We introduce three different models capable of score-informed performance assessment.
arXiv Detail & Related papers (2020-08-01T07:46:24Z)
Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions. Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks. This paper presents approaches aimed at achieving two goals: a) improve the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reduce the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.