Non-Intrusive Speech Intelligibility Prediction for Hearing-Impaired
Users using Intermediate ASR Features and Human Memory Models
- URL: http://arxiv.org/abs/2401.13611v1
- Date: Wed, 24 Jan 2024 17:31:07 GMT
- Title: Non-Intrusive Speech Intelligibility Prediction for Hearing-Impaired
Users using Intermediate ASR Features and Human Memory Models
- Authors: Rhiannon Mogridge, George Close, Robert Sutherland, Thomas Hain, Jon
Barker, Stefan Goetze, Anton Ragni
- Abstract summary: This work combines the use of Whisper ASR decoder layer representations as neural network input features with an exemplar-based, psychologically motivated model of human memory to predict human intelligibility ratings for hearing-aid users.
Substantial performance improvement over an established intrusive HASPI baseline system is found, including on enhancement systems and listeners unseen in the training data, with a root mean squared error of 25.3 compared with the baseline of 28.7.
- Score: 29.511898279006175
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural networks have been successfully used for non-intrusive speech
intelligibility prediction. Recently, the use of feature representations
sourced from intermediate layers of pre-trained self-supervised and
weakly-supervised models has been found to be particularly useful for this
task. This work combines the use of Whisper ASR decoder layer representations
as neural network input features with an exemplar-based, psychologically
motivated model of human memory to predict human intelligibility ratings for
hearing-aid users. Substantial performance improvement over an established
intrusive HASPI baseline system is found, including on enhancement systems and
listeners unseen in the training data, with a root mean squared error of 25.3
compared with the baseline of 28.7.
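No code accompanies this listing, so the following is a minimal sketch of the general idea, assuming per-utterance Whisper decoder-layer features have already been extracted and mean-pooled. The random feature arrays, the cosine-similarity kernel, and the temperature parameter are illustrative assumptions, not the authors' implementation. It scores each test utterance as a similarity-weighted average of the intelligibility ratings stored for training exemplars, in the spirit of an exemplar-based memory model, and evaluates predictions with the RMSE metric quoted above.

```python
import numpy as np

# Illustrative stand-ins: in the paper, per-utterance features come from
# intermediate Whisper ASR decoder layers; here random vectors of a plausible
# dimensionality are used instead (an assumption, not the authors' pipeline).
rng = np.random.default_rng(0)
feat_dim = 768
exemplar_feats = rng.normal(size=(200, feat_dim))   # stored training exemplars
exemplar_scores = rng.uniform(0, 100, size=200)     # their intelligibility ratings (%)
test_feats = rng.normal(size=(50, feat_dim))        # unseen test utterances
test_scores = rng.uniform(0, 100, size=50)          # ground-truth listener ratings


def exemplar_predict(query, feats, scores, temperature=10.0):
    """Exemplar-style prediction: similarity-weighted average of stored ratings.

    The cosine-similarity kernel and temperature are choices made for this
    sketch; the paper uses a psychologically motivated exemplar memory model.
    """
    q = query / np.linalg.norm(query)
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = f @ q                         # cosine similarity to each exemplar
    weights = np.exp(temperature * sims)
    weights /= weights.sum()
    return float(weights @ scores)


preds = np.array([exemplar_predict(x, exemplar_feats, exemplar_scores)
                  for x in test_feats])

# RMSE, the metric used above to compare against the intrusive HASPI baseline.
rmse = float(np.sqrt(np.mean((preds - test_scores) ** 2)))
print(f"RMSE on synthetic data: {rmse:.1f}")
```

In the actual system, the exemplars would be drawn from the training listeners and enhancement systems, with evaluation performed on systems and listeners unseen in training, as described in the abstract.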
Related papers
- Augmenting Unsupervised Reinforcement Learning with Self-Reference [63.68018737038331]
Humans possess the ability to draw on past experiences explicitly when learning new tasks.
We propose the Self-Reference (SR) approach, an add-on module explicitly designed to leverage historical information.
Our approach achieves state-of-the-art results in terms of Interquartile Mean (IQM) performance and Optimality Gap reduction on the Unsupervised Reinforcement Learning Benchmark.
arXiv Detail & Related papers (2023-11-16T09:07:34Z) - Deep Neural Networks Tend To Extrapolate Predictably [51.303814412294514]
Conventional wisdom suggests that neural network predictions tend to be unpredictable and overconfident when faced with out-of-distribution (OOD) inputs.
We observe that neural network predictions often tend towards a constant value as input data becomes increasingly OOD.
We show how one can leverage our insights in practice to enable risk-sensitive decision-making in the presence of OOD inputs.
arXiv Detail & Related papers (2023-10-02T03:25:32Z) - Non Intrusive Intelligibility Predictor for Hearing Impaired Individuals
using Self Supervised Speech Representations [21.237026538221404]
Techniques for non-intrusive prediction of speech quality (SQ) ratings are extended to the prediction of intelligibility for hearing-impaired users.
It is found that self-supervised representations are useful as input features to non-intrusive prediction models.
arXiv Detail & Related papers (2023-07-25T11:42:52Z) - Phonetic and Prosody-aware Self-supervised Learning Approach for
Non-native Fluency Scoring [13.817385516193445]
Speech fluency/disfluency can be evaluated by analyzing a range of phonetic and prosodic features.
Deep neural networks are commonly trained to map fluency-related features into the human scores.
We introduce a self-supervised learning (SSL) approach that takes into account phonetic and prosody awareness for fluency scoring.
arXiv Detail & Related papers (2023-05-19T05:39:41Z) - NCTV: Neural Clamping Toolkit and Visualization for Neural Network
Calibration [66.22668336495175]
Neural networks deployed without consideration for calibration are unlikely to gain trust from humans.
We introduce the Neural Clamping Toolkit, the first open-source framework designed to help developers employ state-of-the-art model-agnostic calibrated models.
arXiv Detail & Related papers (2022-11-29T15:03:05Z) - Neural networks trained with SGD learn distributions of increasing
complexity [78.30235086565388]
We show that neural networks trained using gradient descent initially classify their inputs using lower-order input statistics.
They exploit higher-order statistics only later during training.
We discuss the relation of this distributional simplicity bias (DSB) to other simplicity biases and consider its implications for the principle of universality in learning.
arXiv Detail & Related papers (2022-11-21T15:27:22Z) - Robust Deep Neural Network Estimation for Multi-dimensional Functional
Data [0.22843885788439797]
We propose a robust estimator for the location function from multi-dimensional functional data.
The proposed estimators are based on the deep neural networks with ReLU activation function.
The proposed method is also applied to analyze 2D and 3D images of patients with Alzheimer's disease.
arXiv Detail & Related papers (2022-05-19T14:53:33Z) - HASA-net: A non-intrusive hearing-aid speech assessment network [52.83357278948373]
We propose a DNN-based hearing aid speech assessment network (HASA-Net) to predict speech quality and intelligibility scores simultaneously.
To the best of our knowledge, HASA-Net is the first work to incorporate quality and intelligibility assessments utilizing a unified DNN-based non-intrusive model for hearing aids.
Experimental results show that the predicted speech quality and intelligibility scores of HASA-Net are highly correlated to two well-known intrusive hearing-aid evaluation metrics.
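A minimal multi-task prediction sketch in the spirit of this idea is given after this list.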
arXiv Detail & Related papers (2021-11-10T14:10:13Z) - PoCoNet: Better Speech Enhancement with Frequency-Positional Embeddings,
Semi-Supervised Conversational Data, and Biased Loss [26.851416177670096]
PoCoNet is a convolutional neural network that, with the use of frequency-positional embeddings, is able to more efficiently build frequency-dependent features in the early layers.
A semi-supervised method helps increase the amount of conversational training data by pre-enhancing noisy datasets.
A new loss function biased towards preserving speech quality helps the optimization better match human perceptual opinions on speech quality.
arXiv Detail & Related papers (2020-08-11T01:24:45Z) - Characterizing Speech Adversarial Examples Using Self-Attention U-Net
Enhancement [102.48582597586233]
We present a U-Net based attention model, U-Net$_At$, to enhance adversarial speech signals.
We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks.
arXiv Detail & Related papers (2020-03-31T02:16:34Z)
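Following the forward reference in the HASA-Net entry above, here is a minimal sketch of a non-intrusive multi-task assessment head: a shared encoder over pooled acoustic features feeding separate quality and intelligibility outputs. The layer sizes, feature dimensionality, and pooled-feature input are illustrative choices for this sketch, not the published HASA-Net architecture.

```python
import torch
import torch.nn as nn


class MultiTaskAssessor(nn.Module):
    """Shared encoder with separate quality and intelligibility heads.

    Layer sizes and the pooled-feature input are illustrative assumptions;
    this is not the published HASA-Net architecture.
    """

    def __init__(self, feat_dim=768, hidden_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        # Two task-specific heads trained jointly on the shared representation.
        self.quality_head = nn.Linear(hidden_dim, 1)
        self.intelligibility_head = nn.Linear(hidden_dim, 1)

    def forward(self, pooled_feats):
        h = self.encoder(pooled_feats)
        return self.quality_head(h).squeeze(-1), self.intelligibility_head(h).squeeze(-1)


# Example: score a batch of 8 utterances represented by pooled features.
model = MultiTaskAssessor()
feats = torch.randn(8, 768)
quality, intelligibility = model(feats)
print(quality.shape, intelligibility.shape)  # torch.Size([8]) torch.Size([8])
```

Training such a head jointly on both targets lets the two tasks share a common representation, which is the core idea the HASA-Net summary describes.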