Non-Intrusive Speech Intelligibility Prediction for Hearing Aids using Whisper and Metadata
- URL: http://arxiv.org/abs/2309.09548v2
- Date: Thu, 13 Jun 2024 14:50:49 GMT
- Title: Non-Intrusive Speech Intelligibility Prediction for Hearing Aids using Whisper and Metadata
- Authors: Ryandhimas E. Zezario, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao
- Abstract summary: We present three novel methods to improve intelligibility prediction accuracy.
MBI-Net+ is an enhanced version of MBI-Net, the top-performing system in the 1st Clarity Prediction Challenge.
- Score: 28.260347585185176
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated speech intelligibility assessment is pivotal for hearing aid (HA) development. In this paper, we present three novel methods to improve intelligibility prediction accuracy and introduce MBI-Net+, an enhanced version of MBI-Net, the top-performing system in the 1st Clarity Prediction Challenge. MBI-Net+ leverages Whisper's embeddings to create cross-domain acoustic features and includes metadata from speech signals by using a classifier that distinguishes different enhancement methods. Furthermore, MBI-Net+ integrates the hearing-aid speech perception index (HASPI) as a supplementary metric into the objective function to further boost prediction performance. Experimental results demonstrate that MBI-Net+ surpasses several intrusive baseline systems and MBI-Net on the Clarity Prediction Challenge 2023 dataset, validating the effectiveness of incorporating Whisper embeddings, speech metadata, and related complementary metrics to improve prediction performance for HA.
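A minimal sketch of the two ideas described in the abstract, assuming the openai-whisper and PyTorch packages: Whisper encoder embeddings serve as cross-domain acoustic features, and HASPI enters the objective as an auxiliary target. The loss weighting `alpha`, the helper names, and the use of the `base` checkpoint are illustrative assumptions, not the authors' released configuration.

```python
# Sketch only: Whisper embeddings as cross-domain features plus a
# HASPI-augmented objective. Not the authors' implementation.
import torch
import whisper

model = whisper.load_model("base")  # any pretrained Whisper checkpoint

def whisper_features(wav_path: str) -> torch.Tensor:
    """Frame-level Whisper encoder embeddings for one utterance."""
    audio = whisper.load_audio(wav_path)
    audio = whisper.pad_or_trim(audio)              # 30-second window
    mel = whisper.log_mel_spectrogram(audio).unsqueeze(0)
    with torch.no_grad():
        return model.embed_audio(mel)               # (1, frames, d_model)

def combined_loss(pred_intel, true_intel, pred_haspi, true_haspi, alpha=0.5):
    """Intelligibility MSE plus a weighted HASPI term (alpha is assumed)."""
    mse = torch.nn.functional.mse_loss
    return mse(pred_intel, true_intel) + alpha * mse(pred_haspi, true_haspi)
```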
Related papers
- Non-Intrusive Speech Intelligibility Prediction for Hearing-Impaired
Users using Intermediate ASR Features and Human Memory Models [29.511898279006175]
This work combines the use of Whisper ASR decoder layer representations as neural network input features with an exemplar-based, psychologically motivated model of human memory to predict human intelligibility ratings for hearing-aid users.
Substantial performance improvement over an established intrusive HASPI baseline system is found, including on enhancement systems and listeners unseen in the training data, with a root mean squared error of 25.3 compared with the baseline of 28.7.
arXiv Detail & Related papers (2024-01-24T17:31:07Z) - DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification [55.306583814017046]
We present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification.
DASA generates diversified training samples in speaker embedding space with negligible extra computing cost.
The best result achieves a 14.6% relative reduction in EER metric on CN-Celeb evaluation set.
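As a rough illustration of augmentation in speaker-embedding space (the exact DASA formulation differs; the difficulty weighting and diagonal-covariance simplification below are assumptions), one can perturb each embedding with noise drawn from an estimated per-speaker variance:

```python
# Illustrative embedding-space augmentation; not the published DASA recipe.
import torch

def semantic_augment(emb: torch.Tensor, class_var: torch.Tensor,
                     difficulty: torch.Tensor, strength: float = 0.1):
    """emb: (B, D) speaker embeddings; class_var: (B, D) per-class variances;
    difficulty: (B,) weights in [0, 1] controlling augmentation strength."""
    noise = torch.randn_like(emb) * class_var.sqrt()
    return emb + strength * difficulty.unsqueeze(1) * noise
```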
arXiv Detail & Related papers (2023-10-18T17:07:05Z) - MBI-Net: A Non-Intrusive Multi-Branched Speech Intelligibility
Prediction Model for Hearing Aids [22.736703635666164]
We propose a multi-branched speech intelligibility prediction model (MBI-Net) for predicting subjective intelligibility scores of hearing aid (HA) users.
The outputs of the two branches are fused through a linear layer to obtain predicted speech intelligibility scores.
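A rough PyTorch sketch of the fusion step described above; the per-branch encoders, feature dimensions, and pooling are illustrative assumptions rather than the published MBI-Net configuration.

```python
# Two branches (e.g. one per hearing-aid output channel) whose outputs are
# fused by a linear layer into a single intelligibility score. Sketch only.
import torch
import torch.nn as nn

class TwoBranchPredictor(nn.Module):
    def __init__(self, feat_dim: int = 257, hidden: int = 128):
        super().__init__()
        self.branch_l = nn.GRU(feat_dim, hidden, batch_first=True)
        self.branch_r = nn.GRU(feat_dim, hidden, batch_first=True)
        self.fusion = nn.Linear(2 * hidden, 1)      # linear fusion layer

    def forward(self, feats_l, feats_r):            # (B, T, feat_dim) each
        h_l, _ = self.branch_l(feats_l)
        h_r, _ = self.branch_r(feats_r)
        frame_scores = self.fusion(torch.cat([h_l, h_r], dim=-1))
        return frame_scores.mean(dim=1)             # utterance-level score
```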
arXiv Detail & Related papers (2022-04-07T09:13:44Z) - MetricGAN+/-: Increasing Robustness of Noise Reduction on Unseen Data [26.94528951545861]
We propose a "de-generator" which attempts to improve the robustness of the prediction network.
Experimental results on the VoiceBank-DEMAND dataset show relative improvement in PESQ score of 3.8%.
arXiv Detail & Related papers (2022-03-23T12:42:28Z) - HASA-net: A non-intrusive hearing-aid speech assessment network [52.83357278948373]
We propose a DNN-based hearing aid speech assessment network (HASA-Net) to predict speech quality and intelligibility scores simultaneously.
To the best of our knowledge, HASA-Net is the first work to incorporate quality and intelligibility assessments utilizing a unified DNN-based non-intrusive model for hearing aids.
Experimental results show that the predicted speech quality and intelligibility scores of HASA-Net are highly correlated to two well-known intrusive hearing-aid evaluation metrics.
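The unified quality-plus-intelligibility idea can be pictured as a shared encoder feeding two regression heads; the sketch below assumes dimensions and layer choices not given here and is not the HASA-Net architecture itself.

```python
# Shared non-intrusive features feed two heads that predict quality and
# intelligibility jointly. Dimensions and layers are assumptions.
import torch
import torch.nn as nn

class JointAssessmentHead(nn.Module):
    def __init__(self, feat_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.quality_head = nn.Linear(hidden, 1)          # e.g. HASQI-like target
        self.intelligibility_head = nn.Linear(hidden, 1)  # e.g. HASPI-like target

    def forward(self, feats):                             # (B, feat_dim)
        h = self.shared(feats)
        return self.quality_head(h), self.intelligibility_head(h)
```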
arXiv Detail & Related papers (2021-11-10T14:10:13Z) - Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment
Model with Cross-Domain Features [30.57631206882462]
The MOSA-Net is designed to estimate speech quality, intelligibility, and distortion assessment scores based on a test speech signal as input.
We show that the MOSA-Net can precisely predict perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and speech distortion index (SDI) scores when tested on both noisy and enhanced speech utterances.
arXiv Detail & Related papers (2021-11-03T17:30:43Z) - LDNet: Unified Listener Dependent Modeling in MOS Prediction for
Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z) - Bayesian Learning for Deep Neural Network Adaptation [57.70991105736059]
A key task for speech recognition systems is to reduce the mismatch between training and evaluation data that is often attributable to speaker differences.
Model-based speaker adaptation approaches often require sufficient amounts of target speaker data to ensure robustness.
This paper proposes a full Bayesian learning based DNN speaker adaptation framework to model speaker-dependent (SD) parameter uncertainty.
arXiv Detail & Related papers (2020-12-14T12:30:41Z) - Improving Speech Enhancement Performance by Leveraging Contextual Broad
Phonetic Class Information [33.79855927394387]
We explore the contextual information of articulatory attributes as additional information to further benefit speech enhancement.
We propose to improve the SE performance by leveraging losses from an end-to-end automatic speech recognition model.
Experimental results from speech denoising, speech dereverberation, and impaired speech enhancement tasks confirmed that contextual BPC information improves SE performance.
arXiv Detail & Related papers (2020-11-15T03:56:37Z) - Characterizing Speech Adversarial Examples Using Self-Attention U-Net
Enhancement [102.48582597586233]
We present a U-Net based attention model, U-Net$_{At}$, to enhance adversarial speech signals.
We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks.
arXiv Detail & Related papers (2020-03-31T02:16:34Z) - Deep Speaker Embeddings for Far-Field Speaker Recognition on Short
Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at achieving two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)