Speech foundation models on intelligibility prediction for
hearing-impaired listeners
- URL: http://arxiv.org/abs/2401.14289v1
- Date: Wed, 24 Jan 2024 18:26:52 GMT
- Title: Speech foundation models on intelligibility prediction for
hearing-impaired listeners
- Authors: Santiago Cuervo and Ricard Marxer
- Abstract summary: Speech foundation models (SFMs) have been benchmarked on many speech processing tasks.
We present a systematic evaluation of 10 SFMs on one such application: Speech intelligibility prediction.
We propose a simple method that learns a specialized prediction head on top of frozen SFMs to approach the problem.
- Score: 4.742307809368852
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech foundation models (SFMs) have been benchmarked on many speech
processing tasks, often achieving state-of-the-art performance with minimal
adaptation. However, the SFM paradigm has been significantly less explored for
applications of interest to the speech perception community. In this paper we
present a systematic evaluation of 10 SFMs on one such application: Speech
intelligibility prediction. We focus on the non-intrusive setup of the Clarity
Prediction Challenge 2 (CPC2), where the task is to predict the percentage of
words correctly perceived by hearing-impaired listeners from speech-in-noise
recordings. We propose a simple method that learns a lightweight specialized
prediction head on top of frozen SFMs to approach the problem. Our results
reveal statistically significant differences in performance across SFMs. Our
method resulted in the winning submission in the CPC2, demonstrating its
promise for speech perception applications.
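As a rough illustration of that recipe (a small trainable head on a frozen SFM), here is a minimal PyTorch sketch. It assumes wav2vec 2.0 from HuggingFace transformers as the SFM; the mean pooling and two-layer head are illustrative assumptions, not the authors' exact architecture, and the paper evaluates 10 different SFMs.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class IntelligibilityPredictor(nn.Module):
    """Lightweight prediction head on a frozen speech foundation model.

    Sketch only: wav2vec 2.0, mean pooling, and the MLP head below are
    assumptions, not the paper's exact design.
    """
    def __init__(self, sfm_name="facebook/wav2vec2-base", hidden=128):
        super().__init__()
        self.sfm = Wav2Vec2Model.from_pretrained(sfm_name)
        self.sfm.requires_grad_(False)   # keep the SFM frozen
        self.sfm.eval()                  # disable dropout in the frozen encoder
        self.head = nn.Sequential(
            nn.Linear(self.sfm.config.hidden_size, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),                # bound the prediction to [0, 1]
        )

    def forward(self, waveform):         # waveform: (batch, samples) at 16 kHz
        with torch.no_grad():            # no gradients through the SFM
            feats = self.sfm(waveform).last_hidden_state  # (B, T, D)
        pooled = feats.mean(dim=1)       # simple temporal mean pooling
        return self.head(pooled).squeeze(-1) * 100.0      # % words correct
```

Only the head's parameters would be passed to the optimizer, so training stays lightweight regardless of the SFM's size.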
Related papers
- Two-stage Framework for Robust Speech Emotion Recognition Using Target Speaker Extraction in Human Speech Noise Conditions [25.490988931354185]
We propose a novel two-stage framework for this problem that cascades a target speaker extraction (TSE) method with speech emotion recognition (SER).
We first train a TSE model to extract the speech of the target speaker from a mixture. In the second stage, we use the extracted speech for SER training.
Our system achieves a 14.33% improvement in unweighted accuracy (UA) over a baseline that does not use TSE.
arXiv Detail & Related papers (2024-09-29T07:04:50Z)
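A minimal sketch of that two-stage cascade, with tse and ser as hypothetical placeholder modules; the L1 loss stands in for whatever extraction objective the paper actually uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_two_stage(tse: nn.Module, ser: nn.Module, tse_loader, ser_loader):
    """Two-stage training sketch; `tse` and `ser` are placeholder modules."""
    # Stage 1: train the target-speaker-extraction model on mixtures,
    # conditioned on an enrollment utterance of the target speaker.
    tse_opt = torch.optim.Adam(tse.parameters())
    for mixture, enroll, target in tse_loader:
        est = tse(mixture, enroll)       # estimated target speech
        loss = F.l1_loss(est, target)    # stand-in for the paper's TSE loss
        tse_opt.zero_grad()
        loss.backward()
        tse_opt.step()

    # Stage 2: freeze the TSE model and train the emotion classifier
    # on the speech it extracts.
    tse.requires_grad_(False)
    ser_opt = torch.optim.Adam(ser.parameters())
    for mixture, enroll, emotion in ser_loader:
        with torch.no_grad():
            est = tse(mixture, enroll)   # extracted speech, no TSE gradients
        logits = ser(est)                # (B, n_emotions)
        loss = F.cross_entropy(logits, emotion)
        ser_opt.zero_grad()
        loss.backward()
        ser_opt.step()
```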
- On the Evaluation of Speech Foundation Models for Spoken Language Understanding [87.52911510306011]
The Spoken Language Understanding Evaluation (SLUE) suite of benchmark tasks was recently introduced to address the need for open resources and benchmarking.
The benchmark has demonstrated preliminary success in using pre-trained speech foundation models (SFM) for these SLU tasks.
We ask: which SFMs offer the most benefits for these complex SLU tasks, and what is the most effective approach for incorporating these SFMs?
arXiv Detail & Related papers (2024-06-14T14:37:52Z)
- TRNet: Two-level Refinement Network leveraging Speech Enhancement for Noise Robust Speech Emotion Recognition [29.756961194844717]
Results validate that the proposed TRNet substantially improves the system's robustness in both matched and mismatched noisy environments.
arXiv Detail & Related papers (2024-04-19T16:09:17Z)
- Inference and Denoise: Causal Inference-based Neural Speech Enhancement [83.4641575757706]
This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the noise presence as an intervention.
The proposed causal inference-based speech enhancement (CISE) framework separates clean and noisy frames in the intervened noisy speech using a noise detector, and assigns each set of frames to one of two mask-based enhancement modules (EMs) to perform noise-conditional SE.
arXiv Detail & Related papers (2022-11-02T15:03:50Z)
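A rough sketch of that frame-routing idea: a detector estimates per-frame noise presence and routes frames between two mask-based enhancement modules. The soft assignment and all module internals below are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class CISESketch(nn.Module):
    """Illustrative sketch: a noise detector splits frames into 'clean'
    and 'noisy' sets, each enhanced by its own mask-based module."""
    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        self.detector = nn.GRU(n_freq, hidden, batch_first=True)
        self.detect_out = nn.Linear(hidden, 1)     # per-frame noise logit
        self.em_clean = nn.Linear(n_freq, n_freq)  # mask-based EM (placeholder)
        self.em_noisy = nn.Linear(n_freq, n_freq)  # mask-based EM (placeholder)

    def forward(self, spec):                       # spec: (B, T, F) magnitudes
        h, _ = self.detector(spec)
        p_noisy = torch.sigmoid(self.detect_out(h))   # (B, T, 1) noise presence
        mask_clean = torch.sigmoid(self.em_clean(spec))
        mask_noisy = torch.sigmoid(self.em_noisy(spec))
        # Soft frame assignment: noisy frames use the noise-conditioned EM.
        mask = p_noisy * mask_noisy + (1 - p_noisy) * mask_clean
        return mask * spec                         # enhanced magnitudes
```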
- MBI-Net: A Non-Intrusive Multi-Branched Speech Intelligibility Prediction Model for Hearing Aids [22.736703635666164]
We propose a multi-branched speech intelligibility prediction model (MBI-Net) for predicting subjective intelligibility scores of hearing aid (HA) users.
The outputs of the two branches are fused through a linear layer to obtain predicted speech intelligibility scores.
arXiv Detail & Related papers (2022-04-07T09:13:44Z)
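A sketch of that fusion step: two placeholder branch encoders produce features that a linear layer combines into a single intelligibility score. The dimensions and branch internals are assumptions, not MBI-Net's actual architecture.

```python
import torch
import torch.nn as nn

class MultiBranchFusion(nn.Module):
    """Two branch encoders (placeholders) fused by a linear layer
    into one intelligibility prediction, as described above."""
    def __init__(self, branch_a: nn.Module, branch_b: nn.Module, feat_dim=128):
        super().__init__()
        self.branch_a = branch_a                   # placeholder branch encoder
        self.branch_b = branch_b                   # placeholder branch encoder
        self.fusion = nn.Linear(2 * feat_dim, 1)   # linear fusion of branches

    def forward(self, x):
        fa = self.branch_a(x)                      # (B, feat_dim)
        fb = self.branch_b(x)                      # (B, feat_dim)
        return self.fusion(torch.cat([fa, fb], dim=-1)).squeeze(-1)
```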
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve performance comparable to the best reported supervised approach while using only 16% of the labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Personalized Speech Enhancement: New Models and Comprehensive Evaluation [27.572537325449158]
We propose two neural network models for personalized speech enhancement (PSE) that achieve superior performance to the previously proposed VoiceFilter.
We also create test sets that capture a variety of scenarios that users can encounter during video conferencing.
Our results show that the proposed models can yield better speech recognition accuracy, speech intelligibility, and perceptual quality than the baseline models.
arXiv Detail & Related papers (2021-10-18T21:21:23Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze the input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.