Speech foundation models on intelligibility prediction for
hearing-impaired listeners
- URL: http://arxiv.org/abs/2401.14289v1
- Date: Wed, 24 Jan 2024 18:26:52 GMT
- Title: Speech foundation models on intelligibility prediction for
hearing-impaired listeners
- Authors: Santiago Cuervo and Ricard Marxer
- Abstract summary: Speech foundation models (SFMs) have been benchmarked on many speech processing tasks.
We present a systematic evaluation of 10 SFMs on one such application: Speech intelligibility prediction.
We propose a simple method that learns a specialized prediction head on top of frozen SFMs to approach the problem.
- Score: 4.742307809368852
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech foundation models (SFMs) have been benchmarked on many speech
processing tasks, often achieving state-of-the-art performance with minimal
adaptation. However, the SFM paradigm has been significantly less explored for
applications of interest to the speech perception community. In this paper we
present a systematic evaluation of 10 SFMs on one such application: Speech
intelligibility prediction. We focus on the non-intrusive setup of the Clarity
Prediction Challenge 2 (CPC2), where the task is to predict the percentage of
words correctly perceived by hearing-impaired listeners from speech-in-noise
recordings. We propose a simple method that learns a lightweight specialized
prediction head on top of frozen SFMs to approach the problem. Our results
reveal statistically significant differences in performance across SFMs. Our
method resulted in the winning submission in the CPC2, demonstrating its
promise for speech perception applications.
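A minimal sketch of the recipe described above, assuming a HuggingFace wav2vec 2.0 checkpoint as the frozen SFM and an attention-pooling MLP regressor as the lightweight prediction head; the specific SFM, head architecture, and CPC2 data pipeline used in the paper may differ.

```python
# Hedged sketch: lightweight prediction head on a frozen speech foundation model.
# The SFM choice (wav2vec 2.0 via HuggingFace) and the head design are
# illustrative assumptions, not the authors' exact CPC2 configuration.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class IntelligibilityHead(nn.Module):
    """Pools frozen SFM frame features and regresses % of words correctly perceived."""
    def __init__(self, feat_dim: int, hidden: int = 256):
        super().__init__()
        self.attn = nn.Linear(feat_dim, 1)            # learned pooling weights over time
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid()        # output in [0, 1]
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.attn(feats), dim=1)    # (B, T, 1) attention over frames
        pooled = (w * feats).sum(dim=1)               # (B, D) utterance embedding
        return self.mlp(pooled).squeeze(-1) * 100.0   # predicted % words correct

sfm = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
sfm.eval()
for p in sfm.parameters():
    p.requires_grad = False                           # the SFM stays frozen

head = IntelligibilityHead(feat_dim=sfm.config.hidden_size)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)

# One illustrative training step on dummy speech-in-noise audio.
wav = torch.randn(2, 16000)                           # 2 clips, 1 s at 16 kHz
target = torch.tensor([85.0, 40.0])                   # listener % words correct
with torch.no_grad():
    feats = sfm(wav).last_hidden_state                # (B, T, D) frozen features
loss = nn.functional.mse_loss(head(feats), target)
loss.backward()
optimizer.step()
```

Only the head's parameters are updated, so comparing SFMs amounts to swapping the feature extractor and its hidden size while keeping the head and training loop fixed.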
Related papers
- On the Evaluation of Speech Foundation Models for Spoken Language Understanding [87.52911510306011]
The Spoken Language Understanding Evaluation (SLUE) suite of benchmark tasks was recently introduced to address the need for open resources and benchmarking.
The benchmark has demonstrated preliminary success in using pre-trained speech foundation models (SFM) for these SLU tasks.
We ask: which SFMs offer the most benefits for these complex SLU tasks, and what is the most effective approach for incorporating these SFMs?
arXiv Detail & Related papers (2024-06-14T14:37:52Z)
- Diffusion Conditional Expectation Model for Efficient and Robust Target Speech Extraction [73.43534824551236]
We propose an efficient generative approach named the Diffusion Conditional Expectation Model (DCEM) for Target Speech Extraction (TSE).
It can handle multi- and single-speaker scenarios in both noisy and clean conditions.
Our method outperforms conventional methods in terms of both intrusive and non-intrusive metrics.
arXiv Detail & Related papers (2023-09-25T04:58:38Z)
- Diffusion-based speech enhancement with a weighted generative-supervised learning loss [0.0]
Diffusion-based generative models have recently gained attention in speech enhancement (SE).
We propose augmenting the original diffusion training objective with a mean squared error (MSE) loss, measuring the discrepancy between estimated enhanced speech and ground-truth clean speech.
arXiv Detail & Related papers (2023-09-19T09:13:35Z)
- Inference and Denoise: Causal Inference-based Neural Speech Enhancement [83.4641575757706]
This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the noise presence as an intervention.
The proposed causal inference-based speech enhancement (CISE) separates clean and noisy frames in intervened noisy speech using a noise detector, and assigns both sets of frames to two mask-based enhancement modules (EMs) to perform noise-conditional SE.
arXiv Detail & Related papers (2022-11-02T15:03:50Z)
- MBI-Net: A Non-Intrusive Multi-Branched Speech Intelligibility Prediction Model for Hearing Aids [22.736703635666164]
We propose a multi-branched speech intelligibility prediction model (MBI-Net) for predicting subjective intelligibility scores of hearing aid (HA) users.
The outputs of the two branches are fused through a linear layer to obtain predicted speech intelligibility scores.
arXiv Detail & Related papers (2022-04-07T09:13:44Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- Ask2Mask: Guided Data Selection for Masked Speech Modeling [25.716834361963468]
Masked speech modeling (MSM) methods learn representations over speech frames which are randomly masked within an utterance.
They treat all unsupervised speech samples with equal weight, which hinders learning as not all samples have relevant information to learn meaningful representations.
We propose ask2mask (ATM), a novel approach to focus on specific samples during MSM pre-training.
arXiv Detail & Related papers (2022-02-24T17:34:54Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Personalized Speech Enhancement: New Models and Comprehensive Evaluation [27.572537325449158]
We propose two neural network models for personalized speech enhancement (PSE) that achieve superior performance to the previously proposed VoiceFilter.
We also create test sets that capture a variety of scenarios that users can encounter during video conferencing.
Our results show that the proposed models can yield better speech recognition accuracy, speech intelligibility, and perceptual quality than the baseline models.
arXiv Detail & Related papers (2021-10-18T21:21:23Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.