Predicting speech intelligibility from EEG using a dilated convolutional
network
- URL: http://arxiv.org/abs/2105.06844v1
- Date: Fri, 14 May 2021 14:12:52 GMT
- Title: Predicting speech intelligibility from EEG using a dilated convolutional
network
- Authors: Bernd Accou, Mohammad Jalilpour Monesi, Hugo Van hamme and Tom
Francart
- Abstract summary: We present a deep-learning-based model incorporating dilated convolutions that can be used to predict speech intelligibility without subject-specific training.
Our method is the first to predict the speech reception threshold from EEG for unseen subjects, contributing to objective measures of speech intelligibility.
- Score: 17.56832530408592
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Objective: Currently, only behavioral speech understanding tests are
available, which require active participation of the person. As this is
infeasible for certain populations, an objective measure of speech
intelligibility is required. Recently, brain imaging data has been used to
establish a relationship between stimulus and brain response. Linear models
have been successfully linked to speech intelligibility but require per-subject
training. We present a deep-learning-based model incorporating dilated
convolutions that can be used to predict speech intelligibility without
subject-specific (re)training. Methods: We evaluated the performance of the
model as a function of input segment length, EEG frequency band and receptive
field size while comparing it to a baseline model. Next, we evaluated
performance on held-out data and finetuning. Finally, we established a link
between the accuracy of our model and the state-of-the-art behavioral MATRIX
test. Results: The model significantly outperformed the baseline for every
input segment length (p$\leq10^{-9}$), for all EEG frequency bands except the
theta band (p$\leq0.001$) and for receptive field sizes larger than 125 ms
(p$\leq0.05$). Additionally, finetuning significantly increased the accuracy
(p$\leq0.05$) on a held-out dataset. Finally, a significant correlation
(r=0.59, p=0.0154) was found between the speech reception threshold estimated
using the behavioral MATRIX test and our objective method. Conclusion: Our
proposed dilated convolutional model can be used as a proxy for speech
intelligibility. Significance: Our method is the first to predict the speech
reception threshold from EEG for unseen subjects, contributing to objective
measures of speech intelligibility.
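To make the approach concrete, below is a minimal sketch of a dilated convolutional encoder for EEG in PyTorch, together with the standard receptive-field computation for stacked dilated convolutions. This is not the authors' exact model: the channel counts, kernel size, dilation schedule and the assumed 64 Hz sampling rate are illustrative choices only.

```python
# Hedged sketch of a dilated 1-D convolutional stack over multichannel EEG.
# Layer count, kernel size, dilation schedule and fs = 64 Hz are assumptions
# chosen only to illustrate how the receptive field grows with dilation.
import torch
import torch.nn as nn

class DilatedEEGEncoder(nn.Module):
    def __init__(self, eeg_channels=64, hidden=16, kernel=3, n_layers=4):
        super().__init__()
        self.kernel = kernel
        self.dilations = [kernel ** i for i in range(n_layers)]  # 1, 3, 9, 27
        layers, in_ch = [], eeg_channels
        for d in self.dilations:
            layers += [nn.Conv1d(in_ch, hidden, kernel, dilation=d), nn.ReLU()]
            in_ch = hidden
        self.net = nn.Sequential(*layers)

    def forward(self, x):                 # x: (batch, channels, time)
        return self.net(x)

    def receptive_field_ms(self, fs=64):
        # Receptive field in samples of stacked dilated convolutions:
        # RF = 1 + sum((kernel - 1) * dilation_l); converted to ms at fs Hz.
        rf = 1 + sum((self.kernel - 1) * d for d in self.dilations)
        return 1000.0 * rf / fs

enc = DilatedEEGEncoder()
x = torch.randn(8, 64, 320)               # eight 5 s EEG segments at 64 Hz
print(enc(x).shape)                       # (8, 16, 240): valid convs shorten time
print(enc.receptive_field_ms())           # 81 samples at 64 Hz, about 1266 ms
```

Varying the kernel size, dilation schedule or number of layers changes the receptive field, which is the quantity swept in the evaluation described above.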
Related papers
- Detecting Speech Abnormalities with a Perceiver-based Sequence Classifier that Leverages a Universal Speech Model [4.503292461488901]
We propose a Perceiver-based sequence classifier to detect abnormalities in speech reflective of several neurological disorders.
We combine this classifier with a Universal Speech Model (USM) that is trained (unsupervised) on 12 million hours of diverse audio recordings.
Our model outperforms standard transformer (80.9%) and perceiver (81.8%) models and achieves an average accuracy of 83.1%.
arXiv Detail & Related papers (2023-10-16T21:07:12Z)
- Zero-Shot Automatic Pronunciation Assessment [19.971348810774046]
We propose a novel zero-shot APA method based on the pre-trained acoustic model, HuBERT.
Experimental results on speechocean762 demonstrate that the proposed method achieves comparable performance to supervised regression baselines.
arXiv Detail & Related papers (2023-05-31T05:17:17Z)
- ASPEST: Bridging the Gap Between Active Learning and Selective Prediction [56.001808843574395]
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain.
Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples.
In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain.
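As a hedged illustration of how abstention and querying can be combined (not ASPEST's actual method; the max-softmax uncertainty measure, threshold and budget below are toy assumptions):

```python
# Toy sketch of active selective prediction: abstain on low-confidence
# predictions and spend the labeling budget on the least confident samples.
# Max-softmax confidence, tau and the budget are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=10)        # stand-in softmax outputs
confidence = probs.max(axis=1)

tau = 0.6                                         # selective-prediction threshold
abstain = confidence < tau                        # abstain when uncertain
budget = 2
query_idx = np.argsort(confidence)[:budget]       # query least confident samples
print("abstain on:", np.where(abstain)[0], "query for labels:", query_idx)
```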
arXiv Detail & Related papers (2023-04-07T23:51:07Z)
- Nearest Neighbor Zero-Shot Inference [68.56747574377215]
kNN-Prompt is a technique that uses k-nearest neighbor (kNN) retrieval augmentation for zero-shot inference with language models (LMs).
Fuzzy verbalizers leverage the sparse kNN distribution for downstream tasks by automatically associating each classification label with a set of natural language tokens.
Experiments show that kNN-Prompt is effective for domain adaptation with no further training, and that the benefits of retrieval increase with the size of the model used for kNN retrieval.
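As a hedged illustration of the mechanism, the sketch below interpolates a stand-in LM distribution with a sparse kNN distribution and aggregates label probabilities through fuzzy verbalizer token sets; the datastore, vocabulary, temperature and interpolation weight are toy assumptions, not the paper's configuration.

```python
# Hedged sketch of kNN-augmented zero-shot classification with fuzzy verbalizers.
import numpy as np

def knn_distribution(query, keys, values, vocab_size, k=4, temp=1.0):
    # keys: (N, d) datastore context embeddings; values: (N,) next-token ids
    dists = np.linalg.norm(keys - query, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = np.exp(-dists[nearest] / temp)       # softmax of negative distance
    weights /= weights.sum()
    p = np.zeros(vocab_size)
    for w, tok in zip(weights, values[nearest]):
        p[tok] += w                                # sparse: mass only on retrieved tokens
    return p

def fuzzy_verbalizer_scores(p_final, label_token_sets):
    # Aggregate probability over every token associated with each label.
    return {lab: p_final[toks].sum() for lab, toks in label_token_sets.items()}

rng = np.random.default_rng(0)
vocab, d = 10, 3
keys = rng.normal(size=(20, d))                    # toy datastore
values = rng.integers(0, vocab, size=20)
p_lm = rng.dirichlet(np.ones(vocab))               # stand-in LM next-token distribution
p_knn = knn_distribution(rng.normal(size=d), keys, values, vocab)
lam = 0.5                                          # interpolation weight (assumed)
p_final = lam * p_knn + (1 - lam) * p_lm           # kNN-LM style interpolation
print(fuzzy_verbalizer_scores(p_final, {"positive": [1, 4], "negative": [2, 7]}))
```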
arXiv Detail & Related papers (2022-05-27T07:00:59Z)
- NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality [123.97136358092585]
We develop a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset.
Specifically, we leverage a variational autoencoder (VAE) for end-to-end text to waveform generation.
Experiment evaluations on the popular LJSpeech dataset show that our proposed NaturalSpeech achieves -0.01 CMOS relative to human recordings at the sentence level.
arXiv Detail & Related papers (2022-05-09T16:57:35Z)
- MBI-Net: A Non-Intrusive Multi-Branched Speech Intelligibility Prediction Model for Hearing Aids [22.736703635666164]
We propose a multi-branched speech intelligibility prediction model (MBI-Net) for predicting subjective intelligibility scores of hearing aid (HA) users.
The outputs of the two branches are fused through a linear layer to obtain predicted speech intelligibility scores.
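A minimal sketch of that fusion step in PyTorch, assuming placeholder branches and feature dimensions (the actual MBI-Net branches are more elaborate than single linear layers):

```python
# Hedged sketch: two feature branches whose outputs are concatenated and
# passed through a linear layer to regress one intelligibility score.
# Branch architectures and dimensions are placeholder assumptions.
import torch
import torch.nn as nn

class TwoBranchFusion(nn.Module):
    def __init__(self, in_dim=128, hidden=32):
        super().__init__()
        self.branch_a = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.branch_b = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.fusion = nn.Linear(2 * hidden, 1)    # linear fusion layer

    def forward(self, feat_a, feat_b):
        h = torch.cat([self.branch_a(feat_a), self.branch_b(feat_b)], dim=-1)
        return self.fusion(h).squeeze(-1)         # predicted intelligibility score

model = TwoBranchFusion()
scores = model(torch.randn(4, 128), torch.randn(4, 128))
print(scores.shape)                               # torch.Size([4])
```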
arXiv Detail & Related papers (2022-04-07T09:13:44Z)
- Self-Normalized Importance Sampling for Neural Language Modeling [97.96857871187052]
In this work, we propose self-normalized importance sampling. Compared to our previous work, the criteria considered in this work are self-normalized, so no separate correction step is needed.
We show that our proposed self-normalized importance sampling is competitive in both research-oriented and production-oriented automatic speech recognition tasks.
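For intuition, the generic self-normalized estimator is $\hat{\mu} = \sum_i w_i f(x_i) / \sum_i w_i$ with weights $w_i = \tilde{p}(x_i)/q(x_i)$, which only requires the target density up to a normalizing constant. The toy example below is a hedged sketch of this estimator, not the paper's training criterion for neural language models.

```python
# Generic self-normalized importance sampling (SNIS) on a toy problem:
# estimate E_p[x] for p = N(2, 1) using samples from proposal q = N(0, 2),
# with p known only up to a constant. Distributions are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def p_unnorm(x):       # unnormalized target density, proportional to N(2, 1)
    return np.exp(-0.5 * (x - 2.0) ** 2)

def q_pdf(x):          # proposal density N(0, 2^2)
    return np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))

x = rng.normal(0.0, 2.0, size=100_000)
w = p_unnorm(x) / q_pdf(x)
w /= w.sum()                      # self-normalization: no correction step needed
print((w * x).sum())              # approximately E_p[x] = 2
```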
arXiv Detail & Related papers (2021-11-11T16:57:53Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Utilizing Self-supervised Representations for MOS Prediction [51.09985767946843]
Existing evaluations usually require clean references or parallel ground truth data.
Subjective tests, on the other hand, do not need any additional clean or parallel data and correlate better with human perception.
We develop an automatic evaluation approach that correlates well with human perception while not requiring ground truth data.
arXiv Detail & Related papers (2021-04-07T09:44:36Z)
- Effects of Number of Filters of Convolutional Layers on Speech Recognition Model Accuracy [6.2698513174194215]
This paper studies the effect of the number of filters in convolutional layers on the prediction accuracy of CNN+RNN models (convolutional networks combined with recurrent networks) for automatic speech recognition (ASR).
Experimental results show that adding a CNN to an RNN improves the performance of the CNN+RNN speech recognition model only when the number of CNN filters exceeds a certain threshold.
arXiv Detail & Related papers (2021-02-03T23:04:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.