Non-Intrusive Binaural Speech Intelligibility Prediction from Discrete Latent Representations
- URL: http://arxiv.org/abs/2111.12531v1
- Date: Wed, 24 Nov 2021 14:55:04 GMT
- Title: Non-Intrusive Binaural Speech Intelligibility Prediction from Discrete Latent Representations
- Authors: Alex F. McKinney, Benjamin Cauchi
- Abstract summary: Speech intelligibility (SI) prediction from binaural signals is useful in many applications.
Measures specifically designed to take into account the binaural properties of the signal are often intrusive.
This paper proposes a non-intrusive SI measure that computes features from a binaural input signal using a combination of vector quantization (VQ) and contrastive predictive coding (CPC) methods.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Non-intrusive speech intelligibility (SI) prediction from binaural signals is
useful in many applications. However, most existing signal-based measures are
designed to be applied to single-channel signals. Measures specifically
designed to take into account the binaural properties of the signal are often
intrusive - characterised by requiring access to a clean speech signal - and
typically rely on combining both channels into a single-channel signal before
making predictions. This paper proposes a non-intrusive SI measure that
computes features from a binaural input signal using a combination of vector
quantization (VQ) and contrastive predictive coding (CPC) methods. VQ-CPC
feature extraction does not rely on any model of the auditory system and is
instead trained to maximise the mutual information between the input signal and
output features. The computed VQ-CPC features are input to a predicting
function parameterized by a neural network. Two predicting functions are
considered in this paper. Both feature extractor and predicting functions are
trained on simulated binaural signals with isotropic noise. They are tested on
simulated signals with isotropic and real noise. For all signals, the ground
truth scores are the (intrusive) deterministic binaural STOI. Results are
presented in terms of correlations and MSE and demonstrate that VQ-CPC features
are able to capture information relevant to modelling SI and outperform all the
considered benchmarks - even when evaluating on data comprising different
noise field types.
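The pipeline in the abstract - quantize continuous features against a learned codebook, pool them into a predicting function, then score against binaural STOI via correlation and MSE - can be sketched as follows. This is a minimal numpy illustration under assumed shapes, with a random (untrained) codebook and an affine predictor standing in for the paper's learned VQ-CPC encoder and neural predicting functions; all names and dimensions here are hypothetical.

```python
import numpy as np

def vector_quantize(frames, codebook):
    """Map each continuous feature frame to its nearest codebook vector."""
    # frames: (T, D) encoder outputs; codebook: (K, D) code vectors
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)          # (T,) discrete latent indices
    return codebook[idx], idx

def predict_si(frames, codebook, w, b):
    """Toy predicting function: pool quantized frames, apply an affine map."""
    quantized, _ = vector_quantize(frames, codebook)
    pooled = quantized.mean(axis=0)     # (D,) utterance-level feature
    return float(pooled @ w + b)        # scalar SI estimate

# Evaluation as described in the abstract: Pearson correlation and MSE
# against ground-truth scores (synthetic placeholders here, not real
# binaural STOI values).
rng = np.random.default_rng(0)
codebook = rng.standard_normal((16, 8))
w, b = rng.standard_normal(8), 0.5
preds = np.array([predict_si(rng.standard_normal((50, 8)), codebook, w, b)
                  for _ in range(20)])
truth = rng.uniform(0.0, 1.0, size=20)  # placeholder ground-truth scores
mse = float(((preds - truth) ** 2).mean())
corr = float(np.corrcoef(preds, truth)[0, 1])
```

In the paper, the codebook and predicting function are trained (the former to maximise mutual information between input and features), so the predictions correlate with STOI; with the random parameters above, only the shapes and metric computations are meaningful.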
Related papers
- Noise-Resilient Unsupervised Graph Representation Learning via Multi-Hop Feature Quality Estimation [53.91958614666386]
We propose a novel unsupervised graph representation learning (UGRL) method for graph neural networks (GNNs), based on Multi-hop feature Quality Estimation (MQE).
arXiv Detail & Related papers (2024-07-29T12:24:28Z)
- On Designing Features for Condition Monitoring of Rotating Machines [7.830376406370754]
Various methods for designing input features have been proposed for fault recognition in rotating machines.
This article proposes a novel algorithm to design input features that unifies the feature extraction process for different time-series sensor data.
arXiv Detail & Related papers (2024-02-15T14:08:08Z)
- Complex-valued neural networks for voice anti-spoofing [1.1510009152620668]
Current anti-spoofing and audio deepfake detection systems use either magnitude spectrogram-based features (such as CQT or Mel spectrograms) or raw audio processed through convolution or sinc-layers.
This paper proposes a new approach that combines the benefits of both methods by using complex-valued neural networks to process the input audio.
Results show that this approach outperforms previous methods on the "In-the-Wild" anti-spoofing dataset and enables interpretation of the results through explainable AI.
arXiv Detail & Related papers (2023-08-22T21:49:38Z)
- Investigation of Self-supervised Pre-trained Models for Classification of Voice Quality from Speech and Neck Surface Accelerometer Signals [27.398425786898223]
This study examines simultaneously-recorded speech and NSA signals in the classification of voice quality.
The effectiveness of pre-trained models is compared in feature extraction between glottal source waveforms and raw signal waveforms for both speech and NSA inputs.
arXiv Detail & Related papers (2023-08-06T23:16:54Z)
- MBI-Net: A Non-Intrusive Multi-Branched Speech Intelligibility Prediction Model for Hearing Aids [22.736703635666164]
We propose a multi-branched speech intelligibility prediction model (MBI-Net) for predicting subjective intelligibility scores of hearing aid (HA) users.
The outputs of the two branches are fused through a linear layer to obtain predicted speech intelligibility scores.
arXiv Detail & Related papers (2022-04-07T09:13:44Z)
- DeepA: A Deep Neural Analyzer For Speech And Singing Vocoding [71.73405116189531]
We propose a neural vocoder that extracts F0 and timbre/aperiodicity encodings from the input speech, emulating those defined in conventional vocoders.
As the deep neural analyzer is learnable, it is expected to be more accurate for signal reconstruction and manipulation, and generalizable from speech to singing.
arXiv Detail & Related papers (2021-10-13T01:39:57Z)
- Training a Deep Neural Network via Policy Gradients for Blind Source Separation in Polyphonic Music Recordings [1.933681537640272]
We propose a method for the blind separation of sounds of musical instruments in audio signals.
We describe the individual tones via a parametric model, training a dictionary to capture the relative amplitudes of the harmonics.
Our algorithm yields high-quality results with particularly low interference on a variety of different audio samples.
arXiv Detail & Related papers (2021-07-09T06:17:04Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Optimal Learning with Excitatory and Inhibitory synapses [91.3755431537592]
I study the problem of storing associations between analog signals in the presence of correlations.
I characterize the typical learning performance in terms of the power spectrum of random input and output processes.
arXiv Detail & Related papers (2020-05-25T18:25:54Z)
- Data-Driven Symbol Detection via Model-Based Machine Learning [117.58188185409904]
We review a data-driven framework to symbol detection design which combines machine learning (ML) and model-based algorithms.
In this hybrid approach, well-known channel-model-based algorithms are augmented with ML-based algorithms to remove their channel-model-dependence.
Our results demonstrate that these techniques can yield near-optimal performance of model-based algorithms without knowing the exact channel input-output statistical relationship.
arXiv Detail & Related papers (2020-02-14T06:58:27Z)
- Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior [53.69310441063162]
This paper proposes a sequential prior in a discrete latent space which can generate more naturally sounding samples.
We evaluate the approach using listening tests, objective metrics of automatic speech recognition (ASR) performance, and measurements of prosody attributes.
arXiv Detail & Related papers (2020-02-06T12:35:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.