On combining acoustic and modulation spectrograms in an attention
LSTM-based system for speech intelligibility level classification
- URL: http://arxiv.org/abs/2402.02865v1
- Date: Mon, 5 Feb 2024 10:26:28 GMT
- Title: On combining acoustic and modulation spectrograms in an attention
LSTM-based system for speech intelligibility level classification
- Authors: Ascensión Gallardo-Antolín and Juan M. Montero
- Abstract summary: We present a non-intrusive system based on LSTM networks with an attention mechanism, designed for speech intelligibility prediction.
Two different strategies for the combination of per-frame acoustic log-mel and modulation spectrograms into the LSTM framework are explored.
The proposed models are evaluated with the UA-Speech database that contains dysarthric speech with different degrees of severity.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech intelligibility can be affected by multiple factors, such as noisy
environments, channel distortions or physiological issues. In this work, we
address the problem of automatically predicting the speech intelligibility
level in the latter case. Starting from our previous work, a non-intrusive
system based on LSTM networks with an attention mechanism designed for this
task, we present two main contributions. First, we propose the use of
per-frame modulation spectrograms as input features, instead of compact
representations derived from them that discard important temporal information.
Second, we explore two different strategies for combining per-frame acoustic
log-mel and modulation spectrograms within the LSTM framework: at decision
level (late fusion) and at utterance level (Weighted-Pooling, WP, fusion). The
proposed models are evaluated on the UA-Speech database, which contains
dysarthric speech with different degrees of severity. On the one hand, results
show that attentional LSTM networks are able to adequately model the
modulation spectrogram sequences, producing classification rates similar to
those obtained with log-mel spectrograms. On the other hand, both combination
strategies, late and WP fusion, outperform the single-feature systems,
suggesting that per-frame log-mel and modulation spectrograms carry
complementary information for the task of speech intelligibility prediction
that can be effectively exploited by the LSTM-based architectures, with the
system using the WP fusion strategy and Attention-Pooling achieving the best
results.
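The abstract describes, at a high level, one attention LSTM per feature stream (per-frame log-mel and modulation spectrograms), with Attention-Pooling producing an utterance-level embedding per stream and Weighted-Pooling (WP) fusion combining the streams before classification. As a rough illustration only, the PyTorch sketch below shows one plausible realization of that idea; the class names (AttentionPooling, WPFusionClassifier), feature dimensions (40 log-mel bands, 80 modulation bins), hidden size and number of intelligibility classes are assumptions made for the example and are not taken from the paper.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Collapse a per-frame LSTM output sequence into one utterance-level
    embedding using learned frame weights (Attention-Pooling)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, frames):                              # (batch, time, dim)
        weights = torch.softmax(self.score(frames), dim=1)  # (batch, time, 1)
        return (weights * frames).sum(dim=1)                # (batch, dim)

class WPFusionClassifier(nn.Module):
    """Sketch of utterance-level (WP) fusion: one LSTM branch per feature
    stream, attention-pooled embeddings concatenated before a single
    classification head. Dimensions below are illustrative assumptions."""
    def __init__(self, n_mel=40, n_mod=80, hidden=128, n_classes=3):
        super().__init__()
        self.mel_lstm = nn.LSTM(n_mel, hidden, batch_first=True)
        self.mod_lstm = nn.LSTM(n_mod, hidden, batch_first=True)
        self.mel_pool = AttentionPooling(hidden)
        self.mod_pool = AttentionPooling(hidden)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, mel, mod):           # each: (batch, time, feature_bins)
        mel_emb = self.mel_pool(self.mel_lstm(mel)[0])
        mod_emb = self.mod_pool(self.mod_lstm(mod)[0])
        fused = torch.cat([mel_emb, mod_emb], dim=-1)   # utterance-level fusion
        return self.classifier(fused)      # logits over intelligibility levels
```

A decision-level (late fusion) variant would instead keep one classifier per branch and combine the two sets of class posteriors at inference time, for example by averaging the per-branch softmax outputs.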
Related papers
- Optimizing Speech Multi-View Feature Fusion through Conditional Computation [51.23624575321469]
Self-supervised learning (SSL) features provide lightweight and versatile multi-view speech representations.
SSL features conflict with traditional spectral features like FBanks in terms of update directions.
We propose a novel generalized feature fusion framework grounded in conditional computation.
arXiv Detail & Related papers (2025-01-14T12:12:06Z)
- An Attention Long Short-Term Memory based system for automatic classification of speech intelligibility [2.404313022991873]
This work is focused on the development of an automatic non-intrusive system for predicting the speech intelligibility level.
The main contribution of our research on this topic is the use of Long Short-Term Memory networks with log-mel spectrograms as input features.
The proposed models are evaluated with the UA-Speech database that contains dysarthric speech with different degrees of severity.
arXiv Detail & Related papers (2024-02-05T10:03:28Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
By utilizing the synthesis model with the input of discrete symbols, after the prediction of discrete symbol sequence, each target speech could be re-synthesized.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- Multi-Scale Spectrogram Modelling for Neural Text-to-Speech [19.42517284981061]
We propose a novel Multi-Scale Spectrogram (MSS) modelling approach to synthesise speech with an improved coarse and fine-grained prosody.
We present details for two specific versions of MSS called Word-level MSS and Sentence-level MSS.
arXiv Detail & Related papers (2021-06-29T18:01:34Z)
- Improved MVDR Beamforming Using LSTM Speech Models to Clean Spatial Clustering Masks [14.942060304734497]
Spatial clustering techniques can achieve significant multi-channel noise reduction across relatively arbitrary microphone configurations.
LSTM neural networks have successfully been trained to recognize speech from noise on single-channel inputs, but have difficulty taking full advantage of the information in multi-channel recordings.
This paper integrates these two approaches, training LSTM speech models to clean the masks generated by the Model-based EM Source Separation and Localization (MESSL) spatial clustering method.
arXiv Detail & Related papers (2020-12-02T22:35:00Z)
- Enhancement of Spatial Clustering-Based Time-Frequency Masks using LSTM Neural Networks [3.730592618611028]
We use LSTMs to enhance spatial clustering based time-frequency masks.
We achieve both the signal modeling performance of multiple single-channel LSTM-DNN speech enhancers and the signal separation performance of spatial clustering.
We evaluate the intelligibility of the output of each system using word error rate from a Kaldi automatic speech recognizer.
arXiv Detail & Related papers (2020-12-02T22:29:29Z)
- Revisiting LSTM Networks for Semi-Supervised Text Classification via Mixed Objective Function [106.69643619725652]
We develop a training strategy that allows even a simple BiLSTM model, when trained with cross-entropy loss, to achieve competitive results.
We report state-of-the-art results for text classification task on several benchmark datasets.
arXiv Detail & Related papers (2020-09-08T21:55:22Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Audio-Visual Decision Fusion for WFST-based and seq2seq Models [3.2771898634434997]
Under noisy conditions, speech recognition systems suffer from high Word Error Rates (WER).
We propose novel methods to fuse information from audio and visual modalities at inference time.
We show that our methods give significant improvements over acoustic-only WER.
arXiv Detail & Related papers (2020-01-29T13:45:08Z)