An Attention Long Short-Term Memory based system for automatic
classification of speech intelligibility
- URL: http://arxiv.org/abs/2402.02850v1
- Date: Mon, 5 Feb 2024 10:03:28 GMT
- Title: An Attention Long Short-Term Memory based system for automatic
classification of speech intelligibility
- Authors: Miguel Fernández-Díaz and Ascensión Gallardo-Antolín
- Abstract summary: This work is focused on the development of an automatic non-intrusive system for predicting the speech intelligibility level.
The main contribution of our research on this topic is the use of Long Short-Term Memory networks with log-mel spectrograms as input features.
The proposed models are evaluated with the UA-Speech database that contains dysarthric speech with different degrees of severity.
- Score: 2.404313022991873
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech intelligibility can be degraded due to multiple factors, such as noisy
environments, technical difficulties or biological conditions. This work is
focused on the development of an automatic non-intrusive system for predicting
the speech intelligibility level in this latter case. The main contribution of
our research on this topic is the use of Long Short-Term Memory (LSTM) networks
with log-mel spectrograms as input features for this purpose. In addition, this
LSTM-based system is further enhanced by the incorporation of a simple
attention mechanism that is able to determine the frames most relevant to this
task. The proposed models are evaluated on the UA-Speech database, which
contains dysarthric speech with different degrees of severity. Results show
that the attention LSTM architecture outperforms both a reference Support
Vector Machine (SVM)-based system with hand-crafted features and an LSTM-based
system with mean-pooling.
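The attention pooling described in the abstract can be sketched in plain Python. This is a minimal illustration under our own assumptions, not the authors' implementation: the frame vectors stand in for per-frame LSTM hidden states, and the scoring vector `w` stands in for the learned attention parameters. Each frame receives a relevance score, softmax turns the scores into weights, and the utterance-level representation is the weighted sum of frames; mean-pooling is the special case of uniform weights.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(frames, w):
    """Collapse per-frame feature vectors into one utterance-level vector.

    frames: list of T feature vectors (stand-ins for LSTM hidden states).
    w: scoring vector (stand-in for learned attention parameters).
    Returns the attention-weighted sum of frames and the frame weights.
    """
    scores = [sum(wi * xi for wi, xi in zip(w, f)) for f in frames]
    alphas = softmax(scores)
    dim = len(frames[0])
    pooled = [sum(a * f[d] for a, f in zip(alphas, frames)) for d in range(dim)]
    return pooled, alphas
```

With `w = [0.0, 0.0]` every frame scores equally and the result reduces to mean-pooling, which is exactly the baseline the attention variant is compared against.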
Related papers
- Improving Membership Inference in ASR Model Auditing with Perturbed Loss Features [32.765965044767356]
Membership Inference (MI) poses a substantial privacy threat to the training data of Automatic Speech Recognition (ASR) systems.
This paper explores the effectiveness of loss-based features in combination with Gaussian and adversarial perturbations to perform MI in ASR models.
arXiv Detail & Related papers (2024-05-02T11:48:30Z)
- On combining acoustic and modulation spectrograms in an attention LSTM-based system for speech intelligibility level classification [0.0]
We present a non-intrusive system based on LSTM networks with attention mechanism designed for speech intelligibility prediction.
Two different strategies for the combination of per-frame acoustic log-mel and modulation spectrograms into the LSTM framework are explored.
The proposed models are evaluated with the UA-Speech database that contains dysarthric speech with different degrees of severity.
arXiv Detail & Related papers (2024-02-05T10:26:28Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Echotune: A Modular Extractor Leveraging the Variable-Length Nature of Speech in ASR Tasks [4.132793413136553]
We introduce Echo-MSA, a nimble module equipped with a variable-length attention mechanism.
The proposed design captures the variable length feature of speech and addresses the limitations of fixed-length attention.
arXiv Detail & Related papers (2023-09-14T14:51:51Z)
- Wider or Deeper Neural Network Architecture for Acoustic Scene Classification with Mismatched Recording Devices [59.86658316440461]
We present a robust and low-complexity system for Acoustic Scene Classification (ASC).
We first construct an ASC baseline system in which a novel inception-residual-based network architecture is proposed to deal with the mismatched recording device issue.
To further improve performance while keeping the model complexity low, we apply two techniques: an ensemble of multiple spectrograms and channel reduction.
arXiv Detail & Related papers (2022-03-23T10:27:41Z)
- MFA: TDNN with Multi-scale Frequency-channel Attention for Text-independent Speaker Verification with Short Utterances [94.70787497137854]
We propose a multi-scale frequency-channel attention (MFA) to characterize speakers at different scales through a novel dual-path design which consists of a convolutional neural network and TDNN.
We evaluate the proposed MFA on the VoxCeleb database and observe that the proposed framework with MFA can achieve state-of-the-art performance while reducing parameters and complexity.
arXiv Detail & Related papers (2022-02-03T14:57:05Z)
- Network Level Spatial Temporal Traffic State Forecasting with Hierarchical Attention LSTM (HierAttnLSTM) [0.0]
This paper leverages diverse traffic state datasets from the Caltrans Performance Measurement System (PeMS) hosted on the open benchmark.
We integrate cell and hidden states from low-level to high-level Long Short-Term Memory (LSTM) networks with an attention pooling mechanism.
The developed hierarchical structure is designed to account for dependencies across different time scales, capturing the spatial-temporal correlations of network-level traffic states.
arXiv Detail & Related papers (2022-01-15T05:25:03Z)
- Learning Spatio-Temporal Specifications for Dynamical Systems [0.757024681220677]
We propose a framework for learning spatio-temporal (ST) properties as logic specifications from data.
We introduce SVM-STL, an extension of Signal Temporal Logic (STL), capable of specifying temporal and spatial properties of a wide range of dynamical systems.
Our framework utilizes machine learning techniques to learn SVM-STL specifications from system executions given by sequences of spatial patterns.
arXiv Detail & Related papers (2021-12-20T18:03:01Z)
- Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition [58.69803243323346]
Attention-based end-to-end automatic speech recognition (ASR) systems have recently demonstrated state-of-the-art results for numerous tasks.
However, the application of self-attention and attention-based encoder-decoder models remains challenging for streaming ASR.
We present the dual causal/non-causal self-attention architecture, which, in contrast to restricted self-attention, prevents the overall context from growing beyond the look-ahead of a single layer.
arXiv Detail & Related papers (2021-07-02T20:56:13Z)
- Capturing Multi-Resolution Context by Dilated Self-Attention [58.69803243323346]
We propose a combination of restricted self-attention and a dilation mechanism, which we refer to as dilated self-attention.
The restricted self-attention allows attention to neighboring frames of the query at a high resolution, and the dilation mechanism summarizes distant information to allow attending to it with a lower resolution.
ASR results demonstrate substantial improvements compared to restricted self-attention alone, achieving similar results compared to full-sequence based self-attention with a fraction of the computational costs.
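The restricted-plus-dilated idea in this summary can be sketched in plain Python. This is a toy illustration of how the key/value set for one query frame might be assembled; the function name and the block-averaging summary are our assumptions, not the paper's actual implementation.

```python
def dilated_context(frames, t, window, dilation):
    """Assemble the key/value set for the query at position t.

    Frames within +/- window of t are kept at full resolution
    (restricted self-attention); the remaining frames are summarized
    by averaging non-overlapping blocks of length `dilation`, so
    distant information is still attended to, but at lower resolution.
    """
    # Full-resolution neighborhood around the query.
    near = [frames[i] for i in range(max(0, t - window),
                                     min(len(frames), t + window + 1))]
    # Coarse summaries of everything outside the neighborhood.
    far = []
    for start in range(0, len(frames), dilation):
        block = [frames[i]
                 for i in range(start, min(start + dilation, len(frames)))
                 if abs(i - t) > window]
        if block:
            dim = len(block[0])
            far.append([sum(f[d] for f in block) / len(block)
                        for d in range(dim)])
    return near + far
```

The query then attends over `near + far` instead of the full sequence, which is how the cost stays at a fraction of full-sequence self-attention.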
arXiv Detail & Related papers (2021-04-07T02:04:18Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.