Probing the Information Encoded in Neural-based Acoustic Models of
Automatic Speech Recognition Systems
- URL: http://arxiv.org/abs/2402.19443v1
- Date: Thu, 29 Feb 2024 18:43:53 GMT
- Title: Probing the Information Encoded in Neural-based Acoustic Models of
Automatic Speech Recognition Systems
- Authors: Quentin Raymondaud, Mickael Rouvier, Richard Dufour
- Abstract summary: This article proposes a protocol to determine which information is encoded, and where, in an automatic speech recognition acoustic model (AM).
Experiments are performed on speaker verification, acoustic environment classification, gender classification, tempo-distortion detection and speech sentiment/emotion identification.
Analysis showed that neural-based AMs hold heterogeneous information that seems surprisingly uncorrelated with phoneme recognition.
- Score: 7.207019635697126
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Deep learning architectures have brought significant performance
progress to many research areas. The automatic speech recognition (ASR) field
has benefited from these scientific and technological advances, particularly
for acoustic modeling, which now integrates deep neural network architectures.
However, these performance gains have come with increased complexity in the
information learned and conveyed through these black-box architectures.
Following a large body of research on neural network interpretability, we
propose in this article a protocol that aims to determine which information is
encoded in an ASR acoustic model (AM), and where. To do so, we evaluate the AM
on a fixed set of probing tasks using intermediate representations extracted
at different layer depths. From the performance variations across the targeted
tasks, we can formulate hypotheses about which information is enhanced or
attenuated at each stage of the architecture. Experiments are performed on
speaker verification, acoustic environment classification, gender
classification, tempo-distortion detection and speech sentiment/emotion
identification. The analysis shows that neural-based AMs hold heterogeneous
information, such as emotion, sentiment or speaker identity, that seems
surprisingly uncorrelated with phoneme recognition. The low-level hidden
layers appear globally useful for structuring information, while the upper
ones tend to discard information that is useless for phoneme recognition.
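A minimal sketch of this layer-wise probing protocol, in the spirit of the
paper rather than its exact pipeline: frozen intermediate representations feed
a lightweight classifier per task, and held-out accuracy is compared across
layers. The random placeholder data and the probe_layer helper below are
illustrative assumptions.

```python
# Layer-wise probing sketch: train a simple classifier on frozen
# intermediate representations and compare accuracy across layers.
# Placeholder activations stand in for features extracted from a
# trained acoustic model; they are NOT real experimental data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_layer(features, labels, seed=0):
    """Fit a linear probe on one layer's utterance-level features."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.2, random_state=seed, stratify=labels)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)  # held-out accuracy

# Placeholder: 500 utterances, 6 layers of 256-dim mean-pooled activations.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)  # e.g. binary gender labels
layer_reps = {f"layer_{i}": rng.normal(size=(500, 256)) for i in range(6)}

# Accuracy per layer: peaks suggest where the probed information is encoded.
for name, feats in layer_reps.items():
    print(name, round(probe_layer(feats, labels), 3))
```

In a real run, layer_reps would hold activations extracted from each hidden
layer of the trained AM, and labels would come from the probing task (speaker,
gender, emotion, tempo distortion, etc.).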
Related papers
- SONAR: A Synthetic AI-Audio Detection Framework and Benchmark [59.09338266364506]
SONAR aims to provide a comprehensive evaluation framework for distinguishing cutting-edge AI-synthesized auditory content from genuine audio.
It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation-model-based deepfake detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
- What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection [53.063161380423715]
Existing detection models have shown remarkable success in discriminating known deepfake audio, but struggle when encountering new attack types.
We propose a continual learning approach called Radian Weight Modification (RWM) for audio deepfake detection.
arXiv Detail & Related papers (2023-12-15T09:52:17Z)
- Deep Neural Networks for Automatic Speaker Recognition Do Not Learn Supra-Segmental Temporal Features [2.724035499453558]
We present and apply a novel test to quantify the extent to which the performance of state-of-the-art neural networks for speaker recognition can be explained by their modeling of supra-segmental temporal (SST) features.
We find that a variety of CNN- and RNN-based neural network architectures for speaker recognition do not model SST to any sufficient degree, even when forced.
arXiv Detail & Related papers (2023-11-01T12:45:31Z)
- Insights on Neural Representations for End-to-End Speech Recognition [28.833851817220616]
End-to-end automatic speech recognition (ASR) models aim to learn a generalised speech representation.
Network similarity analyses based on correlation techniques had not previously been explored for end-to-end ASR models.
This paper analyses the internal dynamics between layers during training for CNN-, LSTM- and Transformer-based approaches (a generic similarity sketch follows this entry).
arXiv Detail & Related papers (2022-05-19T10:19:32Z)
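Linear Centered Kernel Alignment (CKA) is one widely used correlation-analysis
technique for comparing layer representations; the sketch below is a generic
illustration, not necessarily the exact method used in the paper above.

```python
# Linear CKA between two representation matrices of shape (n_examples, dim).
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment; 1.0 means identical geometry."""
    X = X - X.mean(axis=0)                      # center each feature
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2  # cross-covariance energy
    return hsic / (np.linalg.norm(X.T @ X, "fro") *
                   np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 128))        # placeholder layer activations
B = A @ rng.normal(size=(128, 128))    # a linear transform of A
print(linear_cka(A, A))                # 1.0 for identical representations
print(linear_cka(A, B))                # below 1.0 for a transformed copy
```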
We propose a new knowledge distillation method designed to incorporate the temporal knowledge embedded in attention weights of large models to on-device models.
Our proposed method improves the predictive performance across diverse on-device architectures.
arXiv Detail & Related papers (2021-10-27T02:29:54Z)
- Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
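A hedged sketch of a distillation objective in this spirit: the usual
temperature-scaled soft-label term plus a term matching the student's
attention weights to the teacher's over time frames. Tensor names, shapes and
weighting coefficients are assumptions, not the paper's exact formulation.

```python
# Soft-label KD plus an attention-matching term over time frames.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_attn, teacher_attn,
                      temperature=4.0, alpha=0.5, beta=0.1):
    # KL divergence between temperature-scaled class distributions.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean") * temperature ** 2
    # Temporal term: match attention maps of shape (batch, frames, frames).
    attn = F.mse_loss(student_attn, teacher_attn)
    return alpha * kd + beta * attn

# Placeholder shapes: batch of 8, 50 time frames, 10 classes.
s_log, t_log = torch.randn(8, 10), torch.randn(8, 10)
s_att, t_att = torch.rand(8, 50, 50), torch.rand(8, 50, 50)
print(distillation_loss(s_log, t_log, s_att, t_att))
```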
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation (a masking sketch follows this entry).
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
- Speech Command Recognition in Computationally Constrained Environments with a Quadratic Self-organized Operational Layer [92.37382674655942]
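Spectrogram augmentation of the kind mentioned above is commonly implemented
as SpecAugment-style time/frequency masking; the sketch below is a generic
version with illustrative mask sizes, not the paper's exact policy.

```python
# Randomly zero out one frequency band and one time span of a spectrogram.
import numpy as np

def augment_spectrogram(spec, max_freq_mask=8, max_time_mask=20, seed=None):
    """Apply one random frequency mask and one random time mask."""
    rng = np.random.default_rng(seed)
    spec = spec.copy()
    n_freq, n_time = spec.shape
    f = rng.integers(0, max_freq_mask + 1)   # mask width in mel bins
    f0 = rng.integers(0, n_freq - f + 1)
    spec[f0:f0 + f, :] = 0.0
    t = rng.integers(0, max_time_mask + 1)   # mask length in frames
    t0 = rng.integers(0, n_time - t + 1)
    spec[:, t0:t0 + t] = 0.0
    return spec

mel = np.random.rand(80, 300)                  # placeholder mel spectrogram
print(augment_spectrogram(mel, seed=0).shape)  # (80, 300), with masked bands
```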
We propose a network layer to enhance the speech command recognition capability of a lightweight network.
The employed method borrows the ideas of Taylor expansion and quadratic forms to construct a richer representation of features in both the input and hidden layers (a minimal sketch follows this entry).
This richer representation improves recognition accuracy, as shown by extensive experiments on the Google Speech Commands (GSC) and Synthetic Speech Commands (SSC) datasets.
arXiv Detail & Related papers (2020-11-23T14:40:18Z)
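A minimal sketch of a quadratic, Taylor-expansion-style layer: alongside the
usual linear term, a second weight matrix acts on the element-wise square of
the input. This simplified stand-in only gestures at the paper's quadratic
self-organized operational layer, which is more elaborate.

```python
# Second-order (quadratic) layer: y = W1 x + b + W2 (x * x).
import torch
import torch.nn as nn

class QuadraticLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)                 # first-order term
        self.quadratic = nn.Linear(in_dim, out_dim, bias=False)  # second-order term

    def forward(self, x):
        # Two-term Taylor-like expansion of the input features.
        return self.linear(x) + self.quadratic(x * x)

layer = QuadraticLayer(40, 64)
print(layer(torch.randn(8, 40)).shape)  # torch.Size([8, 64])
```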
- AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach for speaker recognition, named AutoSpeech.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while enjoying lower model complexity.
arXiv Detail & Related papers (2020-05-07T02:53:47Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved strong performance under controlled conditions.
Speaker verification on short utterances in uncontrolled, noisy environments is one of the most challenging and most in-demand tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
- AudioMNIST: Exploring Explainable Artificial Intelligence for Audio Analysis on a Simple Benchmark [12.034688724153044]
This paper explores post-hoc explanations for deep neural networks in the audio domain.
We present a novel open-source audio dataset consisting of 30,000 audio samples of spoken English digits.
We demonstrate the superior interpretability of audible explanations over visual ones in a human user study.
arXiv Detail & Related papers (2018-07-09T23:11:17Z)