Visualizing Automatic Speech Recognition -- Means for a Better
Understanding?
- URL: http://arxiv.org/abs/2202.00673v1
- Date: Tue, 1 Feb 2022 13:35:08 GMT
- Title: Visualizing Automatic Speech Recognition -- Means for a Better
Understanding?
- Authors: Karla Markert and Romain Parracone and Mykhailo Kulakov and Philip
Sperl and Ching-Yu Kao and Konstantin B\"ottinger
- Abstract summary: We show how attribution methods, that we import from image recognition and suitably adapt to handle audio data, can help to clarify the working of ASR.
Taking DeepSpeech, an end-to-end model for ASR, as a case study, we show how these techniques help to visualize which features of the input are the most influential in determining the output.
- Score: 0.1868368163807795
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic speech recognition (ASR) is improving ever more at mimicking human
speech processing. The functioning of ASR, however, remains to a large extent
obfuscated by the complex structure of the deep neural networks (DNNs) they are
based on. In this paper, we show how so-called attribution methods, that we
import from image recognition and suitably adapt to handle audio data, can help
to clarify the working of ASR. Taking DeepSpeech, an end-to-end model for ASR,
as a case study, we show how these techniques help to visualize which features
of the input are the most influential in determining the output. We focus on
three visualization techniques: Layer-wise Relevance Propagation (LRP),
Saliency Maps, and Shapley Additive Explanations (SHAP). We compare these
methods and discuss potential further applications, such as in the detection of
adversarial examples.
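As a rough illustration of what these attribution methods compute, the sketch below implements a plain gradient-based saliency map (the simplest of the three techniques discussed) for a generic PyTorch ASR model. The model `asr_model`, the spectrogram shapes, and the choice of scoring the top logit per frame are illustrative assumptions, not the paper's exact DeepSpeech setup.

```python
# Minimal sketch of a gradient-based saliency map for an ASR model.
# Assumes a generic PyTorch model mapping a spectrogram (time x features)
# to per-frame character logits; shapes and model are hypothetical.
import torch

def saliency_map(asr_model: torch.nn.Module, spectrogram: torch.Tensor) -> torch.Tensor:
    """Return |d(score)/d(input)| per time-frequency bin."""
    features = spectrogram.detach().clone().requires_grad_(True)  # (T, F)
    logits = asr_model(features.unsqueeze(0))                     # (1, T, num_chars)
    # Sum the top (predicted) logit at every frame and backpropagate,
    # so the gradient reflects what drives the transcription as a whole.
    score = logits.max(dim=-1).values.sum()
    score.backward()
    return features.grad.abs()                                    # same shape as input

# Usage (illustrative): bins with large saliency are the input regions the
# model relied on most; plotting them alongside the spectrogram visualizes
# which acoustic features influenced the predicted transcript.
# saliency = saliency_map(my_model, my_spectrogram)
```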
Related papers
- Robust Audiovisual Speech Recognition Models with Mixture-of-Experts [67.75334989582709]
We introduce EVA, which leverages a mixture-of-experts design for audio-visual ASR to perform robust speech recognition on "in-the-wild" videos.
We first encode visual information into a visual token sequence and map it into the speech space with a lightweight projection.
Experiments show our model achieves state-of-the-art results on three benchmarks.
arXiv Detail & Related papers (2024-09-19T00:08:28Z)
- Large Language Models Are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities.
We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities.
We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.81% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
- Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping [4.271091833712731]
We propose a simple approach, named Lip2Vec, that is based on learning a prior model.
The proposed model compares favorably with fully-supervised learning methods on the LRS3 dataset, achieving a WER of 26%.
We believe that reprogramming the VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading.
arXiv Detail & Related papers (2023-08-11T12:59:02Z)
- Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation [83.36685075570232]
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
We explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
The proposed integration of TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z)
- Content-Context Factorized Representations for Automated Speech Recognition [12.618527387900079]
We introduce an unsupervised, encoder-agnostic method for factoring speech-encoder representations into explicit content-encoding representations and spurious context-encoding representations.
We demonstrate improved performance on standard ASR benchmarks, as well as improved performance in both real-world and artificially noisy ASR scenarios.
arXiv Detail & Related papers (2022-05-19T21:34:40Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- An Attribute-Aligned Strategy for Learning Speech Representation [57.891727280493015]
We propose an attribute-aligned learning strategy to derive a speech representation that can flexibly address these issues via an attribute-selection mechanism.
Specifically, we propose a layered-representation variational autoencoder (LR-VAE), which factorizes speech representation into attribute-sensitive nodes.
Our proposed method achieves competitive performance on identity-free SER and better performance on emotionless SV.
arXiv Detail & Related papers (2021-06-05T06:19:14Z)
- On the Impact of Word Error Rate on Acoustic-Linguistic Speech Emotion Recognition: An Update for the Deep Learning Era [0.0]
We create transcripts from the original speech by applying three modern ASR systems.
For extraction and learning of acoustic speech features, we utilise openSMILE, openXBoW, DeepSpectrum, and auDeep.
We achieve state-of-the-art unweighted average recall values of 73.6% and 73.8% on the speaker-independent development and test partitions of IEMOCAP.
arXiv Detail & Related papers (2021-04-20T17:10:01Z)
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)
- Gradient-Adjusted Neuron Activation Profiles for Comprehensive Introspection of Convolutional Speech Recognition Models [1.6752182911522515]
We introduce Gradient-adjusted Neuron Activation Profiles (GradNAPs) as a means to interpret features and representations in Deep Neural Networks.
GradNAPs are characteristic responses of ANNs to particular groups of inputs, which incorporate the relevance of neurons for prediction.
We show how to utilize GradNAPs to gain insight about how data is processed in ANNs.
arXiv Detail & Related papers (2020-02-19T11:59:36Z)
- Deep Representation Learning in Speech Processing: Challenges, Recent Advances, and Future Trends [10.176394550114411]
The main contribution of this paper is to present an up-to-date and comprehensive survey on different techniques of speech representation learning.
Recent reviews in speech have been conducted for ASR, SR, and SER; however, none of these has focused on representation learning from speech.
arXiv Detail & Related papers (2020-01-02T10:12:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.