AudioMNIST: Exploring Explainable Artificial Intelligence for Audio
Analysis on a Simple Benchmark
- URL: http://arxiv.org/abs/1807.03418v3
- Date: Mon, 27 Nov 2023 18:26:32 GMT
- Title: AudioMNIST: Exploring Explainable Artificial Intelligence for Audio
Analysis on a Simple Benchmark
- Authors: S\"oren Becker, Johanna Vielhaben, Marcel Ackermann, Klaus-Robert
M\"uller, Sebastian Lapuschkin, Wojciech Samek
- Abstract summary: This paper explores post-hoc explanations for deep neural networks in the audio domain.
We present a novel Open Source audio dataset consisting of 30,000 audio samples of English spoken digits.
We demonstrate the superior interpretability of audible explanations over visual ones in a human user study.
- Score: 12.034688724153044
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Explainable Artificial Intelligence (XAI) is targeted at understanding how
models perform feature selection and derive their classification decisions.
This paper explores post-hoc explanations for deep neural networks in the audio
domain. Notably, we present a novel Open Source audio dataset consisting of
30,000 audio samples of English spoken digits which we use for classification
tasks on spoken digits and speakers' biological sex. We use the popular XAI
technique Layer-wise Relevance Propagation (LRP) to identify relevant features
for two neural network architectures that process either waveform or
spectrogram representations of the data. Based on the relevance scores obtained
from LRP, hypotheses about the neural networks' feature selection are derived
and subsequently tested through systematic manipulations of the input data.
Further, we take a step beyond visual explanations and introduce audible
heatmaps. We demonstrate the superior interpretability of audible explanations
over visual ones in a human user study.
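For orientation, a minimal loading sketch follows. It assumes the per-speaker folder layout and {digit}_{speaker}_{repetition}.wav naming of the paper's companion repository (https://github.com/soerenab/AudioMNIST); the padding length is an illustrative choice, not the paper's exact preprocessing.

```python
# Minimal AudioMNIST loading sketch. Assumes the layout of the companion
# repository (https://github.com/soerenab/AudioMNIST): WAV files named
# {digit}_{speaker}_{repetition}.wav inside per-speaker folders; the
# padding length is an illustrative choice, not the paper's preprocessing.
from pathlib import Path

import numpy as np
from scipy.io import wavfile

def load_audiomnist(root="AudioMNIST/data", target_len=48000):
    """Return zero-padded waveforms plus digit and speaker labels."""
    waveforms, digits, speakers = [], [], []
    for wav_path in sorted(Path(root).glob("*/*.wav")):
        digit, speaker, _ = wav_path.stem.split("_")   # e.g. "7_03_21"
        _, samples = wavfile.read(wav_path)
        samples = samples.astype(np.float32)
        samples /= np.abs(samples).max() + 1e-9        # peak-normalize
        padded = np.zeros(target_len, dtype=np.float32)
        n = min(len(samples), target_len)
        padded[:n] = samples[:n]
        waveforms.append(padded)
        digits.append(int(digit))
        speakers.append(int(speaker))
    return np.stack(waveforms), np.array(digits), np.array(speakers)

# X, y_digit, y_speaker = load_audiomnist()
# Sex labels for the second classification task come from the speaker
# metadata file shipped with the repository.
```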
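The LRP step and the audible heatmaps can be illustrated together with a toy epsilon-rule implementation. The two-layer network, its sizes, and the relevance-to-audio weighting below are assumptions for the sketch; they stand in for, and are not, the paper's networks or its exact propagation rules.

```python
# Toy epsilon-LRP sketch plus an "audible heatmap": the two-layer ReLU
# network, its sizes, and the relevance-to-audio weighting are illustrative
# assumptions, not the paper's architecture or exact LRP variant.
import numpy as np

rng = np.random.default_rng(0)

# 8000-sample waveform -> 64 hidden units -> 10 digit classes.
W1, b1 = rng.normal(0.0, 0.01, (64, 8000)), np.zeros(64)
W2, b2 = rng.normal(0.0, 0.10, (10, 64)), np.zeros(10)

def forward(x):
    a1 = np.maximum(0.0, W1 @ x + b1)   # hidden ReLU activations
    z2 = W2 @ a1 + b2                   # class logits
    return a1, z2

def stabilize(z, eps=1e-6):
    return z + eps * np.where(z >= 0.0, 1.0, -1.0)

def lrp_epsilon(x, target):
    """Propagate the target logit back to the input (epsilon rule)."""
    a1, z2 = forward(x)
    R2 = np.zeros_like(z2)
    R2[target] = z2[target]             # start from the target logit only
    R1 = a1 * (W2.T @ (R2 / stabilize(z2)))
    R0 = x * (W1.T @ (R1 / stabilize(W1 @ x + b1)))
    return R0                           # one relevance score per sample

x = rng.normal(0.0, 0.1, 8000)          # stand-in for a spoken-digit clip
_, logits = forward(x)
R = lrp_epsilon(x, target=int(np.argmax(logits)))

# Audible heatmap: re-weight the waveform by normalized positive relevance,
# so that listening to the result emphasizes what the model relied on.
w = np.clip(R, 0.0, None)
audible = x * (w / (w.max() + 1e-12))
# scipy.io.wavfile.write("heatmap.wav", 8000, audible.astype(np.float32))
```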
Related papers
- SONAR: A Synthetic AI-Audio Detection Framework and Benchmark [59.09338266364506]
SONAR is a synthetic AI-audio detection framework and benchmark that aims to provide a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content.
It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based deepfake detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
- Explaining Spectrograms in Machine Learning: A Study on Neural Networks for Speech Classification [2.4472308031704073]
This study investigates discriminative patterns learned by neural networks for accurate speech classification.
By examining the activations and features of neural networks for vowel classification, we gain insights into what the networks "see" in spectrograms.
arXiv Detail & Related papers (2024-07-10T07:37:18Z)
- Probing the Information Encoded in Neural-based Acoustic Models of Automatic Speech Recognition Systems [7.207019635697126]
This article aims to determine what information is encoded, and where, in an automatic speech recognition acoustic model (AM).
Experiments are performed on speaker verification, acoustic environment classification, gender classification, tempo-distortion detection systems and speech sentiment/emotion identification.
Analysis showed that neural-based AMs hold heterogeneous information that seems surprisingly uncorrelated with phoneme recognition.
arXiv Detail & Related papers (2024-02-29T18:43:53Z)
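For context on the probing methodology referenced above: the acoustic model is typically frozen and a small classifier is trained on its activations to test whether a property is decodable. A minimal sketch follows, in which the activations and labels are hypothetical stand-ins for real AM features.

```python
# Minimal probing-classifier sketch: train a linear probe on frozen
# acoustic-model activations to test whether a property (e.g. gender,
# tempo distortion) is linearly decodable from a given layer.
# `layer_activations` and `labels` are hypothetical placeholders here.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
layer_activations = rng.normal(size=(1000, 256))  # one vector per utterance
labels = rng.integers(0, 2, size=1000)            # property to decode

X_tr, X_te, y_tr, y_te = train_test_split(
    layer_activations, labels, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# Probe accuracy well above chance suggests the layer encodes the property.
print(f"probe accuracy: {probe.score(X_te, y_te):.3f}")
```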
- XAI-based Comparison of Input Representations for Audio Event Classification [10.874097312428235]
We leverage eXplainable AI (XAI) to understand the underlying classification strategies of models trained on different input representations.
Specifically, we compare two model architectures with regard to relevant input features used for Audio Event Detection.
arXiv Detail & Related papers (2023-04-27T08:30:07Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in the human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- Deep Neural Convolutive Matrix Factorization for Articulatory Representation Decomposition [48.56414496900755]
This work uses a neural implementation of convolutive sparse matrix factorization to decompose the articulatory data into interpretable gestures and gestural scores.
Phoneme recognition experiments were additionally performed to show that gestural scores indeed code phonological information successfully.
arXiv Detail & Related papers (2022-04-01T14:25:19Z)
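As background for the entry above, the classical convolutive NMF decomposition that the paper implements neurally can be sketched with standard multiplicative updates; the shapes, rank, and update scheme below are a generic sketch under that assumption, not the authors' network.

```python
# Convolutive NMF sketch: V (features x time) ~ sum_t W[t] @ shift(H, t),
# where the W[t] play the role of "gestures" and H of "gestural scores".
# Generic multiplicative KL updates in NumPy; not the paper's neural model.
import numpy as np

def shift(X, t):
    """Shift columns of X by t frames (t > 0: right, t < 0: left)."""
    out = np.zeros_like(X)
    if t == 0:
        out[:] = X
    elif t > 0:
        out[:, t:] = X[:, :-t]
    else:
        out[:, :t] = X[:, -t:]
    return out

def cnmf(V, K=5, T=8, n_iter=200, eps=1e-9):
    F, N = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((T, F, K)) + eps      # T time-shifted basis matrices
    H = rng.random((K, N)) + eps         # shared activations over time
    ones = np.ones_like(V)
    for _ in range(n_iter):
        R = V / (sum(W[t] @ shift(H, t) for t in range(T)) + eps)
        # Update H, pooling numerator/denominator over all time shifts.
        H *= sum(W[t].T @ shift(R, -t) for t in range(T)) / (
            sum(W[t].T @ ones for t in range(T)) + eps
        )
        R = V / (sum(W[t] @ shift(H, t) for t in range(T)) + eps)
        for t in range(T):
            W[t] *= (R @ shift(H, t).T) / (ones @ shift(H, t).T + eps)
    return W, H

# W, H = cnmf(np.random.default_rng(1).random((20, 300)))  # toy data
```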
- Deep Learning For Prominence Detection In Children's Read Speech [13.041607703862724]
We present a system that operates on segmented speech waveforms to learn features relevant to prominent word detection for children's oral fluency assessment.
The chosen CRNN (convolutional recurrent neural network) framework, incorporating both word-level features and sequence information, is found to benefit from the perceptually motivated SincNet filters.
arXiv Detail & Related papers (2021-10-27T08:51:42Z)
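For context on the SincNet filters mentioned above: each first-layer kernel is constrained to a band-pass filter built from two sinc functions, so only the two cutoff frequencies per kernel are learned. A minimal, non-learned sketch of one such kernel follows; cutoffs, kernel length, and window are illustrative choices.

```python
# SincNet-style band-pass kernel sketch: the impulse response of an ideal
# band-pass filter is the difference of two low-pass sinc prototypes;
# SincNet learns only the two cutoffs per kernel. Values are illustrative.
import numpy as np

def sinc_bandpass_kernel(f_low, f_high, kernel_len=251, sample_rate=16000):
    """Band-pass FIR kernel with cutoffs in Hz (f_low < f_high)."""
    t = (np.arange(kernel_len) - (kernel_len - 1) / 2) / sample_rate
    # Difference of two low-pass prototypes 2f*sinc(2f*t).
    kernel = (2 * f_high * np.sinc(2 * f_high * t)
              - 2 * f_low * np.sinc(2 * f_low * t))
    kernel *= np.hamming(kernel_len)       # smooth the truncation
    return kernel / np.abs(kernel).sum()   # crude normalization

# Filtering a waveform with one such kernel:
# band = np.convolve(waveform, sinc_bandpass_kernel(300.0, 3000.0), mode="same")
```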
- DeepA: A Deep Neural Analyzer For Speech And Singing Vocoding [71.73405116189531]
We propose a neural vocoder that extracts F0 and timbre/aperiodicity encodings from the input speech, emulating those defined in conventional vocoders.
As the deep neural analyzer is learnable, it is expected to be more accurate for signal reconstruction and manipulation, and generalizable from speech to singing.
arXiv Detail & Related papers (2021-10-13T01:39:57Z)
- What do End-to-End Speech Models Learn about Speaker, Language and Channel Information? A Layer-wise and Neuron-level Analysis [16.850888973106706]
We conduct a post-hoc functional interpretability analysis of pretrained speech models using the probing framework.
We analyze utterance-level representations of speech models trained for various tasks such as speaker recognition and dialect identification.
Our results reveal several novel findings, including: i) channel and gender information is distributed across the network, ii) the information is redundantly available in neurons with respect to a task, and iii) complex properties such as dialectal information are encoded only in the task-oriented pretrained network.
arXiv Detail & Related papers (2021-07-01T13:32:55Z)
- AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose AutoSpeech, the first neural architecture search approach for speaker recognition tasks.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while enjoying lower model complexity.
arXiv Detail & Related papers (2020-05-07T02:53:47Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled, noisy environments is one of the most challenging and most in-demand tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.