Explaining Spectrograms in Machine Learning: A Study on Neural Networks for Speech Classification
- URL: http://arxiv.org/abs/2407.17416v1
- Date: Wed, 10 Jul 2024 07:37:18 GMT
- Title: Explaining Spectrograms in Machine Learning: A Study on Neural Networks for Speech Classification
- Authors: Jesin James, Balamurali B. T., Binu Abeysinghe, Junchen Liu,
- Abstract summary: This study investigates discriminative patterns learned by neural networks for accurate speech classification.
By examining the activations and features of neural networks for vowel classification, we gain insights into what the networks "see" in spectrograms.
- Score: 2.4472308031704073
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study investigates discriminative patterns learned by neural networks for accurate speech classification, with a specific focus on vowel classification tasks. By examining the activations and features of neural networks for vowel classification, we gain insights into what the networks "see" in spectrograms. Through the use of class activation mapping, we identify the frequencies that contribute to vowel classification and compare these findings with linguistic knowledge. Experiments on a American English dataset of vowels showcases the explainability of neural networks and provides valuable insights into the causes of misclassifications and their characteristics when differentiating them from unvoiced speech. This study not only enhances our understanding of the underlying acoustic cues in vowel classification but also offers opportunities for improving speech recognition by bridging the gap between abstract representations in neural networks and established linguistic knowledge
Related papers
- Acoustic characterization of speech rhythm: going beyond metrics with
recurrent neural networks [0.0]
We train a recurrent neural network on a language identification task over a large database of speech recordings in 21 languages.
The network was able to identify the language of 10-second recordings in 40% of the cases, and the language was in the top-3 guesses in two-thirds of the cases.
arXiv Detail & Related papers (2024-01-22T09:49:44Z) - Seeing in Words: Learning to Classify through Language Bottlenecks [59.97827889540685]
Humans can explain their predictions using succinct and intuitive descriptions.
We show that a vision model whose feature representations are text can effectively classify ImageNet images.
arXiv Detail & Related papers (2023-06-29T00:24:42Z) - Initial Study into Application of Feature Density and
Linguistically-backed Embedding to Improve Machine Learning-based
Cyberbullying Detection [54.83707803301847]
The research was conducted on a Formspring dataset provided in a Kaggle competition on automatic cyberbullying detection.
The study confirmed the effectiveness of Neural Networks in cyberbullying detection and the correlation between classifier performance and Feature Density.
arXiv Detail & Related papers (2022-06-04T03:17:15Z) - Deep Neural Convolutive Matrix Factorization for Articulatory
Representation Decomposition [48.56414496900755]
This work uses a neural implementation of convolutive sparse matrix factorization to decompose the articulatory data into interpretable gestures and gestural scores.
Phoneme recognition experiments were additionally performed to show that gestural scores indeed code phonological information successfully.
arXiv Detail & Related papers (2022-04-01T14:25:19Z) - Deep Learning For Prominence Detection In Children's Read Speech [13.041607703862724]
We present a system that operates on segmented speech waveforms to learn features relevant to prominent word detection for children's oral fluency assessment.
The chosen CRNN (convolutional recurrent neural network) framework, incorporating both word-level features and sequence information, is found to benefit from the perceptually motivated SincNet filters.
arXiv Detail & Related papers (2021-10-27T08:51:42Z) - Perception Point: Identifying Critical Learning Periods in Speech for
Bilingual Networks [58.24134321728942]
We compare and identify cognitive aspects on deep neural-based visual lip-reading models.
We observe a strong correlation between these theories in cognitive psychology and our unique modeling.
arXiv Detail & Related papers (2021-10-13T05:30:50Z) - General-Purpose Speech Representation Learning through a Self-Supervised
Multi-Granularity Framework [114.63823178097402]
This paper presents a self-supervised learning framework, named MGF, for general-purpose speech representation learning.
Specifically, we propose to use generative learning approaches to capture fine-grained information at small time scales and use discriminative learning approaches to distill coarse-grained or semantic information at large time scales.
arXiv Detail & Related papers (2021-02-03T08:13:21Z) - Understanding the Role of Individual Units in a Deep Neural Network [85.23117441162772]
We present an analytic framework to systematically identify hidden units within image classification and image generation networks.
First, we analyze a convolutional neural network (CNN) trained on scene classification and discover units that match a diverse set of object concepts.
Second, we use a similar analytic method to analyze a generative adversarial network (GAN) model trained to generate scenes.
arXiv Detail & Related papers (2020-09-10T17:59:10Z) - Generative Adversarial Phonology: Modeling unsupervised phonetic and
phonological learning with neural networks [0.0]
Training deep neural networks on well-understood dependencies in speech data can provide new insights into how they learn internal representations.
This paper argues that acquisition of speech can be modeled as a dependency between random space and generated speech data in the Generative Adversarial Network architecture.
We propose a methodology to uncover the network's internal representations that correspond to phonetic and phonological properties.
arXiv Detail & Related papers (2020-06-06T20:31:23Z) - AudioMNIST: Exploring Explainable Artificial Intelligence for Audio
Analysis on a Simple Benchmark [12.034688724153044]
This paper explores post-hoc explanations for deep neural networks in the audio domain.
We present a novel Open Source audio dataset consisting of 30,000 audio samples of English spoken digits.
We demonstrate the superior interpretability of audible explanations over visual ones in a human user study.
arXiv Detail & Related papers (2018-07-09T23:11:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.