Interpreting intermediate convolutional layers of CNNs trained on raw
speech
- URL: http://arxiv.org/abs/2104.09489v2
- Date: Wed, 21 Apr 2021 17:43:29 GMT
- Title: Interpreting intermediate convolutional layers of CNNs trained on raw
speech
- Authors: Gašper Beguš and Alan Zhou
- Abstract summary: We show that averaging over feature maps after ReLU activation in each convolutional layer yields interpretable time-series data.
The proposed technique enables acoustic analysis of intermediate convolutional layers.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a technique to interpret and visualize intermediate
layers in CNNs trained on raw speech data in an unsupervised manner. We show
that averaging over feature maps after ReLU activation in each convolutional
layer yields interpretable time-series data. The proposed technique enables
acoustic analysis of intermediate convolutional layers. To uncover how
meaningful representation in speech gets encoded in intermediate layers of
CNNs, we manipulate individual latent variables to marginal levels outside of
the training range. We train and probe internal representations on two models
-- a bare WaveGAN architecture and a ciwGAN extension that forces the
Generator to output informative data, which results in the emergence of
linguistically meaningful representations. Interpretation and visualization
are performed for three basic acoustic properties of speech: periodic
vibration (corresponding to vowels), aperiodic noise (corresponding to
fricatives), and silence
(corresponding to stops). We also argue that the proposed technique allows
acoustic analysis of intermediate layers that parallels the acoustic analysis
of human speech data: we can extract F0, intensity, duration, formants, and
other acoustic properties from intermediate layers in order to test where and
how CNNs encode various types of information. The models are trained on two
speech processes with different degrees of complexity: a simple presence of [s]
and a computationally complex presence of reduplication (copied material).
Observing the causal relationship between latent-variable interpolation and
the resulting changes in intermediate layers reveals how individual variables
are transformed into spikes in activation in intermediate layers. Using the
proposed technique, we
can analyze how linguistically meaningful units in speech get encoded in
different convolutional layers.
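The core of the technique can be sketched in a few lines. The following is a minimal illustration, not the authors' released code: it assumes a PyTorch reimplementation of a WaveGAN-style Generator with invented layer sizes, and the manipulated latent value (5.0) is just an arbitrary example of a marginal level outside the U(-1, 1) training range. It shows (1) averaging over feature maps after the ReLU activation in each convolutional layer to obtain one interpretable time series per layer, and (2) setting an individual latent variable to a marginal value and comparing the resulting per-layer time series against a baseline.

```python
# Minimal sketch (not the authors' released code): averaging feature maps after
# ReLU in each layer of a WaveGAN-style generator, then manipulating one latent
# variable beyond the training range. Architecture and sizes are illustrative.
import torch
import torch.nn as nn

class TinyWaveGANGenerator(nn.Module):
    """Illustrative stand-in for a WaveGAN/ciwGAN Generator (5 conv layers)."""
    def __init__(self, latent_dim=100):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 256 * 16)
        self.convs = nn.ModuleList([
            nn.ConvTranspose1d(256, 128, 25, stride=4, padding=11, output_padding=1),
            nn.ConvTranspose1d(128, 64, 25, stride=4, padding=11, output_padding=1),
            nn.ConvTranspose1d(64, 32, 25, stride=4, padding=11, output_padding=1),
            nn.ConvTranspose1d(32, 16, 25, stride=4, padding=11, output_padding=1),
            nn.ConvTranspose1d(16, 1, 25, stride=4, padding=11, output_padding=1),
        ])

    def forward(self, z, return_layers=False):
        x = torch.relu(self.fc(z)).view(-1, 256, 16)
        layer_averages = []
        for i, conv in enumerate(self.convs):
            x = conv(x)
            # ReLU in intermediate layers, tanh on the output waveform.
            x = torch.relu(x) if i < len(self.convs) - 1 else torch.tanh(x)
            # Average over feature maps (channel dimension) after the activation:
            # each layer yields one interpretable time series per generated sample.
            layer_averages.append(x.mean(dim=1))
        return (x, layer_averages) if return_layers else x

gen = TinyWaveGANGenerator()
z = torch.FloatTensor(1, 100).uniform_(-1, 1)  # latents sampled as in training

# Manipulate a single latent variable to a marginal level outside the training
# range, keeping all other variables fixed (5.0 is an arbitrary example value).
z_manipulated = z.clone()
z_manipulated[0, 0] = 5.0

with torch.no_grad():
    _, baseline_layers = gen(z, return_layers=True)
    _, manipulated_layers = gen(z_manipulated, return_layers=True)

# Compare the averaged time series layer by layer; large differences indicate
# where the manipulated variable is transformed into spikes in activation.
for layer_idx, (a, b) in enumerate(zip(baseline_layers, manipulated_layers), start=1):
    diff = (b - a).abs().max().item()
    print(f"conv layer {layer_idx}: samples={a.shape[-1]}, max |diff| after averaging = {diff:.3f}")
```

Because each layer's averaged output is an ordinary time series, it can then be analyzed with the same tools used for human speech data (for example, estimating F0, intensity, duration, or formants in Praat or a comparable acoustic-analysis library), which is how the paper tests where and how the CNNs encode various types of information.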
Related papers
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using
Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z) - A knowledge-driven vowel-based approach of depression classification
from speech using data augmentation [10.961439164833891]
We propose a novel explainable machine learning (ML) model that identifies depression from speech.
Our method first models variable-length utterances at the local level into fixed-size vowel-based embeddings.
Depression is then classified at the global level from a group of vowel CNN embeddings that serve as the input to another 1D CNN.
arXiv Detail & Related papers (2022-10-27T08:34:08Z) - Self-supervised models of audio effectively explain human cortical
responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in the human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z) - Deep Neural Convolutive Matrix Factorization for Articulatory
Representation Decomposition [48.56414496900755]
This work uses a neural implementation of convolutive sparse matrix factorization to decompose the articulatory data into interpretable gestures and gestural scores.
Phoneme recognition experiments were additionally performed to show that gestural scores indeed code phonological information successfully.
arXiv Detail & Related papers (2022-04-01T14:25:19Z) - 1-D CNN based Acoustic Scene Classification via Reducing Layer-wise
Dimensionality [2.5382095320488665]
This paper presents an alternative to the commonly used time-frequency representation for acoustic scene classification (ASC).
A raw audio signal is represented using the various intermediate layers of a pre-trained convolutional neural network (CNN).
The proposed framework outperforms the time-frequency representation based methods.
arXiv Detail & Related papers (2022-03-31T02:00:31Z) - Self-Supervised Learning for speech recognition with Intermediate layer
supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z) - Interpreting intermediate convolutional layers in unsupervised acoustic
word classification [0.0]
This paper proposes a technique to visualize and interpret intermediate layers of unsupervised deep convolutional neural networks.
A GAN-based architecture (ciwGAN arXiv:2006.02951) was trained on unlabeled sliced lexical items from TIMIT.
arXiv Detail & Related papers (2021-10-05T21:53:32Z) - What do End-to-End Speech Models Learn about Speaker, Language and
Channel Information? A Layer-wise and Neuron-level Analysis [16.850888973106706]
We conduct a post-hoc functional interpretability analysis of pretrained speech models using the probing framework.
We analyze utterance-level representations of speech models trained for various tasks such as speaker recognition and dialect identification.
Our results reveal several novel findings, including: i) channel and gender information are distributed across the network, ii) the information is redundantly available in neurons with respect to a task, and iii) complex properties such as dialectal information are encoded only in the task-oriented pretrained network.
arXiv Detail & Related papers (2021-07-01T13:32:55Z) - SPLAT: Speech-Language Joint Pre-Training for Spoken Language
Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z) - Local and non-local dependency learning and emergence of rule-like
representations in speech data by Deep Convolutional Generative Adversarial
Networks [0.0]
This paper argues that training GANs on local and non-local dependencies in speech data offers insights into how deep neural networks discretize continuous data.
arXiv Detail & Related papers (2020-09-27T00:02:34Z) - Temporal-Spatial Neural Filter: Direction Informed End-to-End
Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.