Learning spectro-temporal representations of complex sounds with
parameterized neural networks
- URL: http://arxiv.org/abs/2103.07125v1
- Date: Fri, 12 Mar 2021 07:53:47 GMT
- Title: Learning spectro-temporal representations of complex sounds with
parameterized neural networks
- Authors: Rachid Riad and Julien Karadayi and Anne-Catherine Bachoud-L\'evi and
Emmanuel Dupoux
- Abstract summary: We propose a parametrized neural network layer that computes specific spectro-temporal modulations based on Gabor kernels (Learnable STRFs).
We evaluated the predictive capabilities of this layer on Speech Activity Detection, Speaker Verification, Urban Sound Classification, and Zebra Finch Call Type Classification.
As this layer is fully interpretable, we used quantitative measures to describe the distribution of the learned spectro-temporal modulations.
- Score: 16.270691619752288
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Deep Learning models have become potential candidates for auditory
neuroscience research, thanks to their recent successes on a variety of
auditory tasks. Yet, these models often lack the interpretability needed to
fully understand the exact computations they perform. Here, we propose a
parametrized neural network layer that computes specific spectro-temporal
modulations based on Gabor kernels (Learnable STRFs) and that is fully
interpretable. We evaluated the predictive capabilities of this layer on Speech
Activity Detection, Speaker Verification, Urban Sound Classification, and Zebra
Finch Call Type Classification. We found that models based on Learnable STRFs
are on par with the different toplines on all tasks and obtain the best
performance on Speech Activity Detection. As this layer is fully
interpretable, we used quantitative measures to describe the distribution of
the learned spectro-temporal modulations. The filters adapted to each task and
focused mostly on low temporal and spectral modulations. The analyses show that
the filters learned on human speech have spectro-temporal parameters similar to
those measured directly in the human auditory cortex. Finally, we observed
that the tasks organized themselves in a meaningful way: the human vocalization
tasks lay close to each other, while the bird vocalization task lay far from
both the human vocalization and urban sound tasks.
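To make the layer described in the abstract concrete, the sketch below shows one way to implement a Learnable STRF layer in PyTorch: a 2-D convolution over a (time x frequency) spectrogram whose kernels are Gabor functions with learnable modulation frequencies (omega_t, omega_f) and envelope widths (sigma_t, sigma_f). This is a hedged reconstruction of the idea only, not the authors' released code; the kernel sizes, initialization, and use of the real part of the complex Gabor are illustrative assumptions.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LearnableSTRF(nn.Module):
    """2-D Gabor convolution over a (time, frequency) spectrogram.

    Each output channel is a Gabor kernel whose temporal/spectral modulation
    frequencies and Gaussian envelope widths are trained by backpropagation.
    """

    def __init__(self, n_filters: int = 32, kernel_t: int = 21, kernel_f: int = 21):
        super().__init__()
        self.kernel_t, self.kernel_f = kernel_t, kernel_f
        # One learnable parameter set per filter (initial values are arbitrary assumptions).
        self.omega_t = nn.Parameter(torch.empty(n_filters).uniform_(-0.5, 0.5))  # temporal modulation (rad/frame)
        self.omega_f = nn.Parameter(torch.empty(n_filters).uniform_(-0.5, 0.5))  # spectral modulation (rad/bin)
        self.sigma_t = nn.Parameter(torch.full((n_filters,), 3.0))  # temporal envelope width (frames)
        self.sigma_f = nn.Parameter(torch.full((n_filters,), 3.0))  # spectral envelope width (bins)

    def _gabor_kernels(self) -> torch.Tensor:
        device = self.omega_t.device
        # Centered grids over the kernel support.
        t = torch.arange(self.kernel_t, dtype=torch.float32, device=device) - self.kernel_t // 2
        f = torch.arange(self.kernel_f, dtype=torch.float32, device=device) - self.kernel_f // 2
        t = t.view(1, -1, 1)  # (1, T, 1)
        f = f.view(1, 1, -1)  # (1, 1, F)
        sig_t = self.sigma_t.view(-1, 1, 1)
        sig_f = self.sigma_f.view(-1, 1, 1)
        om_t = self.omega_t.view(-1, 1, 1)
        om_f = self.omega_f.view(-1, 1, 1)
        # Gaussian envelope times a cosine carrier (real part of the complex Gabor).
        envelope = torch.exp(-(t ** 2) / (2 * sig_t ** 2) - (f ** 2) / (2 * sig_f ** 2))
        carrier = torch.cos(om_t * t + om_f * f)
        return (envelope * carrier).unsqueeze(1)  # (n_filters, 1, T, F)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, time, freq), e.g. a log-mel spectrogram.
        return F.conv2d(spec, self._gabor_kernels(),
                        padding=(self.kernel_t // 2, self.kernel_f // 2))


# Usage: filter a batch of log-mel spectrograms (shapes are illustrative).
x = torch.randn(4, 1, 200, 64)       # 4 clips, 200 frames, 64 mel bands
y = LearnableSTRF(n_filters=32)(x)   # -> (4, 32, 200, 64)
```
Because every kernel is generated from a handful of named parameters, the learned filters can be read off directly after training; for example, the distribution of omega_t and omega_f corresponds to the temporal and spectral modulations the paper analyzes.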
Related papers
- On the Utility of Speech and Audio Foundation Models for Marmoset Call Analysis [19.205671029694074]
This study assesses feature representations derived from speech and general audio domains, across pre-training bandwidths of 4, 8, and 16 kHz for marmoset call-type and caller classification tasks.
Results show that models with higher bandwidth improve performance, and pre-training on speech or general audio yields comparable results, improving over a spectral baseline.
arXiv Detail & Related papers (2024-07-23T12:00:44Z) - Self-supervised models of audio effectively explain human cortical
responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in the human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z) - Deep Neural Convolutive Matrix Factorization for Articulatory
Representation Decomposition [48.56414496900755]
This work uses a neural implementation of convolutive sparse matrix factorization to decompose the articulatory data into interpretable gestures and gestural scores.
Phoneme recognition experiments were additionally performed to show that gestural scores indeed code phonological information successfully.
arXiv Detail & Related papers (2022-04-01T14:25:19Z) - An Exploration of Prompt Tuning on Generative Spoken Language Model for
Speech Processing Tasks [112.1942546460814]
We report the first exploration of the prompt tuning paradigm for speech processing tasks based on Generative Spoken Language Model (GSLM)
Experiment results show that the prompt tuning technique achieves competitive performance in speech classification tasks with fewer trainable parameters than fine-tuning specialized downstream models.
arXiv Detail & Related papers (2022-03-31T03:26:55Z) - Interpreting deep urban sound classification using Layer-wise Relevance
Propagation [5.177947445379688]
This work focuses on the sensitive application of assisting drivers suffering from hearing loss by constructing a deep neural network for urban sound classification.
We use two different representations of audio signals, i.e. Mel and constant-Q spectrograms, while the decisions made by the deep neural network are explained via layer-wise relevance propagation.
Overall, we present an explainable AI framework for understanding deep urban sound classification.
arXiv Detail & Related papers (2021-11-19T14:15:45Z) - Preliminary study on using vector quantization latent spaces for TTS/VC
systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding.
By enforcing different policies over the latent spaces during training, we are able to obtain a latent linguistic embedding.
Our experiments show that the voice cloning system built with vector quantization incurs only a small degradation in perceptual evaluations.
arXiv Detail & Related papers (2021-06-25T07:51:35Z) - Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence
Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z) - PhyAAt: Physiology of Auditory Attention to Speech Dataset [0.5976833843615385]
Auditory attention to natural speech is a complex brain process.
We present a dataset of physiological signals collected from an experiment on auditory attention to natural speech.
arXiv Detail & Related papers (2020-05-23T17:55:18Z) - Temporal-Spatial Neural Filter: Direction Informed End-to-End
Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from the multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z) - AudioMNIST: Exploring Explainable Artificial Intelligence for Audio
Analysis on a Simple Benchmark [12.034688724153044]
This paper explores post-hoc explanations for deep neural networks in the audio domain.
We present a novel open-source audio dataset consisting of 30,000 audio samples of spoken English digits.
We demonstrate the superior interpretability of audible explanations over visual ones in a human user study.
arXiv Detail & Related papers (2018-07-09T23:11:17Z)