Deep Neural Networks for Automatic Speaker Recognition Do Not Learn Supra-Segmental Temporal Features
- URL: http://arxiv.org/abs/2311.00489v2
- Date: Thu, 2 Nov 2023 06:07:14 GMT
- Title: Deep Neural Networks for Automatic Speaker Recognition Do Not Learn Supra-Segmental Temporal Features
- Authors: Daniel Neururer, Volker Dellwo, Thilo Stadelmann
- Abstract summary: We present and apply a novel test to quantify to what extent the performance of state-of-the-art neural networks for speaker recognition can be explained by modeling SST.
We find that a variety of CNN- and RNN-based neural network architectures for speaker recognition do not model SST to any sufficient degree, even when forced to do so.
- Score: 2.724035499453558
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While deep neural networks have shown impressive results in automatic speaker
recognition and related tasks, remarkably little is understood about what
exactly is responsible for these results. Part of the success has
been attributed in prior work to their capability to model supra-segmental
temporal information (SST), i.e., learn rhythmic-prosodic characteristics of
speech in addition to spectral features. In this paper, we (i) present and
apply a novel test to quantify to what extent the performance of
state-of-the-art neural networks for speaker recognition can be explained by
modeling SST; and (ii) present several means to force respective nets to focus
more on SST and evaluate their merits. We find that a variety of CNN- and
RNN-based neural network architectures for speaker recognition do not model SST
to any sufficient degree, even when forced to do so. The results provide a highly
relevant basis for impactful future research into better exploitation of the
full speech signal and give insights into the inner workings of such networks,
enhancing explainability of deep learning for speech technologies.
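The abstract does not spell out the test itself. Purely as an illustration of the underlying idea (checking whether a speaker-similarity score survives the destruction of supra-segmental temporal structure), here is a minimal, hypothetical sketch; the frame-shuffling manipulation and all names (embed, sst_reliance_score) are assumptions, not the authors' protocol:

```python
import numpy as np

def shuffle_frames(features: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Destroy supra-segmental temporal structure by permuting the time
    axis, while leaving each spectral frame itself untouched."""
    perm = rng.permutation(features.shape[0])
    return features[perm]

def sst_reliance_score(embed, utt_a: np.ndarray, utt_b: np.ndarray,
                       n_trials: int = 10, seed: int = 0) -> float:
    """Compare speaker-similarity scores on intact vs. frame-shuffled
    inputs. A model that relies on SST should lose score when temporal
    order is destroyed; a purely spectral model should not."""
    rng = np.random.default_rng(seed)
    cos = lambda x, y: float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
    intact = cos(embed(utt_a), embed(utt_b))
    shuffled = np.mean([
        cos(embed(shuffle_frames(utt_a, rng)), embed(shuffle_frames(utt_b, rng)))
        for _ in range(n_trials)
    ])
    return intact - shuffled  # ~0 => SST contributes little to the score

# Toy usage with a dummy "embedding model" that just averages frames
# (time-order invariant by construction, so its SST reliance is zero):
if __name__ == "__main__":
    embed = lambda feats: feats.mean(axis=0)
    rng = np.random.default_rng(1)
    utt_a = rng.standard_normal((200, 40))  # 200 frames x 40 mel bins
    utt_b = rng.standard_normal((180, 40))
    print(sst_reliance_score(embed, utt_a, utt_b))  # prints 0.0
```

Under this reading, a score near zero indicates that the embedding relies on segmental/spectral content alone.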
Related papers
- Towards Scalable and Versatile Weight Space Learning [51.78426981947659]
This paper introduces the SANE approach to weight-space learning.
Our method extends the idea of hyper-representations towards sequential processing of subsets of neural network weights.
arXiv Detail & Related papers (2024-06-14T13:12:07Z)
- Probing the Information Encoded in Neural-based Acoustic Models of Automatic Speech Recognition Systems [7.207019635697126]
This article aims to determine what information is encoded in an automatic speech recognition acoustic model (AM), and where it is located.
Experiments are performed on speaker verification, acoustic environment classification, gender classification, tempo-distortion detection, and speech sentiment/emotion identification systems.
Analysis showed that neural-based AMs hold heterogeneous information that seems surprisingly uncorrelated with phoneme recognition.
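As an aside, probing studies of this kind typically fit a small classifier on frozen internal activations and read decodability off its held-out accuracy. A generic, hypothetical scikit-learn sketch (not the article's actual code; names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_layer(activations: np.ndarray, labels: np.ndarray) -> float:
    """Fit a linear probe on frozen layer activations and return its
    held-out accuracy: high accuracy means the probed property (speaker,
    gender, tempo distortion, ...) is linearly decodable at this layer."""
    x_tr, x_te, y_tr, y_te = train_test_split(
        activations, labels, test_size=0.3, random_state=0, stratify=labels)
    clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    return clf.score(x_te, y_te)

# Toy usage: random activations carry no information, so the probe
# should land near chance level (0.5 for two balanced classes).
rng = np.random.default_rng(0)
acts = rng.standard_normal((400, 128))   # 400 utterances x 128-dim layer
labels = rng.integers(0, 2, size=400)    # e.g. binary gender labels
print(probe_layer(acts, labels))
```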
arXiv Detail & Related papers (2024-02-29T18:43:53Z)
- How neural networks learn to classify chaotic time series [77.34726150561087]
We study the inner workings of neural networks trained to classify regular-versus-chaotic time series.
We find that the relation between input periodicity and activation periodicity is key for the performance of LKCNN models.
arXiv Detail & Related papers (2023-06-04T08:53:27Z)
- Surrogate Gradient Spiking Neural Networks as Encoders for Large Vocabulary Continuous Speech Recognition [91.39701446828144]
We show that spiking neural networks can be trained like standard recurrent neural networks using the surrogate gradient method.
They have shown promising results on speech command recognition tasks.
In contrast to their recurrent non-spiking counterparts, they show robustness to exploding gradient problems without the need to use gates.
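The surrogate-gradient method mentioned above replaces the non-differentiable spike with a smooth stand-in derivative in the backward pass, so the unrolled network trains like an ordinary RNN. A minimal, generic PyTorch sketch (illustrative, not the paper's implementation):

```python
import torch

class SpikeFn(torch.autograd.Function):
    """Heaviside spike with a surrogate gradient: the forward pass emits
    a binary spike, the backward pass substitutes a smooth derivative so
    the network can be trained with ordinary backpropagation."""
    @staticmethod
    def forward(ctx, v):
        ctx.save_for_backward(v)
        return (v > 0).float()

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        # fast-sigmoid surrogate: d/dv ~ 1 / (1 + |v|)^2
        return grad_out / (1.0 + v.abs()) ** 2

def lif_forward(inputs, w, beta=0.9, threshold=1.0):
    """Unroll a leaky integrate-and-fire layer over time, stepping it
    like a vanilla RNN cell (inputs: [time, batch, in_features])."""
    mem = torch.zeros(inputs.shape[1], w.shape[1])
    spikes = []
    for x_t in inputs:
        mem = beta * mem + x_t @ w          # leaky integration
        s = SpikeFn.apply(mem - threshold)  # fire when above threshold
        mem = mem - s * threshold           # soft reset after a spike
        spikes.append(s)
    return torch.stack(spikes)

# Toy usage: gradients flow to the weights despite the binary spikes.
x = torch.randn(50, 4, 16)           # 50 time steps, batch 4, 16 inputs
w = torch.randn(16, 8, requires_grad=True)
loss = lif_forward(x, w).mean()
loss.backward()
print(w.grad.shape)                  # torch.Size([16, 8])
```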
arXiv Detail & Related papers (2022-12-01T12:36:26Z)
- Spiking Neural Networks for event-based action recognition: A new task to understand their advantage [1.4348901037145936]
Spiking Neural Networks (SNNs) are characterised by their unique temporal dynamics.
We show how spiking neurons can enable temporal feature extraction in feed-forward neural networks.
We also show how recurrent SNNs can achieve comparable results to LSTM with a smaller number of parameters.
arXiv Detail & Related papers (2022-09-29T16:22:46Z)
- Data-driven emergence of convolutional structure in neural networks [83.4920717252233]
We show how fully-connected neural networks solving a discrimination task can learn a convolutional structure directly from their inputs.
By carefully designing data models, we show that the emergence of this pattern is triggered by the non-Gaussian, higher-order local structure of the inputs.
arXiv Detail & Related papers (2022-02-01T17:11:13Z)
- Towards efficient end-to-end speech recognition with biologically-inspired neural networks [10.457580011403289]
We introduce neural connectivity concepts emulating the axo-somatic and the axo-axonic synapses.
We demonstrate, for the first time, that a biologically realistic implementation of a large-scale ASR model can yield competitive performance levels.
arXiv Detail & Related papers (2021-10-04T21:24:10Z)
- Bayesian Neural Networks: An Introduction and Survey [22.018605089162204]
This article introduces Bayesian Neural Networks (BNNs) and the seminal research regarding their implementation.
Different approximate inference methods are compared, and used to highlight where future research can improve on current methods.
arXiv Detail & Related papers (2020-06-22T06:30:15Z)
- Rectified Linear Postsynaptic Potential Function for Backpropagation in Deep Spiking Neural Networks [55.0627904986664]
Spiking Neural Networks (SNNs) use temporal spike patterns to represent and transmit information, which is not only biologically realistic but also suitable for ultra-low-power event-driven neuromorphic implementation.
This paper investigates the contribution of spike timing dynamics to information encoding, synaptic plasticity and decision making, providing a new perspective on the design of future deep SNNs and neuromorphic hardware systems.
arXiv Detail & Related papers (2020-03-26T11:13:07Z)
- Untangling in Invariant Speech Recognition [17.996356271398295]
We study how information is untangled within neural networks trained to recognize speech.
We observe that speaker-specific nuisance variations are discarded by the network's hierarchy, whereas task-relevant properties are untangled in later layers.
We find that the deep representations carry out significant temporal untangling by efficiently extracting task-relevant features at each time step of the computation.
arXiv Detail & Related papers (2020-03-03T20:48:43Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environments is one of the most challenging and most in-demand tasks.
This paper presents approaches aimed at two goals: (a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and (b) reducing the quality degradation of the system for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.