Self-Attentive Multi-Layer Aggregation with Feature Recalibration and
Normalization for End-to-End Speaker Verification System
- URL: http://arxiv.org/abs/2007.13350v2
- Date: Tue, 28 Jul 2020 07:20:40 GMT
- Title: Self-Attentive Multi-Layer Aggregation with Feature Recalibration and
Normalization for End-to-End Speaker Verification System
- Authors: Soonshin Seo, Ji-Hwan Kim
- Abstract summary: We propose self-attentive multi-layer aggregation with feature recalibration and normalization for an end-to-end speaker verification system.
Experimental results using the VoxCeleb1 evaluation dataset showed that the performance of the proposed methods was comparable to that of state-of-the-art models.
- Score: 8.942112181408158
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the most important parts of an end-to-end speaker verification system
is the speaker embedding generation. In our previous paper, we reported that
shortcut-connection-based multi-layer aggregation improves the
representational power of the speaker embedding. However, the number of model
parameters is relatively large, and unspecified variations increase during
multi-layer aggregation. Therefore, we propose self-attentive multi-layer
aggregation with feature recalibration and normalization for an end-to-end
speaker verification system. To reduce the number of model parameters, a
ResNet with scaled channel width and layer depth is used as the baseline. To
control variability during training, a self-attention mechanism performs the
multi-layer aggregation with dropout regularization and batch normalization.
A feature recalibration layer is then applied to the aggregated feature using
fully-connected layers and nonlinear activation functions, and deep length
normalization is applied to the recalibrated feature during end-to-end
training. Experimental results on the VoxCeleb1 evaluation dataset showed
that the proposed methods were comparable to state-of-the-art models (equal
error rates of 4.95% and 2.86% when training on the VoxCeleb1 and VoxCeleb2
datasets, respectively).
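
The pipeline described in the abstract can be summarized in a short sketch. The following PyTorch code is a minimal illustration only: the layer sizes, the softmax attention over per-stage embeddings, the SE-style recalibration block, and the scale constant in the length normalization are all assumptions, not the authors' exact configuration.

```python
# A minimal sketch (not the authors' exact configuration) of the three
# components named in the abstract: self-attentive multi-layer aggregation,
# SE-style feature recalibration, and deep length normalization.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttentiveAggregation(nn.Module):
    """Weight and sum embeddings taken from several ResNet stages."""

    def __init__(self, dim: int, dropout: float = 0.2):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # content-based attention score
        self.bn = nn.BatchNorm1d(dim)    # batch normalization on the sum
        self.drop = nn.Dropout(dropout)  # dropout regularization

    def forward(self, layer_feats: list) -> torch.Tensor:
        # layer_feats: list of (batch, dim) tensors, one per ResNet stage.
        x = torch.stack(layer_feats, dim=1)            # (batch, L, dim)
        w = F.softmax(self.score(x), dim=1)            # (batch, L, 1)
        return self.drop(self.bn((w * x).sum(dim=1)))  # (batch, dim)


class FeatureRecalibration(nn.Module):
    """Recalibrate the aggregated feature with FC layers + nonlinearities."""

    def __init__(self, dim: int, reduction: int = 8):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim // reduction)
        self.fc2 = nn.Linear(dim // reduction, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scale = torch.sigmoid(self.fc2(F.relu(self.fc1(x))))
        return x * scale  # element-wise rescaling of the embedding


def deep_length_norm(x: torch.Tensor, alpha: float = 10.0) -> torch.Tensor:
    # L2-normalize the recalibrated feature, then rescale by a constant
    # so the embedding norm stays fixed during end-to-end training.
    return alpha * F.normalize(x, p=2, dim=-1)


# Usage: four per-stage embeddings -> aggregate -> recalibrate -> normalize.
stages = [torch.randn(8, 256) for _ in range(4)]
agg = SelfAttentiveAggregation(dim=256)
recal = FeatureRecalibration(dim=256)
emb = deep_length_norm(recal(agg(stages)))
print(emb.shape)  # torch.Size([8, 256])
```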
Related papers
- Rolling bearing fault diagnosis method based on generative adversarial enhanced multi-scale convolutional neural network model [7.600902237804825]
A rolling bearing fault diagnosis method based on a generative-adversarial-enhanced multi-scale convolutional neural network is proposed.
Experimental results show that, compared with a ResNet baseline, the proposed method achieves better generalization and noise robustness.
arXiv Detail & Related papers (2024-03-21T06:42:35Z)
- Towards Accurate Post-training Quantization for Reparameterized Models [6.158896686945439]
Current Post-training Quantization (PTQ) methods often lead to significant accuracy degradation.
This is primarily caused by channel-specific and sample-specific outliers.
We propose RepAPQ, a novel framework that preserves the accuracy of quantized reparameterized models.
arXiv Detail & Related papers (2024-02-25T15:42:12Z)
- Latent Autoregressive Source Separation [5.871054749661012]
This paper introduces vector-quantized Latent Autoregressive Source Separation (i.e., de-mixing an input signal into its constituent sources) without requiring additional gradient-based optimization or modifications of existing models.
Our separation method relies on the Bayesian formulation in which the autoregressive models are the priors, and a discrete (non-parametric) likelihood function is constructed by performing frequency counts over latent sums of addend tokens.
arXiv Detail & Related papers (2023-01-09T17:32:00Z)
- WLD-Reg: A Data-dependent Within-layer Diversity Regularizer [98.78384185493624]
Neural networks are composed of multiple layers arranged in a hierarchical structure and jointly trained with gradient-based optimization.
We propose to complement this traditional 'between-layer' feedback with additional 'within-layer' feedback to encourage the diversity of the activations within the same layer.
We present an extensive empirical study confirming that the proposed approach enhances the performance of several state-of-the-art neural network models in multiple tasks.
arXiv Detail & Related papers (2023-01-03T20:57:22Z)
- Investigation of Different Calibration Methods for Deep Speaker Embedding based Verification Systems [66.61691401921296]
This paper presents an investigation over several methods of score calibration for deep speaker embedding extractors.
An additional focus of this research is to estimate the impact of score normalization on the calibration performance of the system.
arXiv Detail & Related papers (2022-03-28T21:22:22Z)
- Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding [93.16866430882204]
In prior works, frame-level features from one layer are aggregated to form an utterance-level representation.
Inspired by the Transformer network, our proposed method utilizes the hierarchical architecture of stacked self-attention mechanisms.
With more layers stacked, the neural network can learn more discriminative speaker embeddings; a rough sketch of this stacked-attention pooling appears after this list.
arXiv Detail & Related papers (2021-07-14T05:38:48Z)
- Rethinking Skip Connection with Layer Normalization in Transformers and ResNets [49.87919454950763]
The skip connection is a widely used technique for improving the performance of deep neural networks.
In this work, we investigate how scale factors affect the effectiveness of the skip connection.
arXiv Detail & Related papers (2021-05-15T11:44:49Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- Set Based Stochastic Subsampling [85.5331107565578]
We propose a set-based two-stage end-to-end neural subsampling model that is jointly optimized with an arbitrary downstream task network.
We show that it outperforms the relevant baselines under low subsampling rates on a variety of tasks including image classification, image reconstruction, function reconstruction and few-shot classification.
arXiv Detail & Related papers (2020-06-25T07:36:47Z)
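
For the serialized multi-layer multi-head attention entry above, here is a rough sketch of what stacked self-attention pooling over frame-level features can look like. The use of nn.MultiheadAttention, the learned utterance-level query, and all dimensions are assumptions for illustration, not the cited paper's exact architecture.

```python
# Rough sketch of stacked self-attention pooling over frame-level
# features, in the spirit of serialized multi-layer multi-head attention.
import torch
import torch.nn as nn


class StackedAttentivePooling(nn.Module):
    def __init__(self, dim: int, heads: int = 4, depth: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True)
             for _ in range(depth)]
        )
        # A learned utterance-level query that attends over all frames.
        self.query = nn.Parameter(torch.randn(1, 1, dim))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, dim) frame-level features.
        utt = self.query.expand(frames.size(0), -1, -1)
        for attn in self.layers:
            pooled, _ = attn(utt, frames, frames)  # attend over frames
            utt = utt + pooled                     # accumulate across layers
        return utt.squeeze(1)                      # (batch, dim) embedding


frames = torch.randn(2, 200, 256)  # 2 utterances, 200 frames each
print(StackedAttentivePooling(dim=256)(frames).shape)  # torch.Size([2, 256])
```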