Neural i-vectors
- URL: http://arxiv.org/abs/2004.01559v2
- Date: Sat, 18 Apr 2020 14:47:50 GMT
- Title: Neural i-vectors
- Authors: Ville Vestman, Kong Aik Lee, Tomi H. Kinnunen
- Abstract summary: We investigate the use of a deep embedding extractor and an i-vector extractor in succession.
To bundle the deep embedding extractor with an i-vector extractor, we add aggregation layers inspired by the Gaussian mixture model (GMM) to the embedding extractor networks.
We compare the deep embeddings to the proposed neural i-vectors on the Speakers in the Wild (SITW) and the Speaker Recognition Evaluation (SRE) 2018 and 2019 datasets.
- Score: 21.13825969777844
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep speaker embeddings have been demonstrated to outperform their generative
counterparts, i-vectors, in recent speaker verification evaluations. To combine
the benefits of high performance and generative interpretation, we investigate
the use of a deep embedding extractor and an i-vector extractor in succession. To
bundle the deep embedding extractor with an i-vector extractor, we add
aggregation layers inspired by the Gaussian mixture model (GMM) to the
embedding extractor networks. The inclusion of a GMM-like layer allows the
discriminatively trained network to be used as a provider of sufficient
statistics for the i-vector extractor to extract what we call neural i-vectors.
We compare the deep embeddings to the proposed neural i-vectors on the Speakers
in the Wild (SITW) and the Speaker Recognition Evaluation (SRE) 2018 and 2019
datasets. On the core-core condition of SITW, our deep embeddings obtain
performance comparable to the state-of-the-art. The neural i-vectors obtain
about 50% worse performance than the deep embeddings, but on the other hand
outperform the previous i-vector approaches reported in the literature by a
clear margin.
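The core idea above can be illustrated with a small sketch: a GMM-like aggregation layer soft-assigns frame-level network features to learned components and accumulates the zeroth- and first-order Baum-Welch statistics that an i-vector extractor consumes. All names, shapes, and the unit-covariance assumption here are illustrative, not the authors' implementation (in the paper the layer is trained end-to-end inside the network).

```python
import numpy as np

def gmm_layer_stats(frames, means, log_weights):
    """Soft-assign frames to C learned components and accumulate
    Baum-Welch sufficient statistics (illustrative sketch only)."""
    # frames: (T, D) frame-level embeddings; means: (C, D) component means.
    # Component log-likelihoods up to a constant, assuming unit covariance.
    d2 = ((frames[:, None, :] - means[None, :, :]) ** 2).sum(-1)   # (T, C)
    logits = log_weights[None, :] - 0.5 * d2
    logits -= logits.max(axis=1, keepdims=True)                    # stability
    post = np.exp(logits)
    post /= post.sum(axis=1, keepdims=True)   # responsibilities, rows sum to 1
    N = post.sum(axis=0)                      # zeroth-order statistics (C,)
    F = post.T @ frames                       # first-order statistics (C, D)
    return N, F
```

An i-vector extractor would then map the pair (N, F) to a low-dimensional vector through the total variability model.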
Related papers
- VQ-T: RNN Transducers using Vector-Quantized Prediction Network States [52.48566999668521]
We propose to use vector-quantized long short-term memory units in the prediction network of RNN transducers.
By training the discrete representation jointly with the ASR network, hypotheses can be actively merged for lattice generation.
Our experiments on the Switchboard corpus show that the proposed VQ RNN transducers improve ASR performance over transducers with regular prediction networks.
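A minimal sketch of the quantization step described above, under illustrative assumptions: the continuous prediction-network state is snapped to its nearest codebook entry, so hypotheses whose states share a codebook index can be merged during lattice generation. This is not the paper's jointly trained setup, just the nearest-neighbour lookup at its core.

```python
import numpy as np

def quantize_state(h, codebook):
    """Map a continuous prediction-network state h to its nearest
    codebook entry; hypotheses with the same index can be merged."""
    idx = int(np.argmin(((codebook - h) ** 2).sum(axis=1)))
    return idx, codebook[idx]
```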
arXiv Detail & Related papers (2022-08-03T02:45:52Z)
- Learning Cluster Patterns for Abstractive Summarization [0.0]
We consider two clusters of salient and non-salient context vectors, using which the decoder can attend more to salient context vectors for summary generation.
Our experimental result shows that the proposed model outperforms the existing BART model by learning these distinct cluster patterns.
arXiv Detail & Related papers (2022-02-22T15:15:24Z)
- STC speaker recognition systems for the NIST SRE 2021 [56.05258832139496]
This paper presents a description of STC Ltd. systems submitted to the NIST 2021 Speaker Recognition Evaluation.
These systems consist of a number of diverse subsystems that use deep neural networks as feature extractors.
For the video modality, we developed our best solution with the RetinaFace face detector and a deep ResNet face embedding extractor trained on large face image datasets.
arXiv Detail & Related papers (2021-11-03T15:31:01Z)
- A SAR speckle filter based on Residual Convolutional Neural Networks [68.8204255655161]
This work presents a novel method for filtering speckle noise from Sentinel-1 data by applying Deep Learning (DL) algorithms based on Convolutional Neural Networks (CNNs).
The obtained results, compared with the state of the art, show a clear improvement in terms of Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM).
arXiv Detail & Related papers (2021-04-19T14:43:07Z)
- Adversarial Feature Augmentation and Normalization for Visual Recognition [109.6834687220478]
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models.
Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings.
We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
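As a rough illustration of augmenting intermediate feature embeddings adversarially, the following applies an FGSM-style step along the sign of the loss gradient to a feature vector before it is fed to the remaining layers. The function name, the fixed step size, and the FGSM choice are assumptions for the sketch, not the paper's exact procedure.

```python
import numpy as np

def perturb_embedding(feat, grad, eps=0.1):
    """Adversarial augmentation on an intermediate feature embedding:
    move the feature along the sign of the loss gradient (FGSM-style);
    training then continues on the perturbed feature."""
    return feat + eps * np.sign(grad)
```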
arXiv Detail & Related papers (2021-03-22T20:36:34Z)
- Neural Contextual Bandits with Deep Representation and Shallow Exploration [105.8099566651448]
We propose a novel learning algorithm that transforms the raw feature vector using the last hidden layer of a deep ReLU neural network.
Compared with existing neural contextual bandit algorithms, our approach is computationally much more efficient since it only needs to explore in the last layer of the deep neural network.
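The "shallow exploration" idea can be sketched as linear UCB computed only on the last-layer features phi(x) of the deep network: an exploitation term from a linear model plus an exploration bonus from the feature precision matrix. Variable names and the fixed bonus weight alpha are illustrative assumptions.

```python
import numpy as np

def ucb_score(phi, A_inv, theta, alpha=1.0):
    """Linear UCB on last-layer features only: estimated reward
    plus an exploration bonus sqrt(phi^T A^{-1} phi)."""
    return float(theta @ phi + alpha * np.sqrt(phi @ A_inv @ phi))

def select_arm(phis, A_inv, theta, alpha=1.0):
    """Pick the arm whose last-layer feature maximizes the UCB score."""
    return int(np.argmax([ucb_score(p, A_inv, theta, alpha) for p in phis]))
```

Because exploration happens only in this final linear layer, the per-step cost scales with the last-layer width rather than the full network.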
arXiv Detail & Related papers (2020-12-03T09:17:55Z)
- Combination of Deep Speaker Embeddings for Diarisation [9.053645441056256]
This paper proposes a c-vector method by combining multiple sets of complementary d-vectors derived from systems with different NN components.
A neural-based single-pass speaker diarisation pipeline is also proposed in this paper.
Experiments and detailed analyses are conducted on the challenging AMI and NIST RT05 datasets.
arXiv Detail & Related papers (2020-10-22T20:16:36Z)
- Compact Speaker Embedding: lrx-vector [23.297692312524546]
We present the lrx-vector system, which is the low-rank factorized version of the x-vector embedding network.
The primary objective of this topology is to further reduce the memory requirement of the speaker recognition system.
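The memory saving comes from replacing a dense weight matrix W (m x n) with a rank-r product U @ V, cutting parameters from m*n to r*(m+n). As a hedged sketch (the lrx-vector system trains its factors; truncated SVD here is only one way to initialize such a factorization):

```python
import numpy as np

def low_rank_factorize(W, r):
    """Factorize a dense layer weight W (m x n) as U @ V with rank r,
    reducing parameters from m*n to r*(m+n). Truncated SVD gives the
    best rank-r approximation in the Frobenius norm."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :r] * s[:r], Vt[:r, :]
```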
arXiv Detail & Related papers (2020-08-11T21:32:16Z)
- Probabilistic embeddings for speaker diarization [13.276960253126656]
Speaker embeddings (x-vectors) extracted from very short segments of speech have recently been shown to give competitive performance in speaker diarization.
We generalize this recipe by extracting from each speech segment, in parallel with the x-vector, also a diagonal precision matrix.
These precisions quantify the uncertainty about what the values of the embeddings might have been if they had been extracted from high quality speech segments.
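One natural use of such per-segment precisions, sketched here under illustrative assumptions (this is a generic precision-weighted fusion, not necessarily the paper's scoring model): when combining embeddings from several segments, segments with low precision, i.e. high uncertainty, contribute less.

```python
import numpy as np

def precision_weighted_pool(xvecs, precisions):
    """Fuse per-segment embeddings with their diagonal precisions:
    each dimension is a precision-weighted average, so uncertain
    segments are down-weighted."""
    # xvecs, precisions: (S, D) for S segments of dimension D.
    total = np.sum(precisions, axis=0)            # (D,)
    weighted = np.sum(precisions * xvecs, axis=0) # (D,)
    return weighted / total
```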
arXiv Detail & Related papers (2020-04-06T14:51:01Z)
- Unsupervised Speaker Adaptation using Attention-based Speaker Memory for End-to-End ASR [61.55606131634891]
We propose an unsupervised speaker adaptation method inspired by the neural Turing machine for end-to-end (E2E) automatic speech recognition (ASR).
The proposed model contains a memory block that holds speaker i-vectors extracted from the training data and reads relevant i-vectors from the memory through an attention mechanism.
We show that M-vectors, which do not require an auxiliary speaker embedding extraction system at test time, achieve similar word error rates (WERs) to i-vectors for single-speaker utterances and significantly lower WERs for utterances in which there are speaker changes.
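The memory read described above can be sketched as a single softmax-attention step: similarity scores between an utterance-derived query and the stored speaker i-vectors produce weights, and the "M-vector" is the resulting convex combination of memory entries. The dot-product scoring and names here are illustrative assumptions.

```python
import numpy as np

def read_speaker_memory(query, memory):
    """Attention read over a memory of stored speaker vectors:
    softmax over similarity scores yields a convex combination
    of memory rows (an 'M-vector' in the paper's terminology)."""
    scores = memory @ query           # one score per memory entry
    scores -= scores.max()            # numerical stability
    w = np.exp(scores)
    w /= w.sum()                      # attention weights, sum to 1
    return w @ memory
```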
arXiv Detail & Related papers (2020-02-14T18:31:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.