Self-attention encoding and pooling for speaker recognition
- URL: http://arxiv.org/abs/2008.01077v1
- Date: Mon, 3 Aug 2020 09:31:27 GMT
- Title: Self-attention encoding and pooling for speaker recognition
- Authors: Pooyan Safari, Miquel India and Javier Hernando
- Abstract summary: We propose a tandem Self-Attention Encoding and Pooling (SAEP) mechanism to obtain a discriminative speaker embedding given non-fixed-length speech utterances.
SAEP encodes short-term speaker spectral features into speaker embeddings to be used in text-independent speaker verification.
We have evaluated this approach on both the VoxCeleb1 and VoxCeleb2 datasets.
- Score: 16.96341561111918
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The computing power of mobile devices limits end-user applications in
terms of storage size, processing, memory and energy consumption. These
limitations motivate researchers to design more efficient deep models. On the
other hand, self-attention networks based on the Transformer architecture have
attracted remarkable interest due to their high parallelization capabilities
and strong performance on a variety of Natural Language Processing (NLP)
applications. Inspired by the Transformer, we propose a tandem Self-Attention
Encoding and Pooling (SAEP) mechanism to obtain a discriminative speaker
embedding given non-fixed-length speech utterances. SAEP is a stack of
identical blocks relying solely on self-attention and position-wise
feed-forward networks to create vector representations of speakers. This
approach encodes short-term speaker spectral features into speaker embeddings
to be used in text-independent speaker verification. We have evaluated this
approach on both the VoxCeleb1 and VoxCeleb2 datasets. The proposed
architecture outperforms the baseline x-vector and shows competitive
performance against other convolution-based benchmarks, with a significant
reduction in model size. It employs 94%, 95%, and 73% fewer parameters than
ResNet-34, ResNet-50, and x-vector, respectively. This indicates that the
proposed fully attention-based architecture is more efficient at extracting
time-invariant features from speaker utterances.
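To make the tandem encoding-and-pooling idea concrete, below is a minimal PyTorch sketch of a SAEP-style model: a stack of identical self-attention plus position-wise feed-forward blocks, followed by an attentive pooling step that collapses a variable number of frames into one fixed-size embedding. The layer sizes, head counts, number of blocks, and the exact pooling formulation are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch of a SAEP-style encoder + pooling (dimensions are assumptions).
import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    def __init__(self, dim=256, heads=4, ff_dim=512, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout,
                                          batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, ff_dim), nn.ReLU(),
                                nn.Linear(ff_dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                       # x: (batch, frames, dim)
        a, _ = self.attn(x, x, x)               # self-attention over frames
        x = self.norm1(x + a)                   # residual + layer norm
        return self.norm2(x + self.ff(x))       # position-wise feed-forward

class SAEP(nn.Module):
    def __init__(self, feat_dim=40, dim=256, blocks=2, emb_dim=256):
        super().__init__()
        self.proj = nn.Linear(feat_dim, dim)    # project spectral features
        self.encoder = nn.Sequential(*[SelfAttentionBlock(dim)
                                       for _ in range(blocks)])
        self.score = nn.Linear(dim, 1)          # per-frame pooling weights
        self.embed = nn.Linear(dim, emb_dim)

    def forward(self, feats):                   # feats: (batch, frames, feat_dim)
        h = self.encoder(self.proj(feats))
        w = torch.softmax(self.score(h), dim=1) # attention over time
        pooled = (w * h).sum(dim=1)             # fixed-size, length-invariant
        return self.embed(pooled)               # speaker embedding

emb = SAEP()(torch.randn(8, 300, 40))           # any number of frames works
print(emb.shape)                                # torch.Size([8, 256])
```

Because the pooling is a weighted sum over frames, the output size is independent of utterance length, which is what allows the same model to handle non-fixed-length input.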
Related papers
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments on the VoxCeleb and SITW datasets, with 9.56% and 8.24% average reductions in EER and minDCF, respectively.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- Parameter Efficient Audio Captioning With Faithful Guidance Using Audio-text Shared Latent Representation [0.9285295512807729]
We propose a data augmentation technique for generating hallucinated audio captions and show that similarity based on an audio-text shared latent space is suitable for detecting hallucination.
We then propose a parameter-efficient, inference-time faithful decoding algorithm that enables smaller audio captioning models to match the performance of larger models trained with more data.
arXiv Detail & Related papers (2023-09-06T19:42:52Z)
- SRU++: Pioneering Fast Recurrence with Attention for Speech Recognition [49.42625022146008]
We present the advantages of applying SRU++ to ASR tasks by comparing it with Conformer across multiple ASR benchmarks.
Our analysis shows that SRU++ can surpass Conformer on long-form speech input by a large margin.
arXiv Detail & Related papers (2021-10-11T19:23:50Z)
- End-to-End Speaker-Attributed ASR with Transformer [41.7739129773237]
This paper presents an end-to-end speaker-attributed automatic speech recognition system.
It jointly performs speaker counting, speech recognition, and speaker identification for monaural multi-talker audio.
arXiv Detail & Related papers (2021-04-05T19:54:15Z)
- A Hierarchical Transformer with Speaker Modeling for Emotion Recognition in Conversation [12.065178204539693]
Emotion Recognition in Conversation (ERC) is a personalized and interactive emotion recognition task.
Current methods model speakers' interactions by building a relation between every pair of speakers.
We simplify this complicated modeling to a binary version: intra-speaker and inter-speaker dependencies.
arXiv Detail & Related papers (2020-12-29T14:47:35Z)
- T-vectors: Weakly Supervised Speaker Identification Using Hierarchical Transformer Model [36.372432408617584]
This paper proposes a hierarchical network with transformer encoders and a memory mechanism for weakly supervised speaker identification.
The proposed model contains a frame-level encoder and a segment-level encoder, both built from transformer encoder blocks.
arXiv Detail & Related papers (2020-10-29T09:38:17Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many, location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
The approach combines a bottleneck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach achieves superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Unsupervised Speaker Adaptation using Attention-based Speaker Memory for End-to-End ASR [61.55606131634891]
We propose an unsupervised speaker adaptation method inspired by the neural Turing machine for end-to-end (E2E) automatic speech recognition (ASR).
The proposed model contains a memory block that holds speaker i-vectors extracted from the training data and reads relevant i-vectors from the memory through an attention mechanism; a minimal sketch of such an attention read appears after this list.
We show that M-vectors, which do not require an auxiliary speaker embedding extraction system at test time, achieve word error rates (WERs) similar to those of i-vectors for single-speaker utterances and significantly lower WERs for utterances containing speaker changes.
arXiv Detail & Related papers (2020-02-14T18:31:31Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled, noisy environments is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer-based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism; a sketch of a time-restricted attention mask also follows this list.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER on the "clean" and "other" test sets of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
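Two of the entries above describe attention mechanisms concretely enough to sketch. First, the attention read over a speaker memory (the M-vector entry): the memory rows are i-vectors gathered from training speakers, and a query derived from the current utterance softly addresses them. The function name, dimensions, and scoring by plain dot product are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of an attention read over a memory of speaker i-vectors.
import torch
import torch.nn.functional as F

def memory_read(query, memory):
    """query: (batch, d) utterance summary; memory: (n_speakers, d) i-vectors."""
    scores = query @ memory.t()              # similarity to each stored i-vector
    weights = F.softmax(scores, dim=-1)      # soft addressing over the memory
    return weights @ memory                  # weighted mix of relevant i-vectors

m_vec = memory_read(torch.randn(4, 100), torch.randn(500, 100))
print(m_vec.shape)                           # torch.Size([4, 100])
```

Second, time-restricted self-attention (the streaming transformer entry) can be realized as an additive attention mask that limits each frame to a bounded left/right context, which is what keeps latency finite in streaming. The window sizes below are illustrative assumptions.

```python
# Hedged sketch of a time-restricted self-attention mask for streaming ASR.
import torch

def time_restricted_mask(n_frames, left=16, right=4):
    idx = torch.arange(n_frames)
    rel = idx[None, :] - idx[:, None]        # relative position of key vs. query
    allowed = (rel >= -left) & (rel <= right)  # bounded left/right context window
    mask = torch.zeros(n_frames, n_frames)
    return mask.masked_fill(~allowed, float("-inf"))  # additive logit mask

mask = time_restricted_mask(6, left=2, right=1)
print(mask)  # 0 inside each frame's window, -inf outside
```

A mask of this shape can be passed as the attn_mask argument of torch.nn.MultiheadAttention, so the restriction composes directly with a standard transformer encoder.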