Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding
- URL: http://arxiv.org/abs/2107.06493v1
- Date: Wed, 14 Jul 2021 05:38:48 GMT
- Title: Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding
- Authors: Hongning Zhu, Kong Aik Lee, Haizhou Li
- Abstract summary: In prior works, frame-level features from one layer are aggregated to form an utterance-level representation.
Inspired by the Transformer network, our proposed method utilizes the hierarchical architecture of stacked self-attention mechanisms.
With more layers stacked, the neural network can learn more discriminative speaker embeddings.
- Score: 93.16866430882204
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper proposes a serialized multi-layer multi-head attention mechanism for neural speaker embedding in text-independent speaker verification. In prior works, frame-level features from one layer are aggregated to form an utterance-level representation. Inspired by the Transformer network, our proposed method utilizes the hierarchical architecture of stacked self-attention mechanisms to derive refined features that are more correlated with speakers. The serialized attention mechanism contains a stack of self-attention modules to create fixed-dimensional representations of speakers. Instead of utilizing multi-head attention in parallel, the proposed serialized multi-layer multi-head attention is designed to aggregate and propagate attentive statistics from one layer to the next in a serialized manner. In addition, we employ an input-aware query for each utterance with statistics pooling. With more layers stacked, the neural network can learn more discriminative speaker embeddings. Experimental results on the VoxCeleb1 and SITW datasets show that the proposed method outperforms baseline methods, including x-vectors and x-vectors with conventional attentive pooling, by 9.7% in EER and 8.1% in DCF0.01.
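To make the serialized aggregation concrete, here is a minimal PyTorch sketch of the idea in the abstract: each layer derives an input-aware query from statistics pooling over its frame-level features, attends over the frames to produce utterance-level attentive statistics, and propagates both the refined frames and an accumulated embedding to the next layer. The class names, dimensions, layer count, and the choice to aggregate by summation are illustrative assumptions, not the authors' exact architecture.

```python
# Hedged sketch of serialized multi-layer multi-head attention for
# speaker embedding. Assumptions (not from the paper): hidden size,
# number of layers, a single attention head per layer, and summation
# as the layer-to-layer aggregation of attentive statistics.
import torch
import torch.nn as nn

class SerializedAttentionLayer(nn.Module):
    """One serialized step: an input-aware query, built from mean/std
    statistics pooling, attends over frame-level features."""
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(2 * dim, dim)  # query from [mean; std] statistics
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor):        # x: (batch, frames, dim)
        stats = torch.cat([x.mean(dim=1), x.std(dim=1)], dim=-1)
        q = self.query(stats).unsqueeze(1)     # input-aware query: (batch, 1, dim)
        k, v = self.key(x), self.value(x)
        scores = q @ k.transpose(1, 2) / k.shape[-1] ** 0.5
        attn = torch.softmax(scores, dim=-1)   # attention weights over frames
        utt = (attn @ v).squeeze(1)            # utterance-level attentive statistics
        x = self.norm(x + self.ffn(x))         # refined frames for the next layer
        return x, utt

class SerializedEmbedding(nn.Module):
    """Stacks layers serially; utterance-level statistics from every
    layer are aggregated into one fixed-dimensional embedding."""
    def __init__(self, dim: int = 256, num_layers: int = 6):
        super().__init__()
        self.layers = nn.ModuleList(
            SerializedAttentionLayer(dim) for _ in range(num_layers))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        embedding = torch.zeros(x.shape[0], x.shape[-1], device=x.device)
        for layer in self.layers:
            x, utt = layer(x)
            embedding = embedding + utt        # aggregate layer to layer
        return embedding                       # fixed-dimensional speaker embedding
```

Under these assumptions, `SerializedEmbedding()(torch.randn(4, 200, 256))` returns a (4, 256) tensor, one fixed-dimensional embedding per utterance; the heads are spread one per layer rather than run in parallel within a layer, which is the serialized design the abstract describes.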
Related papers
- OTSeg: Multi-prompt Sinkhorn Attention for Zero-Shot Semantic Segmentation [57.84148140637513]
Multi-Prompts Sinkhorn Attention (MPSA) effectively replaces cross-attention mechanisms within the Transformer framework in multimodal settings.
OTSeg achieves state-of-the-art (SOTA) performance with significant gains on Zero-Shot Semantic Segmentation (ZS3) tasks.
arXiv Detail & Related papers (2024-03-21T07:15:37Z)
- Overlap-aware End-to-End Supervised Hierarchical Graph Clustering for Speaker Diarization [41.24045486520547]
We propose an end-to-end supervised hierarchical clustering algorithm based on graph neural networks (GNNs).
The proposed E-SHARC framework improves significantly over state-of-the-art diarization systems.
arXiv Detail & Related papers (2024-01-23T15:35:44Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using a variable-number permutation-invariant cross-entropy loss.
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features.
Experimental results on four benchmark datasets demonstrate that the proposed approach achieves state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-05-05T02:27:25Z)
- Time-Domain Speech Extraction with Spatial Information and Multi Speaker Conditioning Mechanism [27.19635746008699]
We present a novel multi-channel speech extraction system to simultaneously extract multiple clean individual sources from a mixture.
The proposed method is built on an improved multi-channel time-domain speech separation network.
Experiments on 2-channel WHAMR! data show that the proposed system improves source separation performance by 9% relative over a strong multi-channel baseline.
arXiv Detail & Related papers (2021-02-07T10:11:49Z)
- Self-attention encoding and pooling for speaker recognition [16.96341561111918]
We propose a tandem Self-Attention Encoding and Pooling (SAEP) mechanism to obtain a discriminative speaker embedding from variable-length speech utterances.
SAEP encodes short-term speaker spectral features into speaker embeddings to be used in text-independent speaker verification.
We have evaluated this approach on both the VoxCeleb1 and VoxCeleb2 datasets.
arXiv Detail & Related papers (2020-08-03T09:31:27Z)
- Self-Attentive Multi-Layer Aggregation with Feature Recalibration and Normalization for End-to-End Speaker Verification System [8.942112181408158]
We propose self-attentive multi-layer aggregation with feature recalibration and normalization for an end-to-end speaker verification system.
Experimental results using the VoxCeleb1 evaluation dataset showed that the performance of the proposed methods was comparable to that of state-of-the-art models.
arXiv Detail & Related papers (2020-07-27T08:10:46Z)
- Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario [51.50631198081903]
We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach.
TS-VAD directly predicts the activity of each speaker in each time frame.
Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results.
arXiv Detail & Related papers (2020-05-14T21:24:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.