Attention-based conditioning methods using variable frame rate for
style-robust speaker verification
- URL: http://arxiv.org/abs/2206.13680v1
- Date: Tue, 28 Jun 2022 01:14:09 GMT
- Title: Attention-based conditioning methods using variable frame rate for
style-robust speaker verification
- Authors: Amber Afshan, Abeer Alwan
- Abstract summary: We propose an approach to extract speaker embeddings robust to speaking style variations in text-independent speaker verification.
An entropy-based variable frame rate vector is proposed as an external conditioning vector for the self-attention layer.
- Score: 21.607777746331998
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose an approach to extract speaker embeddings that are robust to
speaking style variations in text-independent speaker verification. Typically,
speaker embedding extraction includes training a DNN for speaker classification
and using the bottleneck features as speaker representations. Such a network
has a pooling layer to transform frame-level to utterance-level features by
calculating statistics over all utterance frames, with equal weighting.
However, self-attentive embeddings perform weighted pooling such that the
weights correspond to the importance of the frames in a speaker classification
task. Entropy can capture acoustic variability due to speaking style
variations. Hence, an entropy-based variable frame rate vector is proposed as
an external conditioning vector for the self-attention layer to provide the
network with information that can address style effects. This work explores
five different approaches to conditioning. The best conditioning approach,
concatenation with gating, provided statistically significant improvements over
the x-vector baseline in 12/23 tasks and was on par with the baseline in the
other 11/23 when evaluated on the UCLA Speaker Variability Database. It also
significantly outperformed self-attention without conditioning in 9/23 tasks
and was worse in only 1/23. The method also showed significant improvements in
multi-speaker scenarios of the Speakers in the Wild (SITW) corpus.
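As a concrete illustration of the best-performing variant, below is a minimal PyTorch-style sketch of self-attentive statistics pooling conditioned by concatenation with gating. The layer sizes, the exact gating formulation, and the dimensionality of the conditioning vector are assumptions for illustration only, not the authors' implementation; the entropy-based VFR vector is taken as a given input.

```python
import torch
import torch.nn as nn

class GatedConditionedAttentivePooling(nn.Module):
    """Self-attentive statistics pooling whose attention scores are
    conditioned on an external vector (here, an entropy-based VFR vector).

    Hypothetical sketch of "concatenation with gating": the conditioning
    vector is gated element-wise, then concatenated with every frame-level
    feature before the attention scores are computed."""

    def __init__(self, feat_dim: int, cond_dim: int, attn_dim: int = 128):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(cond_dim, cond_dim), nn.Sigmoid())
        self.score = nn.Sequential(
            nn.Linear(feat_dim + cond_dim, attn_dim),
            nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )

    def forward(self, frames: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim); cond: (batch, cond_dim)
        gated = self.gate(cond) * cond                       # element-wise gate
        gated = gated.unsqueeze(1).expand(-1, frames.size(1), -1)
        scores = self.score(torch.cat([frames, gated], dim=-1))
        weights = torch.softmax(scores, dim=1)               # sum to 1 over time
        # Weighted statistics pooling: attention-weighted mean and std.
        mean = (weights * frames).sum(dim=1)
        var = (weights * (frames - mean.unsqueeze(1)) ** 2).sum(dim=1)
        return torch.cat([mean, var.clamp(min=1e-8).sqrt()], dim=-1)

# Usage: 4 utterances, 200 frames of 512-dim features, 8-dim conditioning
# vector (dimensionality assumed). Output is a (4, 1024) utterance embedding.
pool = GatedConditionedAttentivePooling(feat_dim=512, cond_dim=8)
emb = pool(torch.randn(4, 200, 512), torch.randn(4, 8))
```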
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
- Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading [73.59525356467574]
A speaker's own characteristics can be captured well by shallow networks from a few facial images, or even a single one,
whereas the fine-grained dynamic features associated with the speech content expressed by the talking face require deep sequential networks.
Our approach consistently outperforms existing methods.
arXiv Detail & Related papers (2023-10-08T07:48:25Z)
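For the lip-reading paper above, a hedged sketch of the general idea of speaker-specific hidden unit contributions: a shallow branch summarizes the speaker from facial-image features and produces per-unit gates that rescale the deep sequential content features. Names, shapes, and the gating placement are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SpeakerGatedUnits(nn.Module):
    """Shallow speaker branch yields per-unit gates for deep content features."""

    def __init__(self, spk_dim: int, hidden_dim: int):
        super().__init__()
        # Shallow network: speaker traits from a single face embedding suffice.
        self.speaker_branch = nn.Sequential(nn.Linear(spk_dim, hidden_dim),
                                            nn.Sigmoid())

    def forward(self, content: torch.Tensor, spk: torch.Tensor) -> torch.Tensor:
        # content: (batch, time, hidden_dim) from a deep sequential network
        # spk: (batch, spk_dim) pooled from one or a few facial images
        gates = self.speaker_branch(spk).unsqueeze(1)   # (batch, 1, hidden_dim)
        return content * gates   # rescale each hidden unit's contribution
```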
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
The framework is validated on the VoxCeleb and SITW datasets, yielding average reductions of 9.56% in EER and 8.24% in minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
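For the Spin paper above, one plausible reading of speaker-invariant clustering, sketched under assumptions (the codebook size, temperature, and swapped-prediction form are illustrative, not the published recipe): encode an utterance and a speaker-perturbed copy of it, soft-assign both to a shared codebook, and let each view's assignments supervise the other.

```python
import torch
import torch.nn.functional as F

def speaker_invariant_clustering_loss(z_orig: torch.Tensor,
                                      z_perturbed: torch.Tensor,
                                      codebook: torch.Tensor,
                                      temp: float = 0.1) -> torch.Tensor:
    """z_orig, z_perturbed: (frames, dim) encoder outputs for an utterance
    and its speaker-perturbed copy; codebook: (clusters, dim) learnable."""
    codes = F.normalize(codebook, dim=-1)
    logits_o = F.normalize(z_orig, dim=-1) @ codes.T / temp
    logits_p = F.normalize(z_perturbed, dim=-1) @ codes.T / temp
    # Swapped prediction: frames must cluster the same way in both views,
    # so only speaker-invariant (content) information survives.
    loss_o = F.cross_entropy(logits_o, logits_p.detach().softmax(dim=-1))
    loss_p = F.cross_entropy(logits_p, logits_o.detach().softmax(dim=-1))
    return 0.5 * (loss_o + loss_p)
```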
- Speech Separation based on Contrastive Learning and Deep Modularization [3.2634122554914002]
Contrastive learning is used to establish frame representations, with a self-supervised objective that minimizes the distance between frames belonging to the same speaker.
The learned representations then drive a downstream deep modularization task that clusters frames by speaker identity.
arXiv Detail & Related papers (2023-05-18T02:19:05Z)
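For the speech-separation paper above, a minimal sketch of a frame-level contrastive objective (the similarity measure, temperature, and pairing rule are assumptions): frames from the same speaker are pulled together and frames from different speakers pushed apart.

```python
import torch
import torch.nn.functional as F

def frame_contrastive_loss(frames: torch.Tensor, speaker_ids: torch.Tensor,
                           temp: float = 0.1) -> torch.Tensor:
    """frames: (n, dim) frame embeddings; speaker_ids: (n,) integer labels.
    Frames sharing a speaker id are positives; all other frames serve as
    negatives (an NT-Xent-style formulation)."""
    n = frames.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=frames.device)
    z = F.normalize(frames, dim=-1)
    sim = (z @ z.T / temp).masked_fill(eye, float("-inf"))
    pos = (speaker_ids.unsqueeze(0) == speaker_ids.unsqueeze(1)) & ~eye
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    # Maximize the log-probability of every positive (same-speaker) pair.
    return -log_prob[pos].mean()
```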
- Improving Prosody for Cross-Speaker Style Transfer by Semi-Supervised Style Extractor and Hierarchical Modeling in Speech Synthesis [37.65745551401636]
Cross-speaker style transfer in speech synthesis aims to transfer a style from a source speaker to synthesized speech in a target speaker's timbre.
In most previous methods, the synthesized fine-grained prosody features merely represent the source speaker's average style.
A strength-controlled semi-supervised style extractor is proposed to disentangle style from content and timbre.
arXiv Detail & Related papers (2023-03-14T08:52:58Z)
- Collar-aware Training for Streaming Speaker Change Detection in Broadcast Speech [0.0]
We present a novel training method for speaker change detection models.
The proposed method uses an objective function which encourages the model to predict a single positive label within a specified collar.
arXiv Detail & Related papers (2022-05-14T15:35:43Z)
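For the collar-aware training paper above, a hedged sketch of what such an objective could look like (the loss form and normalization are assumptions, not the paper's exact function): inside a collar around each annotated change point, only the single strongest prediction is rewarded; positives outside every collar are penalized.

```python
import torch

def collar_aware_loss(probs: torch.Tensor, change_frames: list,
                      collar: int = 10) -> torch.Tensor:
    """probs: (T,) per-frame speaker-change probabilities in [0, 1];
    change_frames: annotated change positions (frame indices)."""
    T = probs.size(0)
    in_collar = torch.zeros(T, dtype=torch.bool)
    loss = probs.new_zeros(())
    for c in change_frames:
        lo, hi = max(0, c - collar), min(T, c + collar + 1)
        in_collar[lo:hi] = True
        peak = probs[lo:hi].max()   # the single positive inside this collar
        loss = loss - torch.log(peak.clamp_min(1e-8))
    # Push predictions outside all collars toward zero (false-alarm term).
    outside = (1 - probs[~in_collar]).clamp_min(1e-8)
    return loss - torch.log(outside).sum() / max(T, 1)
```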
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
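One standard way to turn overlapped diarization into a single-label problem is power-set encoding, where each frame's class is the subset of currently active speakers; whether SEND uses exactly this scheme is an assumption here, so treat the sketch as the general idea rather than the paper's formulation.

```python
from itertools import combinations

def powerset_labels(max_speakers: int) -> dict:
    """Map every subset of speakers (silence = empty set) to one class id,
    so overlapping activity becomes a single-label prediction per frame."""
    labels = {}
    for k in range(max_speakers + 1):
        for subset in combinations(range(max_speakers), k):
            labels[frozenset(subset)] = len(labels)
    return labels

# With 3 speakers there are 2**3 = 8 classes: {}, {0}, {1}, {2}, {0,1}, ...
classes = powerset_labels(3)
frame_label = classes[frozenset({0, 2})]  # speakers 0 and 2 overlap
```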
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results demonstrate the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
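For the VQMIVC paper above, a bare-bones sketch of the vector-quantization step for content encoding (codebook size and the straight-through gradient trick are standard choices, not details taken from the paper; the mutual-information penalty between the disentangled representations is omitted):

```python
import torch

def vector_quantize(z: torch.Tensor, codebook: torch.Tensor):
    """Replace each content frame with its nearest codebook entry.

    z: (time, dim) encoder output; codebook: (K, dim) learnable codes.
    Returns the quantized frames (with a straight-through gradient path)
    and the chosen code indices."""
    dists = torch.cdist(z, codebook)       # (time, K) Euclidean distances
    codes = dists.argmin(dim=-1)           # index of the nearest code
    q = codebook[codes]                    # (time, dim) quantized frames
    # Straight-through estimator: gradients flow to z as if q were identity.
    return z + (q - z).detach(), codes
```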
- Variable frame rate-based data augmentation to handle speaking-style variability for automatic speaker verification [23.970866246001652]
The effects of speaking-style variability on automatic speaker verification were investigated using the UCLA Speaker Variability database.
We propose an entropy-based variable frame rate technique to artificially generate style-normalized representations for PLDA adaptation.
arXiv Detail & Related papers (2020-08-08T22:47:12Z)
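For the variable-frame-rate augmentation paper above (and the entropy-based VFR vector in the main abstract), a hedged numpy sketch of the core idea; the "entropy" here is approximated by frame-to-frame spectral change rather than a true entropy estimate, and the sampling rule is an illustrative simplification: steady, low-information regions are thinned out while rapidly changing regions are kept, roughly normalizing speaking-style differences.

```python
import numpy as np

def vfr_subsample(features: np.ndarray, keep_base: float = 0.5) -> np.ndarray:
    """features: (T, D) frame-level features (e.g., MFCCs).
    Keep each frame with a probability that grows with its local novelty."""
    delta = np.linalg.norm(np.diff(features, axis=0), axis=1)
    delta = np.concatenate([[delta[0]], delta])        # pad the first frame
    novelty = delta / (delta.max() + 1e-8)             # normalize to [0, 1]
    keep_prob = np.clip(keep_base + (1 - keep_base) * novelty, 0.0, 1.0)
    mask = np.random.rand(len(features)) < keep_prob
    mask[0] = True                                     # always keep one frame
    return features[mask]

# Usage: style-normalize a (300, 40) feature matrix before PLDA adaptation.
augmented = vfr_subsample(np.random.randn(300, 40))
```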
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.