Phoneme-aware and Channel-wise Attentive Learning for Text Dependent
Speaker Verification
- URL: http://arxiv.org/abs/2106.13514v1
- Date: Fri, 25 Jun 2021 09:11:18 GMT
- Title: Phoneme-aware and Channel-wise Attentive Learning for Text Dependent
Speaker Verification
- Authors: Yan Liu, Zheng Li, Lin Li, Qingyang Hong
- Abstract summary: This paper proposes a multi-task learning network with phoneme-aware and channel-wise attentive learning strategies for text-dependent Speaker Verification (SV)
The proposed system achieves outstanding results for text-dependent SV.
- Score: 21.826585075806573
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a multi-task learning network with phoneme-aware and
channel-wise attentive learning strategies for text-dependent Speaker
Verification (SV). In the proposed structure, the frame-level multi-task
learning along with the segment-level adversarial learning is adopted for
speaker embedding extraction. The phoneme-aware attentive pooling is exploited
on frame-level features in the main network for the speaker classifier, with the
corresponding posterior probability for the phoneme distribution in the
auxiliary subnet. Further, the introduction of the Squeeze-and-Excitation block
(SE-block) performs dynamic channel-wise feature recalibration, which improves
the representational ability. The proposed method exploits speaker
idiosyncrasies associated with pass-phrases, and is further improved by the
phoneme-aware attentive pooling and SE-block from temporal and channel-wise
aspects, respectively. The experiments conducted on the RSR2015 Part 1 database
confirm that the proposed system achieves outstanding results for text-dependent SV.
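
To make the two attention mechanisms in the abstract concrete, below is a minimal PyTorch sketch of a channel-wise Squeeze-and-Excitation block and a phoneme-aware attentive pooling layer. The tensor shapes, the way phoneme posteriors from the auxiliary subnet are fused with frame-level features, and all module and parameter names are illustrative assumptions, not the authors' exact architecture; the multi-task and adversarial training losses are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SEBlock(nn.Module):
    """Squeeze-and-Excitation: squeeze over time, then a bottleneck MLP
    produces per-channel weights that recalibrate the feature map."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):                       # x: (batch, channels, frames)
        s = x.mean(dim=2)                       # squeeze over the time axis
        w = torch.sigmoid(self.fc2(F.relu(self.fc1(s))))
        return x * w.unsqueeze(-1)              # channel-wise recalibration


class PhonemeAwareAttentivePooling(nn.Module):
    """Attentive pooling whose frame weights are conditioned on phoneme
    posteriors supplied by an auxiliary phoneme-classification subnet."""

    def __init__(self, feat_dim: int, num_phonemes: int, att_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(feat_dim + num_phonemes, att_dim)
        self.score = nn.Linear(att_dim, 1)

    def forward(self, feats, phoneme_post):
        # feats: (batch, frames, feat_dim); phoneme_post: (batch, frames, num_phonemes)
        h = torch.tanh(self.proj(torch.cat([feats, phoneme_post], dim=-1)))
        alpha = torch.softmax(self.score(h), dim=1)   # attention over frames
        return (alpha * feats).sum(dim=1)             # segment-level speaker embedding
```

In this reading, the SE-block recalibrates channels of the frame-level feature maps, while the pooling layer weights frames using both the features and the phoneme posteriors before the speaker classifier.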
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
- Relational Proxy Loss for Audio-Text based Keyword Spotting [8.932603220365793]
This study aims to improve existing methods by leveraging the structural relations within acoustic embeddings and within text embeddings.
By incorporating the proposed Relational Proxy Loss (RPL), we demonstrated improved performance on the Wall Street Journal (WSJ) corpus.
arXiv Detail & Related papers (2024-06-08T01:21:17Z)
- Phonetic-aware speaker embedding for far-field speaker verification [25.50311094643337]
We propose a joint-training speech recognition and speaker recognition framework to exploit phonetic content for far-field speaker verification.
The framework encourages speaker embeddings to preserve phonetic information by matching the frame-based feature maps of a speaker embedding network with wav2vec's vectors.
Results show that the proposed framework outperforms the standard speaker embedding on the VOiCES Challenge 2019 evaluation set and the VoxCeleb1 test set.
arXiv Detail & Related papers (2023-11-27T08:45:35Z)
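
One way to read the frame-level matching idea in the far-field entry above is as an auxiliary distance loss between the two feature maps. The sketch below assumes time-aligned features, hypothetical feature dimensions, and a learned projection layer; it may differ from the paper's exact formulation.

```python
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical projection from the speaker network's frame dimension (e.g. 512)
# to the wav2vec feature dimension (e.g. 768); both sizes are assumptions.
proj = nn.Linear(512, 768)

def phonetic_matching_loss(speaker_frames, wav2vec_frames):
    """Keep the speaker network's frame-level feature map close to the
    (frozen) wav2vec frame vectors so phonetic content is preserved.
    speaker_frames: (frames, 512), wav2vec_frames: (frames, 768)."""
    return F.mse_loss(proj(speaker_frames), wav2vec_frames.detach())
```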
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
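
For the CTAP entry above, a minimal sketch of a frame-level contrastive objective between the two encoders' outputs is shown below. The encoders themselves, the temperature value, and the assumption of one-to-one frame alignment are placeholders rather than the paper's exact setup.

```python
import torch
import torch.nn.functional as F


def frame_contrastive_loss(phoneme_emb, speech_emb, temperature=0.07):
    """Symmetric InfoNCE over time-aligned phoneme/speech frame embeddings.

    phoneme_emb, speech_emb: (frames, dim) outputs of the two encoders
    for one utterance, already projected into the shared space.
    """
    p = F.normalize(phoneme_emb, dim=-1)
    s = F.normalize(speech_emb, dim=-1)
    logits = p @ s.t() / temperature            # (frames, frames) similarity matrix
    targets = torch.arange(p.size(0), device=p.device)
    # Matching frames are positives; all other frames act as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```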
- Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- Deep Learning For Prominence Detection In Children's Read Speech [13.041607703862724]
We present a system that operates on segmented speech waveforms to learn features relevant to prominent word detection for children's oral fluency assessment.
The chosen CRNN (convolutional recurrent neural network) framework, incorporating both word-level features and sequence information, is found to benefit from the perceptually motivated SincNet filters.
arXiv Detail & Related papers (2021-10-27T08:51:42Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
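
For the vector-quantized content encoding mentioned in the VQMIVC entry above, here is a minimal codebook-lookup sketch with a straight-through gradient estimator. The codebook size, feature dimension, and commitment weight are illustrative assumptions, and the mutual-information term is not shown.

```python
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through estimator,
    as commonly used to discretize content representations."""

    def __init__(self, num_codes: int = 512, dim: int = 64, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z):                        # z: (batch, frames, dim)
        flat = z.reshape(-1, z.size(-1))
        d = torch.cdist(flat, self.codebook.weight)         # distances to all codes
        idx = d.argmin(dim=-1)
        q = self.codebook(idx).view_as(z)                    # quantized frames
        # Codebook and commitment losses (VQ-VAE style).
        loss = ((q - z.detach()) ** 2).mean() + self.beta * ((z - q.detach()) ** 2).mean()
        q = z + (q - z).detach()                             # straight-through gradient
        return q, idx.view(z.shape[:-1]), loss
```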
- Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z)
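
For the momentum contrastive learning mentioned in the last entry, a bare-bones sketch of the momentum-encoder update and a MoCo-style loss with a queue of negative embeddings is given below. The encoders, queue size, momentum coefficient, and temperature are placeholder assumptions, and the prototypical extension is not shown.

```python
import torch
import torch.nn.functional as F


def momentum_update(encoder_q, encoder_k, m: float = 0.999):
    """Exponential moving average of the query encoder into the key encoder."""
    with torch.no_grad():
        for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
            p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)


def contrastive_step(encoder_q, encoder_k, x_q, x_k, queue, temperature=0.07):
    """One MoCo-style step: positives are two views (e.g. augmented segments)
    of the same utterance; negatives come from a queue of past key embeddings.
    queue: (queue_size, dim) tensor of L2-normalized embeddings."""
    q = F.normalize(encoder_q(x_q), dim=-1)            # (batch, dim)
    with torch.no_grad():
        k = F.normalize(encoder_k(x_k), dim=-1)        # (batch, dim)
    l_pos = (q * k).sum(dim=-1, keepdim=True)          # (batch, 1)
    l_neg = q @ queue.t()                              # (batch, queue_size)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels), k          # new keys go into the queue
```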
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.