SVVAD: Personal Voice Activity Detection for Speaker Verification
- URL: http://arxiv.org/abs/2305.19581v1
- Date: Wed, 31 May 2023 05:59:33 GMT
- Title: SVVAD: Personal Voice Activity Detection for Speaker Verification
- Authors: Zuheng Kang, Jianzong Wang, Junqing Peng, Jing Xiao
- Abstract summary: We propose a speaker verification-based voice activity detection (SVVAD) framework that adapts speech features according to which parts are most informative for speaker verification (SV). Experiments show that SVVAD significantly outperforms the baseline in terms of equal error rate (EER) under conditions where other speakers are mixed at different ratios.
- Score: 24.57668015470307
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Voice activity detection (VAD) improves the performance of speaker verification (SV) by preserving speech segments and attenuating the effects of non-speech. However, this scheme is not ideal: (1) it fails in noisy environments or multi-speaker conversations; (2) it is trained on inaccurate labels that are not sensitive to SV. To address this, we propose a speaker verification-based voice activity detection (SVVAD) framework that adapts the speech features according to which parts are most informative for SV. To achieve this, we introduce a label-free training method with triplet-like losses that completely avoids the performance degradation of SV due to incorrect labeling. Extensive experiments show that SVVAD significantly outperforms the baseline in terms of equal error rate (EER) under conditions where other speakers are mixed at different ratios. Moreover, the decision boundaries reveal the importance of the different parts of speech, which are largely consistent with human judgments.
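The abstract does not give the triplet-like, label-free loss in closed form. As a rough illustration of the idea, the sketch below trains a small frame-weighting network so that weighted pooling of frame features yields embeddings that keep same-speaker pairs close and pairs contaminated by other speakers apart. All module and variable names (`FrameWeightNet`, `weighted_embedding`, the cosine-distance margin) are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a triplet-style, label-free objective in the spirit of
# SVVAD: a weighting network predicts soft per-frame VAD weights, and the loss
# is defined purely on speaker embeddings, so no frame labels are needed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameWeightNet(nn.Module):
    """Predicts a soft VAD-like weight in [0, 1] for every frame."""
    def __init__(self, feat_dim: int = 80, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, feat_dim) -> weights: (batch, frames)
        return self.net(feats).squeeze(-1)

def weighted_embedding(feats, weights):
    # Soft-weighted average pooling of frame features into one embedding.
    w = weights.unsqueeze(-1)                        # (B, T, 1)
    return (feats * w).sum(dim=1) / (w.sum(dim=1) + 1e-8)

def triplet_like_loss(anchor, positive, negative, margin: float = 0.3):
    # Anchor/positive share a speaker; the negative contains interfering
    # speech from other speakers mixed in.
    d_ap = 1.0 - F.cosine_similarity(anchor, positive)
    d_an = 1.0 - F.cosine_similarity(anchor, negative)
    return F.relu(d_ap - d_an + margin).mean()
```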
Related papers
- Speaker Tagging Correction With Non-Autoregressive Language Models [0.0]
We propose a speaker tagging correction system based on a non-autoregressive language model.
We show that the employed error correction approach leads to reductions in word diarization error rate (WDER) on two datasets.
arXiv Detail & Related papers (2024-08-30T11:02:17Z)
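The entry above reports reductions in word diarization error rate (WDER). WDER is commonly defined as the fraction of aligned words whose hypothesis speaker tag disagrees with the reference speaker; the simplified sketch below assumes word-level alignment is already available and is not the paper's code.

```python
# Simplified, illustrative WDER: wrongly speaker-tagged words / aligned words.
def wder(aligned_pairs):
    """aligned_pairs: list of (ref_speaker, hyp_speaker) per aligned word."""
    if not aligned_pairs:
        return 0.0
    wrong = sum(1 for ref_spk, hyp_spk in aligned_pairs if ref_spk != hyp_spk)
    return wrong / len(aligned_pairs)

# Example: 2 of 5 words carry the wrong speaker tag -> WDER = 0.4
pairs = [("A", "A"), ("A", "B"), ("B", "B"), ("B", "B"), ("B", "A")]
print(wder(pairs))  # 0.4
```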
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments on the VoxCeleb and SITW datasets, with 9.56% and 8.24% average reductions in EER and minDCF, respectively.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
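Several entries here, like the one above, report equal error rate (EER). As background, EER is the operating point where the false-acceptance and false-rejection rates cross; below is a minimal NumPy sketch of computing it from trial scores and labels (1 = same speaker, 0 = different), not code from any of the papers.

```python
# Minimal EER computation: sweep thresholds, find where FAR and FRR cross.
import numpy as np

def equal_error_rate(scores, labels):
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    thresholds = np.sort(np.unique(scores))
    fars, frrs = [], []
    for t in thresholds:
        accept = scores >= t
        fars.append(np.mean(accept[labels == 0]))   # impostors accepted
        frrs.append(np.mean(~accept[labels == 1]))  # targets rejected
    fars, frrs = np.array(fars), np.array(frrs)
    i = np.argmin(np.abs(fars - frrs))              # closest crossing point
    return (fars[i] + frrs[i]) / 2.0

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(equal_error_rate(scores, labels))
```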
- In search of strong embedding extractors for speaker diarisation [49.7017388682077]
We tackle two key problems when adopting EEs for speaker diarisation.
First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation.
We show that better performance on widely adopted speaker verification evaluation protocols does not lead to better diarisation performance.
Second, embedding extractors are typically trained without exposure to overlapped speech or speaker changes; we propose two data augmentation techniques to alleviate this, making embedding extractors aware of overlapped speech or speaker-change input.
arXiv Detail & Related papers (2022-10-26T13:00:29Z)
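The summary above does not specify the two augmentations. One plausible reading, sketched below as pure illustration: mix a second speaker's waveform over part of a training utterance (overlap), or concatenate two speakers' segments (speaker change). Function names and the gain/ratio parameters are assumptions.

```python
# Illustrative augmentations exposing an embedding extractor to overlap and
# speaker changes; not the paper's implementation.
import numpy as np

def overlap_augment(wav_a, wav_b, overlap_ratio=0.5, gain_db=-5.0):
    """Mix the start of wav_b over the tail of wav_a at reduced gain."""
    n = min(int(len(wav_a) * overlap_ratio), len(wav_b))
    out = wav_a.copy()
    gain = 10.0 ** (gain_db / 20.0)
    out[len(wav_a) - n:] += gain * wav_b[:n]
    return out

def speaker_change_augment(wav_a, wav_b):
    """Concatenate two different speakers to simulate a speaker change."""
    return np.concatenate([wav_a, wav_b])
```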
- Learning from human perception to improve automatic speaker verification in style-mismatched conditions [21.607777746331998]
Our prior experiments show that humans and machines seem to employ different approaches to speaker discrimination.
We use insights learnt from human perception to design a new training loss function that we refer to as "CllrCE loss".
CllrCE loss uses both speaker-specific idiosyncrasies and relative acoustic distances between speakers to train the ASV system.
arXiv Detail & Related papers (2022-06-28T01:24:38Z)
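The exact form of the CllrCE loss is in the paper above; as a hedged sketch of the likely ingredients, the snippet below combines the standard calibration cost Cllr over trial log-likelihood-ratio scores with a cross-entropy speaker-classification term. The weighting `alpha` and the way trial scores are formed are assumptions.

```python
# Sketch of a Cllr + cross-entropy combination, under assumed weighting.
import torch
import torch.nn.functional as F

def cllr(target_llrs: torch.Tensor, nontarget_llrs: torch.Tensor) -> torch.Tensor:
    # Cllr = 0.5 * (E_tar[log2(1 + e^-s)] + E_non[log2(1 + e^s)])
    log2 = torch.log(torch.tensor(2.0))
    c_tar = F.softplus(-target_llrs).mean() / log2
    c_non = F.softplus(nontarget_llrs).mean() / log2
    return 0.5 * (c_tar + c_non)

def cllr_ce_loss(logits, speaker_ids, target_llrs, nontarget_llrs, alpha=1.0):
    # Hypothetical combination: classification CE + calibration Cllr.
    return F.cross_entropy(logits, speaker_ids) + alpha * cllr(target_llrs, nontarget_llrs)
```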
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
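One standard device for recasting overlapped diarization as single-label prediction, assumed here for illustration (see the paper above for its exact scheme), is power-set encoding: every combination of active speakers becomes one class, so a frame with speakers {0, 2} active maps to a single integer label.

```python
# Power-set encoding: a set of active speakers <-> one class index (bit mask).
def powerset_encode(active: list[int]) -> int:
    """Map active speaker indices to a single class label."""
    label = 0
    for spk in active:
        label |= 1 << spk
    return label

def powerset_decode(label: int, num_speakers: int) -> list[int]:
    """Recover the active-speaker set from the class label."""
    return [s for s in range(num_speakers) if label & (1 << s)]

print(powerset_encode([0, 2]))  # 5
print(powerset_decode(5, 4))    # [0, 2]
```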
- Self-supervised Speaker Recognition Training Using Human-Machine Dialogues [22.262550043863445]
We investigate how to pretrain speaker recognition models by leveraging dialogues between customers and smart-speaker devices.
We propose an effective rejection mechanism that selectively learns from dialogues based on their acoustic homogeneity.
Experiments demonstrate that the proposed method provides significant performance improvements, superior to earlier work.
arXiv Detail & Related papers (2022-02-07T19:44:54Z)
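The entry above keeps only dialogues that are acoustically homogeneous. A plausible sketch of such a rule (assumed, not the paper's criterion): embed each utterance in the dialogue and accept the dialogue only if the mean pairwise cosine similarity of its utterance embeddings clears a threshold.

```python
# Illustrative homogeneity check over one dialogue's utterance embeddings.
import torch
import torch.nn.functional as F

def is_homogeneous(embeddings: torch.Tensor, threshold: float = 0.6) -> bool:
    """embeddings: (num_utterances, dim) speaker embeddings from one dialogue."""
    n = embeddings.shape[0]
    if n < 2:
        return True                                  # nothing to compare
    e = F.normalize(embeddings, dim=-1)
    sim = e @ e.T                                    # pairwise cosine similarities
    off_diag = sim[~torch.eye(n, dtype=torch.bool)]  # drop self-similarities
    return off_diag.mean().item() >= threshold
```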
- PL-EESR: Perceptual Loss Based End-to-End Robust Speaker Representation Extraction [90.55375210094995]
Speech enhancement aims to improve the perceptual quality of the speech signal by suppressing background noise.
We propose an end-to-end deep learning framework, dubbed PL-EESR, for robust speaker representation extraction.
arXiv Detail & Related papers (2021-10-03T07:05:29Z)
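A perceptual loss, as used by PL-EESR above, judges the enhanced signal by a pretrained speaker network rather than by waveform distance. A generic sketch of that idea follows, with the layer choice (final embedding) and MSE distance as assumptions.

```python
# Generic perceptual loss: match frozen SV-model activations of enhanced vs.
# clean input, so gradients shape the enhancer toward SV-relevant quality.
import torch
import torch.nn as nn

def perceptual_loss(sv_model: nn.Module, enhanced: torch.Tensor,
                    clean: torch.Tensor) -> torch.Tensor:
    """Distance between frozen SV embeddings of enhanced and clean input."""
    sv_model.eval()
    for p in sv_model.parameters():
        p.requires_grad_(False)        # freeze the SV network
    with torch.no_grad():
        target = sv_model(clean)       # reference activations, no gradient
    pred = sv_model(enhanced)          # gradients flow back to the enhancer
    return nn.functional.mse_loss(pred, target)
```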
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
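VQMIVC, above, quantizes the content representation with a codebook and additionally minimizes mutual information between the disentangled codes. The MI estimator (a variational bound in the paper) is beyond a short sketch; below is only the vector-quantization step with the usual straight-through gradient trick, as a generic illustration.

```python
# Generic VQ layer: snap each frame feature to its nearest codebook entry,
# with a straight-through estimator so gradients pass to the encoder.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 512, dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z: torch.Tensor):
        # z: (batch, frames, dim) continuous content features
        flat = z.reshape(-1, z.shape[-1])
        dists = torch.cdist(flat, self.codebook.weight)  # (B*T, num_codes)
        idx = dists.argmin(dim=-1)
        q = self.codebook(idx).view_as(z)                # nearest code vectors
        q_st = z + (q - z).detach()                      # straight-through copy
        commit_loss = torch.mean((z - q.detach()) ** 2)  # pull z toward codes
        return q_st, idx.view(z.shape[:-1]), commit_loss
```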
- FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention [66.77490220410249]
We propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0.
FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance.
This approach is trained with reconstruction loss only without any disentanglement considerations between content and speaker information.
arXiv Detail & Related papers (2020-10-27T09:21:03Z)
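The attention-based fusion in FragmentVC, above, can be pictured as cross-attention: source content features query the target speaker's frames, so each output frame is a blend of target fragments. A generic sketch follows; the dimensions and use of `nn.MultiheadAttention` are assumptions, and the actual architecture differs.

```python
# Generic cross-attention fusion: source content as queries over target frames.
import torch
import torch.nn as nn

dim = 256
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

src_content = torch.randn(1, 120, dim)  # e.g. Wav2Vec 2.0 features (queries)
tgt_frames = torch.randn(1, 300, dim)   # target-speaker features (keys/values)

fused, weights = attn(src_content, tgt_frames, tgt_frames)
print(fused.shape)  # torch.Size([1, 120, 256]) -> frames to decode into speech
```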
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.