Learning from human perception to improve automatic speaker verification
in style-mismatched conditions
- URL: http://arxiv.org/abs/2206.13684v1
- Date: Tue, 28 Jun 2022 01:24:38 GMT
- Title: Learning from human perception to improve automatic speaker verification
in style-mismatched conditions
- Authors: Amber Afshan, Abeer Alwan
- Abstract summary: Our prior experiments show that humans and machines seem to employ different approaches to speaker discrimination.
We use insights learnt from human perception to design a new training loss function that we refer to as "CllrCE loss".
CllrCE loss uses both speaker-specific idiosyncrasies and relative acoustic distances between speakers to train the ASV system.
- Score: 21.607777746331998
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Our prior experiments show that humans and machines seem to employ different
approaches to speaker discrimination, especially in the presence of speaking
style variability. The experiments examined read versus conversational speech.
Listeners focused on speaker-specific idiosyncrasies while "telling speakers
together", and on relative distances in a shared acoustic space when "telling
speakers apart". However, automatic speaker verification (ASV) systems use the
same loss function irrespective of target or non-target trials. To improve ASV
performance in the presence of style variability, insights learnt from human
perception are used to design a new training loss function that we refer to as
"CllrCE loss". CllrCE loss uses both speaker-specific idiosyncrasies and
relative acoustic distances between speakers to train the ASV system. When
using the UCLA speaker variability database, in the x-vector and conditioning
setups, CllrCE loss results in significant relative improvements in EER by
1-66%, and minDCF by 1-31% and 1-56%, respectively, when compared to the
x-vector baseline. Using the SITW evaluation tasks, which involve different
conversational speech tasks, the proposed loss combined with self-attention
conditioning results in significant relative improvements in EER by 2-5% and
minDCF by 6-12% over baseline. In the SITW case, performance improvements were
consistent only with conditioning.
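The abstract does not spell out the exact formulation of the CllrCE loss, but it names its two ingredients: a speaker-classification term that captures speaker-specific idiosyncrasies, and a trial-based term that scores relative acoustic distances between speakers. The sketch below combines the standard log-likelihood-ratio cost (Cllr) over in-batch trials with cross-entropy over speaker labels; the cosine scoring, the in-batch pair construction, and the weight `alpha` are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of a Cllr + cross-entropy ("CllrCE"-style) training objective.
# Assumptions: scaled cosine scores over all unique in-batch pairs serve as trial
# scores, and the two terms are combined as a simple weighted sum.
import math

import torch
import torch.nn.functional as F


def cllr(scores: torch.Tensor, is_target: torch.Tensor) -> torch.Tensor:
    """Log-likelihood-ratio cost (Cllr) over a batch of trial scores.

    scores    -- log-likelihood-ratio-like scores, shape (num_trials,)
    is_target -- boolean mask, True for target (same-speaker) trials
    """
    tar_cost = F.softplus(-scores[is_target]).mean()   # penalize low target scores
    non_cost = F.softplus(scores[~is_target]).mean()   # penalize high non-target scores
    return (tar_cost + non_cost) / (2.0 * math.log(2.0))


def cllr_ce_loss(embeddings, logits, labels, alpha=1.0, scale=10.0):
    """Weighted sum of speaker-classification cross-entropy and Cllr over all
    unique same/different-speaker pairs in the batch."""
    ce = F.cross_entropy(logits, labels)                 # speaker idiosyncrasies

    emb = F.normalize(embeddings, dim=1)
    scores = scale * (emb @ emb.t())                     # scaled cosine trial scores
    same_spk = labels.unsqueeze(0) == labels.unsqueeze(1)
    pairs = torch.triu(torch.ones_like(scores), diagonal=1).bool()  # unique pairs only

    # The batch must contain both same-speaker and different-speaker pairs.
    return ce + alpha * cllr(scores[pairs], same_spk[pairs])
```

In this sketch, target trials are penalized through softplus(-s) and non-target trials through softplus(s), so the score distributions are pushed apart in a way that loosely mirrors the "telling speakers together" versus "telling speakers apart" distinction described above.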
Related papers
- SKQVC: One-Shot Voice Conversion by K-Means Quantization with Self-Supervised Speech Representations [12.423959479216895]
One-shot voice conversion (VC) is a method that enables the transformation between any two speakers using only a single target speaker utterance.
Recent works utilizing K-means quantization (KQ) with self-supervised learning (SSL) features have proven capable of capturing content information from speech.
We propose a simple yet effective one-shot VC model that utilizes the characteristics of SSL features and speech attributes.
arXiv Detail & Related papers (2024-11-25T07:14:26Z)
- Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion [93.32354378820648]
We introduce MVSD, a mutual learning framework based on diffusion models.
MVSD considers the two tasks symmetrically, exploiting the reciprocal relationship to facilitate learning from inverse tasks.
Our framework can improve the performance of the reverberator and dereverberator.
arXiv Detail & Related papers (2024-07-15T00:47:56Z)
- Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation [71.31331402404662]
This paper proposes two novel data-efficient methods to learn dysarthric and elderly speaker-level features.
The first is speaker-regularized spectral basis embedding (SBE) features, which exploit a special regularization term to enforce homogeneity of speaker features during adaptation.
The second is feature-based learning hidden unit contributions (f-LHUC) conditioned on VR-LH features, which are shown to be insensitive to speaker-level data quantity in test-time adaptation.
arXiv Detail & Related papers (2024-07-08T18:20:24Z)
- SVVAD: Personal Voice Activity Detection for Speaker Verification [24.57668015470307]
We propose a speaker verification-based voice activity detection (SVVAD) framework that adapts the speech features according to which are most informative for speaker verification (SV).
Experiments show that SVVAD significantly outperforms the baseline in terms of equal error rate (EER) under conditions where other speakers are mixed at different ratios.
arXiv Detail & Related papers (2023-05-31T05:59:33Z)
- Acoustic-to-articulatory Inversion based on Speech Decomposition and Auxiliary Feature [7.363994037183394]
We pre-train a speech decomposition network to decompose audio speech into speaker embedding and content embedding.
We then propose a novel auxiliary feature network to estimate the lip auxiliary features from the personalized speech features.
Experimental results show that, compared with the state-of-the-art only using the audio speech feature, the proposed method reduces the average RMSE by 0.25 and increases the average correlation coefficient by 2.0%.
arXiv Detail & Related papers (2022-04-02T14:47:19Z)
- Learning Decoupling Features Through Orthogonality Regularization [55.79910376189138]
Keyword spotting (KWS) and speaker verification (SV) are two important tasks in speech applications.
We develop a two-branch deep network (KWS branch and SV branch) with the same network structure.
A novel decoupling feature learning method is proposed to push up the performance of KWS and SV simultaneously.
arXiv Detail & Related papers (2022-03-31T03:18:13Z)
- Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For Disordered Speech Recognition [57.15942628305797]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems for normal speech.
This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training.
The model is then cross-domain adapted to the 102.7-hour UASpeech corpus to produce articulatory features.
arXiv Detail & Related papers (2022-03-19T08:47:18Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- Bayesian Learning for Deep Neural Network Adaptation [57.70991105736059]
A key task for speech recognition systems is to reduce the mismatch between training and evaluation data that is often attributable to speaker differences.
Model-based speaker adaptation approaches often require sufficient amounts of target speaker data to ensure robustness.
This paper proposes a full Bayesian learning based DNN speaker adaptation framework to model speaker-dependent (SD) parameter uncertainty.
arXiv Detail & Related papers (2020-12-14T12:30:41Z)
- Variable frame rate-based data augmentation to handle speaking-style variability for automatic speaker verification [23.970866246001652]
The effects of speaking-style variability on automatic speaker verification were investigated using the UCLA Speaker Variability database.
We propose an entropy-based variable frame rate technique to artificially generate style-normalized representations for PLDA adaptation.
arXiv Detail & Related papers (2020-08-08T22:47:12Z)
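The last entry above, by the same authors, generates style-normalized representations with an entropy-based variable frame rate (VFR) scheme. As a rough illustration of that idea only: the sketch below uses frame-level spectral entropy with an accumulate-and-emit rule, both of which are assumptions; the paper's exact entropy criterion and thresholding may differ.

```python
# Illustrative sketch of entropy-based variable frame rate (VFR) frame selection.
# Assumptions: spectral entropy is the per-frame criterion, and a frame is emitted
# whenever the accumulated entropy crosses a fixed threshold.
import numpy as np


def spectral_entropy(frames_power: np.ndarray) -> np.ndarray:
    """Entropy of the normalized power spectrum per frame (num_frames, num_bins)."""
    p = frames_power / (frames_power.sum(axis=1, keepdims=True) + 1e-10)
    return -(p * np.log(p + 1e-10)).sum(axis=1)


def select_frames_vfr(frames_power: np.ndarray, threshold: float = 3.0) -> np.ndarray:
    """Return indices of retained frames: frames are emitted faster where entropy
    is high (spectrally rich regions) and slower where it is low."""
    entropy = spectral_entropy(frames_power)
    kept, acc = [], 0.0
    for i, h in enumerate(entropy):
        acc += h
        if acc >= threshold:
            kept.append(i)
            acc = 0.0
    return np.asarray(kept, dtype=int)
```

Because the accumulator resets on each emitted frame, regions with rapidly changing spectral content contribute more frames than quasi-stationary ones, which is one simple way to make frame statistics less dependent on speaking style.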