Variable frame rate-based data augmentation to handle speaking-style
variability for automatic speaker verification
- URL: http://arxiv.org/abs/2008.03616v1
- Date: Sat, 8 Aug 2020 22:47:12 GMT
- Title: Variable frame rate-based data augmentation to handle speaking-style
variability for automatic speaker verification
- Authors: Amber Afshan, Jinxi Guo, Soo Jin Park, Vijay Ravi, Alan McCree, and
Abeer Alwan
- Abstract summary: The effects of speaking-style variability on automatic speaker verification were investigated using the UCLA Speaker Variability database.
We propose an entropy-based variable frame rate technique to artificially generate style-normalized representations for PLDA adaptation.
- Score: 23.970866246001652
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The effects of speaking-style variability on automatic speaker verification
were investigated using the UCLA Speaker Variability database which comprises
multiple speaking styles per speaker. An x-vector/PLDA (probabilistic linear
discriminant analysis) system was trained with the SRE and Switchboard
databases with standard augmentation techniques and evaluated with utterances
from the UCLA database. The equal error rate (EER) was low when enrollment and
test utterances were of the same style (e.g., 0.98% and 0.57% for read and
conversational speech, respectively), but it increased substantially when
styles were mismatched between enrollment and test utterances. For instance,
when enrolled with conversation utterances, the EER increased to 3.03%, 2.96%
and 22.12% when tested on read, narrative, and pet-directed speech,
respectively. To reduce the effect of style mismatch, we propose an
entropy-based variable frame rate technique to artificially generate
style-normalized representations for PLDA adaptation. The proposed system
significantly improved performance. In the aforementioned conditions, the EERs
improved to 2.69% (conversation -- read), 2.27% (conversation -- narrative),
and 18.75% (pet-directed -- read). Overall, the proposed technique performed
comparably to multi-style PLDA adaptation without the need for training data in
different speaking styles per speaker.
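Below is a minimal sketch of the entropy-based variable frame rate selection the abstract proposes; the spectral-entropy definition and the threshold value are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def spectral_entropy(mag_spec):
    """Entropy of each frame's normalized magnitude spectrum.
    mag_spec: (n_frames, n_bins), e.g. |STFT| of the utterance."""
    p = mag_spec / (mag_spec.sum(axis=1, keepdims=True) + 1e-10)
    return -(p * np.log(p + 1e-10)).sum(axis=1)

def vfr_select(features, mag_spec, threshold=3.0):
    """Accumulate per-frame entropy and keep a frame each time the
    running total crosses `threshold`: steady (low-entropy) regions
    are thinned out while transient (high-entropy) regions are kept,
    normalizing the apparent speaking rate across styles."""
    entropy = spectral_entropy(mag_spec)
    kept, running = [], 0.0
    for i, e in enumerate(entropy):
        running += e
        if running >= threshold:
            kept.append(i)
            running = 0.0
    return features[kept]
```

Running the selection at different thresholds produces pseudo-utterances at different effective speaking rates, which is the kind of style-normalized representation the abstract feeds to PLDA adaptation.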
Related papers
- SKQVC: One-Shot Voice Conversion by K-Means Quantization with Self-Supervised Speech Representations [12.423959479216895]
One-shot voice conversion (VC) is a method that enables the transformation between any two speakers using only a single target speaker utterance.
Recent works utilizing K-means quantization (KQ) with self-supervised learning (SSL) features have proven capable of capturing content information from speech.
We propose a simple yet effective one-shot VC model that utilizes the characteristics of SSL features and speech attributes.
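A minimal sketch of the K-means quantization step on SSL features; the feature arrays, codebook size, and dimensions are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for SSL features (n_frames x dim), e.g. from a pretrained
# WavLM/HuBERT encoder; random data is used here for illustration.
corpus_feats = np.random.randn(20000, 768).astype(np.float32)

# Codebook learned over a corpus; 512 clusters is an assumption.
kmeans = KMeans(n_clusters=512, n_init=10, random_state=0).fit(corpus_feats)

def quantize(ssl_feats):
    """Snap each frame to its nearest centroid: the quantized stream
    keeps content, while the residual (ssl_feats - quantized) retains
    speaker and prosody detail discarded by quantization."""
    ids = kmeans.predict(ssl_feats)
    return kmeans.cluster_centers_[ids]
```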
arXiv Detail & Related papers (2024-11-25T07:14:26Z)
- Speechworthy Instruction-tuned Language Models [71.8586707840169]
We show that both prompting and preference learning increase the speech-suitability of popular instruction-tuned LLMs.
We share lexical, syntactical, and qualitative analyses to showcase how each method contributes to improving the speech-suitability of generated responses.
arXiv Detail & Related papers (2024-09-23T02:34:42Z)
- Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation [71.31331402404662]
This paper proposes two novel data-efficient methods to learn dysarthric and elderly speaker-level features.
Variance-regularized spectral basis embedding (VR-SBE) features exploit a special regularization term to enforce homogeneity of speaker features during adaptation.
Feature-based learning hidden unit contributions (f-LHUC) are conditioned on VR-SBE features and are shown to be insensitive to speaker-level data quantity in test-time adaptation.
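A hedged sketch of the f-LHUC idea, with per-hidden-unit amplitude scales in (0, 2) predicted on the fly from a speaker feature vector; the dimensions and the single linear predictor are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FeatureLHUC(nn.Module):
    """Feature-based LHUC: hidden-unit scales are predicted from a
    speaker feature vector (e.g., an SBE embedding) instead of being
    learned per speaker, enabling on-the-fly test-time adaptation."""
    def __init__(self, hidden_dim=1024, spk_feat_dim=100):
        super().__init__()
        self.predict_r = nn.Linear(spk_feat_dim, hidden_dim)

    def forward(self, hidden, spk_feat):
        # Classic LHUC scaling: alpha = 2 * sigmoid(r), range (0, 2).
        alpha = 2.0 * torch.sigmoid(self.predict_r(spk_feat))
        return hidden * alpha
```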
arXiv Detail & Related papers (2024-07-08T18:20:24Z)
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments on the VoxCeleb and SITW datasets, with 9.56% and 8.24% average reductions in EER and minDCF, respectively.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- Learning from human perception to improve automatic speaker verification in style-mismatched conditions [21.607777746331998]
Our prior experiments show that humans and machines seem to employ different approaches to speaker discrimination.
We use insights learnt from human perception to design a new training loss function that we refer to as "CllrCE loss".
CllrCE loss uses both speaker-specific idiosyncrasies and relative acoustic distances between speakers to train the ASV system.
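The exact form of the CllrCE loss is not given in this summary; the sketch below combines the standard Cllr objective on verification trial scores with a speaker-ID cross-entropy term, and the equal weighting is an assumption.

```python
import math
import torch
import torch.nn.functional as F

def cllr(llr_scores, trial_labels):
    """Cllr on log-likelihood-ratio scores; trial_labels is 1 for
    target trials and 0 for non-target trials."""
    tgt = llr_scores[trial_labels == 1]
    non = llr_scores[trial_labels == 0]
    return (F.softplus(-tgt).mean() + F.softplus(non).mean()) / (2 * math.log(2))

def cllr_ce_loss(llr_scores, trial_labels, spk_logits, spk_ids, weight=1.0):
    # Cllr captures relative acoustic distances between speakers;
    # cross-entropy captures speaker-specific idiosyncrasies.
    return cllr(llr_scores, trial_labels) + weight * F.cross_entropy(spk_logits, spk_ids)
```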
arXiv Detail & Related papers (2022-06-28T01:24:38Z)
- Attention-based conditioning methods using variable frame rate for style-robust speaker verification [21.607777746331998]
We propose an approach to extract speaker embeddings robust to speaking style variations in text-independent speaker verification.
An entropy-based variable frame rate vector is proposed as an external conditioning vector for the self-attention layer.
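A minimal sketch of feeding an external conditioning vector (standing in for the entropy-based VFR vector) to a self-attention layer; the concatenation-based fusion and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConditionedSelfAttention(nn.Module):
    """Self-attention over frame features with a per-utterance
    conditioning vector appended to every frame before projection."""
    def __init__(self, feat_dim=256, cond_dim=32, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim + cond_dim, feat_dim)
        self.attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)

    def forward(self, x, cond):
        # x: (batch, frames, feat_dim); cond: (batch, cond_dim)
        c = cond.unsqueeze(1).expand(-1, x.size(1), -1)
        h = self.proj(torch.cat([x, c], dim=-1))
        out, _ = self.attn(h, h, h)
        return out
```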
arXiv Detail & Related papers (2022-06-28T01:14:09Z)
- On-the-Fly Feature Based Rapid Speaker Adaptation for Dysarthric and Elderly Speech Recognition [53.17176024917725]
Scarcity of speaker-level data limits the practical use of data-intensive model based speaker adaptation methods.
This paper proposes two novel forms of data-efficient, feature-based on-the-fly speaker adaptation methods.
arXiv Detail & Related papers (2022-03-28T09:12:24Z)
- Investigation of Data Augmentation Techniques for Disordered Speech Recognition [69.50670302435174]
This paper investigates a set of data augmentation techniques for disordered speech recognition.
Both normal and disordered speech were exploited in the augmentation process.
The final speaker-adapted system, constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation, produced up to a 2.92% absolute word error rate (WER) reduction.
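Speed perturbation itself is a standard augmentation; below is a minimal sketch via rational resampling, using the conventional Kaldi-style 3-way factors (not settings specific to this paper).

```python
import numpy as np
from scipy.signal import resample_poly

def speed_perturb(wave, factor):
    """Change speaking rate (and pitch) by resampling: factor 0.9
    slows speech down, 1.1 speeds it up, 1.0 is a no-op."""
    # Approximate each factor with a small rational up/down ratio;
    # output length scales by 1/factor.
    up, down = {0.9: (10, 9), 1.0: (1, 1), 1.1: (10, 11)}[factor]
    return resample_poly(wave, up, down)

# Typical 3-way augmentation of one utterance:
# augmented = [speed_perturb(wave, f) for f in (0.9, 1.0, 1.1)]
```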
arXiv Detail & Related papers (2022-01-14T17:09:22Z)
- GC-TTS: Few-shot Speaker Adaptation with Geometric Constraints [36.07346889498981]
We propose GC-TTS which achieves high-quality speaker adaptation with significantly improved speaker similarity.
A TTS model is pre-trained for base speakers with a sufficient amount of data, and then fine-tuned for novel speakers on a few minutes of data with two geometric constraints.
The experimental results demonstrate that GC-TTS generates high-quality speech from only a few minutes of training data, outperforming standard techniques in terms of speaker similarity to the target speaker.
arXiv Detail & Related papers (2021-08-16T04:25:31Z)
- Bayesian Learning for Deep Neural Network Adaptation [57.70991105736059]
A key task for speech recognition systems is to reduce the mismatch between training and evaluation data that is often attributable to speaker differences.
Model-based speaker adaptation approaches often require sufficient amounts of target speaker data to ensure robustness.
This paper proposes a full Bayesian learning based DNN speaker adaptation framework to model speaker-dependent (SD) parameter uncertainty.
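A generic sketch of modeling weight uncertainty with a Gaussian posterior and the reparameterization trick; this illustrates the broad idea of Bayesian DNN adaptation, not the paper's exact variational formulation.

```python
import torch
import torch.nn as nn

class BayesianLinear(nn.Module):
    """Linear layer with a factorized Gaussian posterior over weights;
    sampling at each forward pass propagates parameter uncertainty."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(d_out, d_in))
        self.log_sigma = nn.Parameter(torch.full((d_out, d_in), -3.0))

    def forward(self, x):
        eps = torch.randn_like(self.mu)
        w = self.mu + self.log_sigma.exp() * eps  # reparameterization
        return x @ w.t()

    def kl(self):
        # KL(q(w) || N(0, 1)), added to the training loss.
        s2 = (2 * self.log_sigma).exp()
        return 0.5 * (s2 + self.mu**2 - 1 - 2 * self.log_sigma).sum()
```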
arXiv Detail & Related papers (2020-12-14T12:30:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.