Bilingual Text-dependent Speaker Verification with Pre-trained Models for TdSV Challenge 2024
- URL: http://arxiv.org/abs/2411.10828v1
- Date: Sat, 16 Nov 2024 15:53:03 GMT
- Title: Bilingual Text-dependent Speaker Verification with Pre-trained Models for TdSV Challenge 2024
- Authors: Seyed Ali Farokh,
- Abstract summary: We present our submissions to the Iranian division of the Text-dependent Speaker Verification Challenge (TdSV) 2024.
TdSV aims to determine if a specific phrase was spoken by a target speaker.
For phrase verification, a phrase rejected incorrect phrases, while for speaker verification, a pre-trained ResNet293 with domain adaptation extracted speaker embeddings.
Whisper-PMFA, a pre-trained ASR model adapted for speaker verification, falls short of the performance of pre-trained ResNets.
- Score: 0.0
- License:
- Abstract: This paper presents our submissions to the Iranian division of the Text-dependent Speaker Verification Challenge (TdSV) 2024. TdSV aims to determine if a specific phrase was spoken by a target speaker. We developed two independent subsystems based on pre-trained models: For phrase verification, a phrase classifier rejected incorrect phrases, while for speaker verification, a pre-trained ResNet293 with domain adaptation extracted speaker embeddings for computing cosine similarity scores. In addition, we evaluated Whisper-PMFA, a pre-trained ASR model adapted for speaker verification, and found that, although it outperforms randomly initialized ResNets, it falls short of the performance of pre-trained ResNets, highlighting the importance of large-scale pre-training. The results also demonstrate that achieving competitive performance on TdSV without joint modeling of speaker and text is possible. Our best system achieved a MinDCF of 0.0358 on the evaluation subset and won the challenge.
Related papers
- The SVASR System for Text-dependent Speaker Verification (TdSV) AAIC Challenge 2024 [0.0]
The proposed system incorporates a Fast-Conformer-based ASR module to validate speech content.
For speaker verification, we propose a feature fusion approach that combines speaker embeddings extracted from wav2vec-BERT and ReNet models.
arXiv Detail & Related papers (2024-11-25T10:53:45Z) - Towards single integrated spoofing-aware speaker verification embeddings [63.42889348690095]
This study aims to develop a single integrated spoofing-aware speaker verification embeddings.
We analyze that the inferior performance of single SASV embeddings comes from insufficient amount of training data.
Experiments show dramatic improvements, achieving a SASV-EER of 1.06% on the evaluation protocol of the SASV2022 challenge.
arXiv Detail & Related papers (2023-05-30T14:15:39Z) - Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition [66.94463981654216]
We propose prompt tuning methods of Deep Neural Networks (DNNs) for speaker-adaptive Visual Speech Recognition (VSR)
We finetune prompts on adaptation data of target speakers instead of modifying the pre-trained model parameters.
The effectiveness of the proposed method is evaluated on both word- and sentence-level VSR databases.
arXiv Detail & Related papers (2023-02-16T06:01:31Z) - Pretraining Approaches for Spoken Language Recognition: TalTech
Submission to the OLR 2021 Challenge [0.0]
The paper is based on our submission to the Oriental Language Recognition 2021 Challenge.
For the constrained track, we first trained a Conformer-based encoder-decoder model for multilingual automatic speech recognition.
For the unconstrained task, we relied on both externally available pretrained models as well as external data.
arXiv Detail & Related papers (2022-05-14T15:17:08Z) - STC speaker recognition systems for the NIST SRE 2021 [56.05258832139496]
This paper presents a description of STC Ltd. systems submitted to the NIST 2021 Speaker Recognition Evaluation.
These systems consists of a number of diverse subsystems based on using deep neural networks as feature extractors.
For video modality we developed our best solution with RetinaFace face detector and deep ResNet face embeddings extractor trained on large face image datasets.
arXiv Detail & Related papers (2021-11-03T15:31:01Z) - SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text
Joint Pre-Training [33.02912456062474]
We build a single encoder with the BERT objective on unlabeled text together with the w2v-BERT objective on unlabeled speech.
We demonstrate that incorporating both speech and text data during pre-training can significantly improve downstream quality on CoVoST2 speech translation.
arXiv Detail & Related papers (2021-10-20T00:59:36Z) - A study on the efficacy of model pre-training in developing neural
text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z) - SVSNet: An End-to-end Speaker Voice Similarity Assessment Model [61.3813595968834]
We propose SVSNet, the first end-to-end neural network model to assess the speaker voice similarity between natural speech and synthesized speech.
The experimental results on the Voice Conversion Challenge 2018 and 2020 show that SVSNet notably outperforms well-known baseline systems.
arXiv Detail & Related papers (2021-07-20T10:19:46Z) - End-to-End Spoken Language Understanding for Generalized Voice
Assistants [15.241812584273886]
We present our approach to developing an E2E model for generalized speech recognition in commercial voice assistants (VAs)
We propose a fully differentiable, transformer-based, hierarchical system that can be pretrained at both the ASR and NLU levels.
This is then fine-tuned on both transcription and semantic classification losses to handle a diverse set of intent and argument combinations.
arXiv Detail & Related papers (2021-06-16T17:56:47Z) - Self-supervised Text-independent Speaker Verification using Prototypical
Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.