Cross-Lingual Speaker Verification with Domain-Balanced Hard Prototype
Mining and Language-Dependent Score Normalization
- URL: http://arxiv.org/abs/2007.07689v2
- Date: Mon, 10 Aug 2020 13:42:58 GMT
- Title: Cross-Lingual Speaker Verification with Domain-Balanced Hard Prototype
Mining and Language-Dependent Score Normalization
- Authors: Jenthe Thienpondt, Brecht Desplanques, Kris Demuynck
- Abstract summary: This paper describes the top-scoring IDLab submission for the Short-duration Speaker Verification (SdSV) Challenge 2020.
The main difficulty of the challenge lies in the large variation in phonetic overlap between the potentially cross-lingual trials.
We introduce domain-balanced hard prototype mining to fine-tune the state-of-the-art ECAPA-TDNN x-vector based speaker embedding extractor.
- Score: 14.83348592874271
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper we describe the top-scoring IDLab submission for the
text-independent task of the Short-duration Speaker Verification (SdSV)
Challenge 2020. The main difficulty of the challenge lies in the large variation
in phonetic overlap between the potentially cross-lingual trials, along
with the limited availability of in-domain DeepMine Farsi training data. We
introduce domain-balanced hard prototype mining to fine-tune the
state-of-the-art ECAPA-TDNN x-vector based speaker embedding extractor. The
sample mining technique efficiently exploits speaker distances between the
speaker prototypes of the popular AAM-softmax loss function to construct
challenging training batches that are balanced on the domain-level. To enhance
the scoring of cross-lingual trials, we propose a language-dependent s-norm
score normalization. The imposter cohort only contains data from the Farsi
target-domain which simulates the enrollment data always being Farsi. In case a
Gaussian-Backend language model detects the test speaker embedding to contain
English, a cross-language compensation offset determined on the AAM-softmax
speaker prototypes is subtracted from the maximum expected imposter mean score.
A fusion of five systems with minor topological tweaks resulted in a final
MinDCF and EER of 0.065 and 1.45% respectively on the SdSVC evaluation set.
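The two techniques described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the seed-plus-hardest-neighbour batch construction, and the way the compensation offset enters the normalization are assumptions made for the sketch.

```python
import numpy as np

def mine_domain_balanced_hard_batch(prototypes, domains, spk_per_batch=8):
    """Domain-balanced hard prototype mining (illustrative sketch).

    prototypes : (n_speakers, dim) L2-normalized AAM-softmax class centers
    domains    : (n_speakers,) integer domain label per speaker
    Seeds half of each domain's quota at random, then adds each seed's
    hardest (most similar) same-domain speaker, so batches are both
    challenging and balanced at the domain level.
    """
    sims = prototypes @ prototypes.T            # cosine similarity between speakers
    np.fill_diagonal(sims, -np.inf)             # a speaker never mines itself
    unique_domains = np.unique(domains)
    quota = spk_per_batch // len(unique_domains)
    batch = []
    for d in unique_domains:
        idx = np.where(domains == d)[0]
        seeds = np.random.choice(idx, quota // 2, replace=False)
        # hardest same-domain speaker for each seed
        hard = idx[np.argmax(sims[np.ix_(seeds, idx)], axis=1)]
        batch.extend(seeds.tolist() + hard.tolist())
    return batch

def language_dependent_s_norm(score, enroll_cohort, test_cohort,
                              test_is_english=False, offset=0.0):
    """Language-dependent s-norm with a Farsi-only imposter cohort (sketch).

    enroll_cohort / test_cohort hold each trial side's scores against the
    Farsi target-domain cohort. When a language detector flags the test
    embedding as English, the cross-language compensation offset is
    subtracted from the test-side imposter mean before normalization.
    """
    mu_e, sd_e = enroll_cohort.mean(), enroll_cohort.std()
    mu_t, sd_t = test_cohort.mean(), test_cohort.std()
    if test_is_english:
        mu_t -= offset                          # compensate cross-lingual mismatch
    return 0.5 * ((score - mu_e) / sd_e + (score - mu_t) / sd_t)
```

In this reading, the compensation offset lowers the imposter mean expected for an English test utterance, so cross-lingual target trials are not penalized for the language mismatch; the paper derives the offset from the AAM-softmax speaker prototypes.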
Related papers
- LLM-Based Evaluation of Low-Resource Machine Translation: A Reference-less Dialect Guided Approach with a Refined Sylheti-English Benchmark [1.3927943269211591]
We propose a comprehensive framework that enhances Large Language Model (LLM)-based machine translation evaluation.
We extend the ONUBAD dataset by incorporating Sylheti-English sentence pairs, corresponding machine translations, and Direct Assessment (DA) scores annotated by native speakers.
Our evaluation shows that the proposed pipeline consistently outperforms existing methods, achieving the highest gain of +0.1083 in Spearman correlation.
arXiv Detail & Related papers (2025-05-18T07:24:13Z) - Whisper Speaker Identification: Leveraging Pre-Trained Multilingual Transformers for Robust Speaker Embeddings [0.0]
We propose WSI (Whisper Speaker Identification), a framework that repurposes the Whisper automatic speech recognition model pre-trained on extensive multilingual data.
By capitalizing on Whisper's language-agnostic acoustic representations, our approach effectively distinguishes speakers across diverse languages.
arXiv Detail & Related papers (2025-03-13T15:11:28Z) - Bilingual Text-dependent Speaker Verification with Pre-trained Models for TdSV Challenge 2024 [0.0]
We present our submissions to the Iranian division of the Text-dependent Speaker Verification Challenge (TdSV) 2024.
TdSV aims to determine if a specific phrase was spoken by a target speaker.
For phrase verification, a phrase classifier rejected incorrect phrases, while for speaker verification, a pre-trained ResNet293 with domain adaptation extracted speaker embeddings.
Whisper-PMFA, a pre-trained ASR model adapted for speaker verification, falls short of the performance of pre-trained ResNets.
arXiv Detail & Related papers (2024-11-16T15:53:03Z) - OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion [88.59397418187226]
We propose a novel unified open-vocabulary detection method called OV-DINO.
It is pre-trained on diverse large-scale datasets with language-aware selective fusion in a unified framework.
We evaluate the performance of the proposed OV-DINO on popular open-vocabulary detection benchmarks.
arXiv Detail & Related papers (2024-07-10T17:05:49Z) - HyPoradise: An Open Baseline for Generative Speech Recognition with
Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
LLMs with a reasonable prompt and their generative capability can even correct tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z) - CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual
Labeled Sequence Translation [113.99145386490639]
Cross-lingual NER can transfer knowledge between languages via aligned cross-lingual representations or machine translation results.
We propose a Cross-lingual Entity Projection framework (CROP) to enable zero-shot cross-lingual NER.
We adopt a multilingual labeled sequence translation model to project the tagged sequence back to the target language and label the target raw sentence.
arXiv Detail & Related papers (2022-10-13T13:32:36Z) - Robustification of Multilingual Language Models to Real-world Noise with
Robust Contrastive Pretraining [14.087882550564169]
We assess the robustness of neural models on noisy data and find that existing robustness improvements are limited to the English language.
To benchmark the performance of pretrained multilingual models, we construct noisy datasets covering five languages and four NLP tasks.
We propose Robust Contrastive Pretraining (RCP) to boost the zero-shot cross-lingual robustness of multilingual pretrained models.
arXiv Detail & Related papers (2022-10-10T15:40:43Z) - Bridging Cross-Lingual Gaps During Leveraging the Multilingual
Sequence-to-Sequence Pretraining for Text Generation [80.16548523140025]
We extend the vanilla pretrain-finetune pipeline with extra code-switching restore task to bridge the gap between the pretrain and finetune stages.
Our approach could narrow the cross-lingual sentence representation distance and improve low-frequency word translation with trivial computational cost.
arXiv Detail & Related papers (2022-04-16T16:08:38Z) - Speaker Embedding-aware Neural Diarization: a Novel Framework for
Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z) - A Hierarchical Model for Spoken Language Recognition [29.948719321162883]
Spoken language recognition (SLR) refers to the automatic process used to determine the language present in a speech sample.
We propose a novel hierarchical approach where two PLDA models are trained: one to generate scores for clusters of highly related languages, and a second to generate scores conditional on each cluster.
We show that this hierarchical approach consistently outperforms the non-hierarchical one for detection of highly related languages.
arXiv Detail & Related papers (2022-01-04T22:10:36Z) - On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments.
We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
arXiv Detail & Related papers (2021-12-21T08:10:27Z) - Unsupervised Acoustic Unit Discovery by Leveraging a
Language-Independent Subword Discriminative Feature Representation [31.87235700253597]
This paper tackles automatically discovering phone-like acoustic units (AUD) from unlabeled speech data.
We propose a two-stage approach: the first stage learns a subword-discriminative feature representation and the second stage applies clustering to the learned representation and obtains phone-like clusters as the discovered acoustic units.
arXiv Detail & Related papers (2021-04-02T11:43:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.