The Royalflush System for VoxCeleb Speaker Recognition Challenge 2022
- URL: http://arxiv.org/abs/2209.09010v2
- Date: Tue, 20 Sep 2022 12:46:18 GMT
- Title: The Royalflush System for VoxCeleb Speaker Recognition Challenge 2022
- Authors: Jingguang Tian, Xinhui Hu, Xinkang Xu
- Abstract summary: We describe the Royalflush submissions for the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22)
For track 1, we develop a powerful U-Net-based speaker embedding extractor with a symmetric architecture.
For track 3, we employ the joint training of source domain supervision and target domain self-supervision to get a speaker embedding extractor.
- Score: 4.022057598291766
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this technical report, we describe the Royalflush submissions for the
VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22). Our submissions
contain track 1, which is for supervised speaker verification, and track 3,
which is for semi-supervised speaker verification. For track 1, we develop a
powerful U-Net-based speaker embedding extractor with a symmetric architecture.
The proposed system achieves 2.06% in EER and 0.1293 in MinDCF on the
validation set. Compared with the state-of-the-art ECAPA-TDNN, it obtains a
relative improvement of 20.7% in EER and 22.70% in MinDCF. For track 3, we
employ the joint training of source domain supervision and target domain
self-supervision to get a speaker embedding extractor. The subsequent
clustering process can obtain target domain pseudo-speaker labels. We adapt the
speaker embedding extractor using all source and target domain data in a
supervised manner, so that it can fully leverage information from both domains.
Moreover, clustering and supervised domain adaptation can be repeated until the
performance converges on the validation set. Our final submission is a fusion
of 10 models and achieves 7.75% EER and 0.3517 MinDCF on the validation set.
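The track-3 procedure above (cluster target-domain embeddings into pseudo-speaker labels, adapt the extractor in a supervised manner, repeat until the validation metric converges) can be sketched as a simple loop. This is a hypothetical illustration only: the function names, the toy k-means clustering, and the stubbed adaptation step are placeholders, not the authors' actual system.

```python
# Hedged sketch of the iterative pseudo-labeling loop described in the
# abstract. The real system would retrain a neural speaker embedding
# extractor on source labels plus target pseudo-labels each round; here a
# comment marks that step and a minimal k-means stands in for clustering.
import random

def cluster(embeddings, k, iters=10):
    """Toy k-means over 2-D points; returns one pseudo-label per embedding."""
    centers = random.sample(embeddings, k)
    labels = [0] * len(embeddings)
    for _ in range(iters):
        # assign each embedding to its nearest center
        for i, e in enumerate(embeddings):
            labels[i] = min(
                range(k),
                key=lambda c: (e[0] - centers[c][0]) ** 2 + (e[1] - centers[c][1]) ** 2,
            )
        # recompute each center as the mean of its assigned embeddings
        for c in range(k):
            members = [e for e, l in zip(embeddings, labels) if l == c]
            if members:
                centers[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels

def adapt_until_converged(target_embeddings, k, validate, max_rounds=5, tol=1e-4):
    """Alternate clustering and (stubbed) supervised adaptation until the
    validation metric stops improving, as the abstract describes."""
    best = float("inf")
    labels = None
    for _ in range(max_rounds):
        labels = cluster(target_embeddings, k)
        # In the real system, a supervised adaptation pass using all source
        # and target domain data (with these pseudo-labels) would update the
        # embedding extractor here before re-validating.
        metric = validate(labels)
        if best - metric < tol:
            break  # performance converged on the validation set
        best = metric
    return labels, best
```

The loop terminates either when the validation metric stops improving by more than `tol` or after `max_rounds` iterations, mirroring the "repeated until the performance converges" criterion in the abstract.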
Related papers
- Exploring WavLM Back-ends for Speech Spoofing and Deepfake Detection [0.0]
ASVspoof 5 Challenge Track 1: Speech Deepfake Detection - Open Condition consists of a stand-alone speech deepfake (bonafide vs spoof) detection task.
We leverage a pre-trained WavLM as a front-end model and pool its representations with different back-end techniques.
Our fused system achieves 0.0937 minDCF, 3.42% EER, 0.1927 Cllr, and 0.1375 actDCF.
arXiv Detail & Related papers (2024-09-08T08:54:36Z)
- TOLD: A Novel Two-Stage Overlap-Aware Framework for Speaker Diarization [54.41494515178297]
We reformulate speaker diarization as a single-label classification problem.
We propose the overlap-aware EEND (EEND-OLA) model, in which speaker overlaps and dependency can be modeled explicitly.
Compared with the original EEND, the proposed EEND-OLA achieves a 14.39% relative improvement in terms of diarization error rates.
arXiv Detail & Related papers (2023-03-08T05:05:26Z)
- Exploring Self-supervised Pre-trained ASR Models For Dysarthric and Elderly Speech Recognition [57.31233839489528]
This paper explores approaches to integrate domain adapted SSL pre-trained models into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition.
arXiv Detail & Related papers (2023-02-28T13:39:17Z)
- The SpeakIn Speaker Verification System for Far-Field Speaker Verification Challenge 2022 [15.453882034529913]
This paper describes the speaker verification systems submitted to the Far-Field Speaker Verification Challenge 2022 (FFSVC2022).
The ResNet-based and RepVGG-based architectures were developed for this challenge.
Our approach leads to excellent performance and ranks 1st in both challenge tasks.
arXiv Detail & Related papers (2022-09-23T14:51:55Z)
- Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use a wav2vec pre-trained model to obtain a high-level representation of the speech.
For the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS.
arXiv Detail & Related papers (2022-08-20T06:46:55Z)
- Robust Speaker Recognition with Transformers Using wav2vec 2.0 [7.419725234099729]
This paper presents an investigation of using wav2vec 2.0 deep speech representations for the speaker recognition task.
It is concluded that the Contrastive Predictive Coding pretraining scheme efficiently utilizes the power of unlabeled data.
arXiv Detail & Related papers (2022-03-28T20:59:58Z)
- The Volcspeech system for the ICASSP 2022 multi-channel multi-party meeting transcription challenge [18.33054364289739]
This paper describes our submission to ICASSP 2022 Multi-channel Multi-party Meeting Transcription (M2MeT) Challenge.
For Track 1, we propose several approaches to empower the clustering-based speaker diarization system.
For Track 2, we develop our system using the Conformer model in a joint CTC-attention architecture.
arXiv Detail & Related papers (2022-02-09T03:38:39Z)
- STC speaker recognition systems for the NIST SRE 2021 [56.05258832139496]
This paper presents a description of STC Ltd. systems submitted to the NIST 2021 Speaker Recognition Evaluation.
These systems consist of a number of diverse subsystems that use deep neural networks as feature extractors.
For video modality we developed our best solution with RetinaFace face detector and deep ResNet face embeddings extractor trained on large face image datasets.
arXiv Detail & Related papers (2021-11-03T15:31:01Z)
- The xx205 System for the VoxCeleb Speaker Recognition Challenge 2020 [2.7920304852537536]
This report describes the systems submitted to the first and second tracks of the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2020.
The best submitted systems achieve an EER of 3.808% and a MinDCF of 0.1958 in the closed-condition track 1, and an EER of 3.798% and a MinDCF of 0.1942 in the open-condition track 2, respectively.
arXiv Detail & Related papers (2020-10-31T06:36:26Z)
- Device-Robust Acoustic Scene Classification Based on Two-Stage Categorization and Data Augmentation [63.98724740606457]
We present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge.
Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten different fine-grained classes.
Task 1b concerns the classification of data into three higher-level classes using low-complexity solutions.
arXiv Detail & Related papers (2020-07-16T15:07:14Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the system quality degradation on short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
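EER (equal error rate) and MinDCF recur across the systems listed above. As a point of reference, a minimal, illustrative computation of EER from verification scores is sketched below; this is a threshold-sweep approximation for illustration, not any of these systems' actual evaluation code.

```python
# Hedged sketch: EER is the operating point where the false-acceptance rate
# (impostor trials accepted) equals the false-rejection rate (target trials
# rejected). We sweep every observed score as a candidate threshold and
# return the rate at the point where the two error rates are closest.

def eer(target_scores, impostor_scores):
    """Approximate EER over all candidate thresholds drawn from the scores."""
    best_gap, best_rate = float("inf"), 1.0
    for t in sorted(target_scores + impostor_scores):
        # false rejections: target trials scoring below the threshold
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # false acceptances: impostor trials scoring at or above it
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        if abs(far - frr) < best_gap:
            best_gap, best_rate = abs(far - frr), (far + frr) / 2
    return best_rate
```

For perfectly separated score distributions the function returns 0.0; overlapping distributions yield a rate between 0 and 1, and a lower EER indicates better verification performance.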