The xx205 System for the VoxCeleb Speaker Recognition Challenge 2020
- URL: http://arxiv.org/abs/2011.00200v1
- Date: Sat, 31 Oct 2020 06:36:26 GMT
- Title: The xx205 System for the VoxCeleb Speaker Recognition Challenge 2020
- Authors: Xu Xiang
- Abstract summary: This report describes the systems submitted to the first and second tracks of the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2020.
The best submitted systems achieve an EER of $3.808\%$ and a MinDCF of $0.1958$ in the closed-condition track 1, and an EER of $3.798\%$ and a MinDCF of $0.1942$ in the open-condition track 2, respectively.
- Score: 2.7920304852537536
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This report describes the systems submitted to the first and second tracks of
the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2020, which ranked second
in both tracks. Three key points of the system pipeline are explored: (1)
investigating multiple CNN architectures including ResNet, Res2Net and dual
path network (DPN) to extract the x-vectors, (2) using a composite angular
margin softmax loss to train the speaker models, and (3) applying score
normalization and system fusion to boost the performance. Measured on the
VoxSRC-20 Eval set, the best submitted systems achieve an EER of $3.808\%$ and
a MinDCF of $0.1958$ in the closed-condition track 1, and an EER of $3.798\%$
and a MinDCF of $0.1942$ in the open-condition track 2, respectively.
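
The three pipeline stages above lend themselves to short illustrations. First, a minimal sketch of x-vector extraction: a CNN trunk (standing in for the ResNet/Res2Net/DPN variants named in the abstract) produces frame-level features, and statistics pooling collapses them into a fixed-dimensional speaker embedding. The toy trunk, dimensions, and class names below are illustrative assumptions, not the authors' implementation.

```python
# Sketch of x-vector extraction via statistics pooling (illustrative only).
import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    """Concatenate per-utterance mean and std of frame-level features."""
    def forward(self, x):                     # x: (batch, channels, frames)
        return torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)

class ToyXVectorExtractor(nn.Module):
    """Placeholder CNN trunk standing in for ResNet/Res2Net/DPN."""
    def __init__(self, feat_dim=80, channels=512, embed_dim=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv1d(feat_dim, channels, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.pool = StatsPooling()
        self.embedding = nn.Linear(2 * channels, embed_dim)

    def forward(self, feats):                 # feats: (batch, feat_dim, frames)
        return self.embedding(self.pool(self.trunk(feats)))

x = torch.randn(4, 80, 300)                   # 4 utterances, 80-dim fbanks, 300 frames
print(ToyXVectorExtractor()(x).shape)         # torch.Size([4, 256])
```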
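Second, a composite angular margin softmax plausibly combines an additive angular margin (ArcFace-style, $m_1$) with an additive cosine margin (CosFace/AM-softmax-style, $m_2$) on the target logit, i.e. $\cos(\theta + m_1) - m_2$. The abstract does not spell out the exact formulation, so this combination and the hyper-parameters below are assumptions.

```python
# Hedged sketch of a composite angular margin softmax loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompositeAngularMarginSoftmax(nn.Module):
    def __init__(self, embed_dim, num_speakers, s=32.0, m1=0.2, m2=0.1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_speakers, embed_dim))
        self.s, self.m1, self.m2 = s, m1, m2

    def forward(self, embeddings, labels):
        # Cosine similarity between L2-normalised embeddings and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Penalised target logit: cos(theta + m1) - m2.
        target = torch.cos(theta + self.m1) - self.m2
        one_hot = F.one_hot(labels, cosine.size(1)).bool()
        logits = torch.where(one_hot, target, cosine)
        return F.cross_entropy(self.s * logits, labels)

# Illustrative usage with hypothetical sizes (256-dim embeddings, 6000 speakers).
loss_fn = CompositeAngularMarginSoftmax(embed_dim=256, num_speakers=6000)
loss = loss_fn(torch.randn(4, 256), torch.randint(0, 6000, (4,)))
print(loss.item())
```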
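Third, a sketch of score normalization and fusion: adaptive symmetric normalization (AS-norm) against a cohort is the usual realisation of this step, and fusion is shown as a weighted sum of per-system scores. The cohort construction, top-N size, and fusion weights here are illustrative, not the authors' exact recipe.

```python
# Sketch of adaptive s-norm and weighted-sum score fusion (illustrative only).
import numpy as np

def as_norm(score, enroll_cohort_scores, test_cohort_scores, top_n=300):
    """Normalise a trial score against the top-N closest cohort scores."""
    e = np.sort(enroll_cohort_scores)[-top_n:]   # enrolment side vs. cohort
    t = np.sort(test_cohort_scores)[-top_n:]     # test side vs. cohort
    return 0.5 * ((score - e.mean()) / e.std() +
                  (score - t.mean()) / t.std())

def fuse(system_scores, weights):
    """Weighted-sum fusion of per-system (already normalised) trial scores."""
    return sum(w * s for w, s in zip(weights, system_scores))

rng = np.random.default_rng(0)
print(as_norm(0.7, rng.normal(0, 0.1, 1000), rng.normal(0, 0.1, 1000)))
print(fuse([0.7, 0.65, 0.8], [0.4, 0.3, 0.3]))
```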
Related papers
- The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023 [67.11294606070278]
This paper delineates the visual speech recognition (VSR) system introduced by the NPU-ASLP-LiAuto (Team 237) in the first Chinese Continuous Visual Speech Recognition Challenge (CNVSRC) 2023.
In terms of data processing, we leverage the lip motion extractor from the baseline to produce multi-scale video data.
Various augmentation techniques are applied during training, encompassing speed perturbation, random rotation, horizontal flipping, and color transformation.
arXiv Detail & Related papers (2024-01-07T14:20:52Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes the representation of each modality by fusing them at different levels of the audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- UNISOUND System for VoxCeleb Speaker Recognition Challenge 2023 [11.338256222745429]
This report describes the UNISOUND submission for Track 1 and Track 2 of the VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC 2023).
We submit the same system to Track 1 and Track 2; it is trained with only VoxCeleb2-dev.
We propose a consistency-aware score calibration method, which leverages the stability of audio voiceprints in the similarity score via a Consistency Measure Factor (CMF).
arXiv Detail & Related papers (2023-08-24T03:30:38Z)
- The DKU-DUKEECE System for the Manipulation Region Location Task of ADD 2023 [12.69800199589029]
This paper introduces our system designed for Track 2 of the Audio Deepfake Detection Challenge (ADD 2023).
Our top-performing solution achieves an impressive 82.23% sentence accuracy and an F1 score of 60.66%.
This results in a final ADD score of 0.6713, securing the first rank in Track 2 of ADD 2023.
arXiv Detail & Related papers (2023-08-20T14:29:04Z)
- The Royalflush System for VoxCeleb Speaker Recognition Challenge 2022 [4.022057598291766]
We describe the Royalflush submissions for the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22).
For track 1, we develop a powerful U-Net-based speaker embedding extractor with a symmetric architecture.
For track 3, we employ the joint training of source domain supervision and target domain self-supervision to get a speaker embedding extractor.
arXiv Detail & Related papers (2022-09-19T13:35:36Z)
- Two-pass Decoding and Cross-adaptation Based System Combination of End-to-end Conformer and Hybrid TDNN ASR Systems [61.90743116707422]
This paper investigates multi-pass rescoring and cross-adaptation based system combination approaches for hybrid TDNN and Conformer E2E ASR systems.
The best combined system, obtained using multi-pass rescoring, produced statistically significant word error rate (WER) reductions of 2.5% to 3.9% absolute (22.5% to 28.9% relative) over the stand-alone Conformer system on the NIST Hub5'00, Rt03 and Rt02 evaluation data.
arXiv Detail & Related papers (2022-06-23T10:17:13Z)
- STC speaker recognition systems for the NIST SRE 2021 [56.05258832139496]
This paper presents a description of STC Ltd. systems submitted to the NIST 2021 Speaker Recognition Evaluation.
These systems consist of a number of diverse subsystems that use deep neural networks as feature extractors.
For the video modality, we developed our best solution with the RetinaFace face detector and a deep ResNet face embedding extractor trained on large face image datasets.
arXiv Detail & Related papers (2021-11-03T15:31:01Z)
- Disentangle Your Dense Object Detector [82.22771433419727]
Deep learning-based dense object detectors have achieved great success in the past few years and have been applied to numerous multimedia applications such as video understanding.
However, the current training pipeline for dense detectors is built on many conjunctions that may not hold.
We propose Disentangled Dense Object Detector (DDOD), in which simple and effective disentanglement mechanisms are designed and integrated into the current state-of-the-art detectors.
arXiv Detail & Related papers (2021-07-07T00:52:16Z)
- ShaneRun System Description to VoxCeleb Speaker Recognition Challenge 2020 [3.0712335337791288]
We describe the submission of ShaneRun's team to the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2020.
We use ResNet-34 as the encoder to extract speaker embeddings, following the open-source voxceleb-trainer.
The final submitted system achieved a minDCF of 0.3098 and an EER of 5.076% on the fixed data track, outperforming the baseline by 1.3% in minDCF and 2.2% in EER, respectively.
arXiv Detail & Related papers (2020-11-03T07:26:21Z)
- A Two-Stage Approach to Device-Robust Acoustic Scene Classification [63.98724740606457]
A two-stage system based on fully convolutional neural networks (CNNs) is proposed to improve device robustness.
Our results show that the proposed ASC system attains a state-of-the-art accuracy on the development set.
Neural saliency analysis with class activation mapping gives new insights into the patterns learned by our models.
arXiv Detail & Related papers (2020-11-03T03:27:18Z)