UNISOUND System for VoxCeleb Speaker Recognition Challenge 2023
- URL: http://arxiv.org/abs/2308.12526v1
- Date: Thu, 24 Aug 2023 03:30:38 GMT
- Title: UNISOUND System for VoxCeleb Speaker Recognition Challenge 2023
- Authors: Yu Zheng, Yajun Zhang, Chuanying Niu, Yibin Zhan, Yanhua Long,
Dongxing Xu
- Abstract summary: This report describes the UNISOUND submission for Track 1 and Track 2 of the VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC 2023).
We submit the same system to Track 1 and Track 2, trained only on VoxCeleb2-dev.
We propose a consistency-aware score calibration method, which leverages the stability of audio voiceprints in similarity scoring via a Consistency Measure Factor (CMF).
- Score: 11.338256222745429
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This report describes the UNISOUND submission for Track 1 and Track 2 of
the VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC 2023). We submit the same
system to Track 1 and Track 2, trained only on VoxCeleb2-dev.
Large-scale ResNet and RepVGG architectures are developed for the challenge. We
propose a consistency-aware score calibration method, which leverages the
stability of audio voiceprints in similarity scoring via a Consistency Measure
Factor (CMF). The CMF brings a large performance gain in this challenge. Our final
system is a fusion of six models and achieves first place on Track 1 and
second place on Track 2 of VoxSRC 2023. The minDCF of our submission is 0.0855
and the EER is 1.5880%.
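
The abstract does not give the exact form of the CMF or the fusion weights; the sketch below is a minimal, hypothetical reading of the idea, assuming the CMF is the mean pairwise cosine similarity between speaker embeddings of sub-segments of an utterance (a proxy for voiceprint stability), that it scales the trial score multiplicatively, and that the six-model fusion is a simple weighted sum of per-model scores. The names consistency_measure_factor, calibrated_score, fuse_scores, and embed_fn are illustrative, not from the paper.

```python
import numpy as np

def consistency_measure_factor(waveform, embed_fn, num_chunks=4):
    # Hypothetical CMF: split the utterance into chunks, extract a speaker
    # embedding for each chunk with `embed_fn` (in the paper this would be a
    # large ResNet / RepVGG extractor), and measure how stable the voiceprint
    # is across chunks via the mean pairwise cosine similarity.
    chunks = np.array_split(np.asarray(waveform, dtype=float), num_chunks)
    embs = [np.asarray(embed_fn(c), dtype=float) for c in chunks]
    embs = [e / (np.linalg.norm(e) + 1e-12) for e in embs]
    sims = [float(np.dot(embs[i], embs[j]))
            for i in range(len(embs)) for j in range(i + 1, len(embs))]
    return float(np.mean(sims))

def calibrated_score(raw_score, enroll_wave, test_wave, embed_fn):
    # Consistency-aware calibration sketch: scale the raw trial score by the
    # CMFs of both the enrollment and the test utterance. The multiplicative
    # form is an assumption; the report only states that the CMF is used to
    # calibrate the similarity score.
    cmf_e = consistency_measure_factor(enroll_wave, embed_fn)
    cmf_t = consistency_measure_factor(test_wave, embed_fn)
    return cmf_e * cmf_t * raw_score

def fuse_scores(model_scores, weights=None):
    # Score-level fusion across models (the final submission fuses six
    # models); equal weights are an assumption.
    model_scores = np.asarray(model_scores, dtype=float)
    if weights is None:
        weights = np.full(len(model_scores), 1.0 / len(model_scores))
    return float(np.dot(weights, model_scores))

if __name__ == "__main__":
    # Toy stand-in for a real speaker-embedding extractor, so the sketch runs.
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((256, 16000))
    toy_embed = lambda chunk: proj[:, :chunk.shape[0]] @ chunk

    enroll = rng.standard_normal(64000)
    test = rng.standard_normal(64000)
    raw = 0.42  # illustrative raw cosine score from one model
    print(calibrated_score(raw, enroll, test, toy_embed))
    print(fuse_scores([0.42, 0.40, 0.45, 0.39, 0.44, 0.41]))
```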
Related papers
- The NPU-HWC System for the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge [12.862628838633396]
This paper presents the NPU-HWC system submitted to the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge (ICAGC).
Our system consists of two modules: a speech generator for Track 1 and a background audio generator for Track 2.
arXiv Detail & Related papers (2024-10-31T10:58:59Z) - The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report [180.94772271910315]
This paper reviews the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions.
The primary objective is to develop networks that optimize various aspects such as runtime, parameters, and FLOPs.
The challenge had 262 registered participants, and 34 teams made valid submissions.
arXiv Detail & Related papers (2024-04-16T07:26:20Z) - ICMC-ASR: The ICASSP 2024 In-Car Multi-Channel Automatic Speech
Recognition Challenge [94.13624830833314]
This challenge collects over 100 hours of multi-channel speech data recorded inside a new energy vehicle.
First-place team USTCiflytek achieves a CER of 13.16% in the ASR track and a cpCER of 21.48% in the ASDR track.
arXiv Detail & Related papers (2024-01-07T12:51:42Z) - MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes the representation of each modality by fusing them at different levels of the audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z) - ChinaTelecom System Description to VoxCeleb Speaker Recognition
Challenge 2023 [7.764294108093176]
Our system consists of several ResNet variants trained only on VoxCeleb2, which were later fused for better performance.
The final submission achieved minDCF of 0.1066 and EER of 1.980%.
arXiv Detail & Related papers (2023-08-16T07:21:01Z) - Towards single integrated spoofing-aware speaker verification embeddings [63.42889348690095]
This study aims to develop a single integrated spoofing-aware speaker verification (SASV) embedding.
We find that the inferior performance of single SASV embeddings stems from an insufficient amount of training data.
Experiments show dramatic improvements, achieving a SASV-EER of 1.06% on the evaluation protocol of the SASV2022 challenge.
arXiv Detail & Related papers (2023-05-30T14:15:39Z) - The ReturnZero System for VoxCeleb Speaker Recognition Challenge 2022 [0.0]
We describe the top-scoring submissions from team RTZR for the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22).
The top-performing system is a fusion of 7 models covering 3 different types of model architectures.
The final submission achieves 0.165 DCF and 2.912% EER on the VoxSRC22 test set.
arXiv Detail & Related papers (2022-09-21T06:54:24Z) - The Royalflush System for VoxCeleb Speaker Recognition Challenge 2022 [4.022057598291766]
We describe the Royalflush submissions for the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22).
For track 1, we develop a powerful U-Net-based speaker embedding extractor with a symmetric architecture.
For track 3, we employ the joint training of source domain supervision and target domain self-supervision to get a speaker embedding extractor.
arXiv Detail & Related papers (2022-09-19T13:35:36Z) - The Volcspeech system for the ICASSP 2022 multi-channel multi-party
meeting transcription challenge [18.33054364289739]
This paper describes our submission to ICASSP 2022 Multi-channel Multi-party Meeting Transcription (M2MeT) Challenge.
For Track 1, we propose several approaches to empower the clustering-based speaker diarization system.
For Track 2, we develop our system using the Conformer model in a joint CTC-attention architecture.
arXiv Detail & Related papers (2022-02-09T03:38:39Z) - NTIRE 2021 Multi-modal Aerial View Object Classification Challenge [88.89190054948325]
We introduce the first Challenge on Multi-modal Aerial View Object Classification (MAVOC) in conjunction with the NTIRE 2021 workshop at CVPR.
This challenge is composed of two different tracks using EO and SAR imagery.
We discuss the top methods submitted for this competition and evaluate their results on our blind test set.
arXiv Detail & Related papers (2021-07-02T16:55:08Z) - The AS-NU System for the M2VoC Challenge [49.12981125333458]
This paper describes the AS-NU systems for two tracks in the MultiSpeaker Multi-Style Voice Cloning Challenge (M2VoC).
The first track focuses on voice cloning with 100 target utterances, while the second track allows only 5 target utterances.
Due to the serious lack of data in the second track, we selected the speaker most similar to the target speaker from the training data of the TTS system, and used that speaker's utterances together with the given 5 target utterances to fine-tune our model.
arXiv Detail & Related papers (2021-04-07T09:26:20Z)