The SpeakIn Speaker Verification System for Far-Field Speaker
Verification Challenge 2022
- URL: http://arxiv.org/abs/2209.11625v1
- Date: Fri, 23 Sep 2022 14:51:55 GMT
- Title: The SpeakIn Speaker Verification System for Far-Field Speaker
Verification Challenge 2022
- Authors: Yu Zheng, Jinghan Peng, Yihao Chen, Yajun Zhang, Jialong Wang, Min
Liu, Minqiang Xu
- Abstract summary: This paper describes speaker verification systems submitted to the Far-Field Speaker Verification Challenge 2022 (FFSVC2022).
The ResNet-based and RepVGG-based architectures were developed for this challenge.
Our approach leads to excellent performance and ranks 1st in both challenge tasks.
- Score: 15.453882034529913
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper describes the speaker verification (SV) systems submitted by the
SpeakIn team to Task 1 and Task 2 of the Far-Field Speaker Verification
Challenge 2022 (FFSVC2022). The SV tasks of the challenge focus on fully
supervised far-field speaker verification (Task 1) and semi-supervised
far-field speaker verification (Task 2). For Task 1, we used the VoxCeleb and
FFSVC2020 datasets as training data; for Task 2, we used only the VoxCeleb
dataset as the training set. ResNet-based and RepVGG-based architectures were
developed for this challenge. Global statistics pooling and MQMHA pooling were
used to aggregate frame-level features across time into utterance-level
representations, and the resulting embeddings were classified with AM-Softmax
and AAM-Softmax losses. We also propose a staged transfer learning method: in
the pre-training stage the speaker classification weights are reserved but
receive no positive samples, and in the second stage these weights are
fine-tuned with both positive and negative samples. Compared with the
traditional transfer learning strategy, this approach yields better model
performance. The Sub-Mean and AS-Norm back-end methods were used to address
domain mismatch. In the fusion stage, three models were fused for Task 1 and
two models for Task 2. On the FFSVC2022 leaderboard, our Task 1 submission
achieves an EER of 3.0049% and a minDCF of 0.2938; in Task 2, the EER and
minDCF are 6.2060% and 0.5232, respectively. Our approach achieves excellent
performance and ranks 1st in both challenge tasks.
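To make the back-end concrete, below is a minimal sketch (not the authors' released code) of how a Sub-Mean plus AS-Norm scoring pipeline over cosine similarities is commonly implemented; the embedding dimension, cohort size, top-K value, and helper names such as sub_mean and as_norm are illustrative assumptions.

```python
# Minimal sketch of a Sub-Mean + AS-Norm back-end for speaker verification.
# Assumptions (not from the paper): length-normalized embeddings, cosine
# scoring, a single in-domain cohort, and top-K adaptive normalization.
import numpy as np


def length_norm(x: np.ndarray) -> np.ndarray:
    """L2-normalize embeddings along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def sub_mean(emb: np.ndarray, domain_mean: np.ndarray) -> np.ndarray:
    """Sub-Mean: subtract a mean embedding estimated on in-domain
    (e.g. far-field) data to reduce domain mismatch, then re-normalize."""
    return length_norm(emb - domain_mean)


def cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine scores between two sets of length-normalized embeddings."""
    return a @ b.T


def as_norm(raw: float, enroll: np.ndarray, test: np.ndarray,
            cohort: np.ndarray, top_k: int = 300) -> float:
    """Adaptive symmetric score normalization (AS-Norm): standardize the raw
    trial score against the top-K cohort scores of the enrollment side and of
    the test side, then average the two normalized scores."""
    e_scores = np.sort(cosine(enroll[None, :], cohort)[0])[-top_k:]
    t_scores = np.sort(cosine(test[None, :], cohort)[0])[-top_k:]
    z_e = (raw - e_scores.mean()) / (e_scores.std() + 1e-8)
    z_t = (raw - t_scores.mean()) / (t_scores.std() + 1e-8)
    return 0.5 * (z_e + z_t)


# Toy usage with random embeddings standing in for network outputs.
rng = np.random.default_rng(0)
dim = 256
cohort = length_norm(rng.normal(size=(1000, dim)))   # cohort embeddings
domain_mean = cohort.mean(axis=0)                    # in-domain mean
enroll = sub_mean(length_norm(rng.normal(size=dim)), domain_mean)
test = sub_mean(length_norm(rng.normal(size=dim)), domain_mean)
cohort = sub_mean(cohort, domain_mean)

raw = float(cosine(enroll[None, :], test[None, :])[0, 0])
print(f"raw cosine: {raw:.4f}  AS-Norm score: {as_norm(raw, enroll, test, cohort):.4f}")
```

In this sketch the in-domain mean is subtracted from every embedding before scoring, and each raw cosine score is then standardized against the top-K cohort scores of both the enrollment and test sides; the actual cohort selection and parameter choices used by the authors are not specified in the abstract.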
Related papers
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models and to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - Improving Cross-task Generalization of Unified Table-to-text Models with
Compositional Task Configurations [63.04466647849211]
Methods typically encode task information with a simple dataset name as a prefix to the encoder.
We propose compositional task configurations, a set of prompts prepended to the encoder to improve cross-task generalization.
We show this not only allows the model to better learn shared knowledge across different tasks at training, but also allows us to control the model by composing new configurations.
arXiv Detail & Related papers (2022-12-17T02:20:14Z) - M3ST: Mix at Three Levels for Speech Translation [66.71994367650461]
We propose Mix at three levels for Speech Translation (M3ST) method to increase the diversity of the augmented training corpus.
In the first stage of fine-tuning, we mix the training corpus at three levels, including word level, sentence level and frame level, and fine-tune the entire model with mixed data.
Experiments on MuST-C speech translation benchmark and analysis show that M3ST outperforms current strong baselines and achieves state-of-the-art results on eight directions with an average BLEU of 29.9.
arXiv Detail & Related papers (2022-12-07T14:22:00Z) - Toward Efficient Language Model Pretraining and Downstream Adaptation
via Self-Evolution: A Case Study on SuperGLUE [203.65227947509933]
This report describes our JDExplore d-team's Vega v2 submission on the SuperGLUE leaderboard.
SuperGLUE is more challenging than the widely used general language understanding evaluation (GLUE) benchmark, containing eight difficult language understanding tasks.
arXiv Detail & Related papers (2022-12-04T15:36:18Z) - Cross-Modal Adapter for Text-Video Retrieval [91.9575196703281]
We present a novel Cross-Modal Adapter for parameter-efficient fine-tuning.
Inspired by adapter-based methods, we adjust the pre-trained model with a few parameterization layers.
It achieves superior or comparable performance compared to fully fine-tuned methods on MSR-VTT, MSVD, VATEX, ActivityNet, and DiDeMo datasets.
arXiv Detail & Related papers (2022-11-17T16:15:30Z) - The SpeakIn System Description for CNSRC2022 [14.173172568687413]
This report describes our speaker verification systems for the tasks of the CN-Celeb Speaker Recognition Challenge 2022 (CNSRC 2022).
The challenge includes two tasks, namely speaker verification (SV) and speaker retrieval (SR).
arXiv Detail & Related papers (2022-09-22T08:17:47Z) - The ReturnZero System for VoxCeleb Speaker Recognition Challenge 2022 [0.0]
We describe the top-scoring submissions from team RTZR to the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22).
The top-performing system is a fusion of 7 models spanning 3 different types of model architectures.
The final submission achieves 0.165 DCF and 2.912% EER on the VoxSRC22 test set.
arXiv Detail & Related papers (2022-09-21T06:54:24Z) - The Royalflush System for VoxCeleb Speaker Recognition Challenge 2022 [4.022057598291766]
We describe the Royalflush submissions for the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22).
For track 1, we develop a powerful U-Net-based speaker embedding extractor with a symmetric architecture.
For track 3, we employ the joint training of source domain supervision and target domain self-supervision to get a speaker embedding extractor.
arXiv Detail & Related papers (2022-09-19T13:35:36Z) - Phonemer at WNUT-2020 Task 2: Sequence Classification Using COVID
Twitter BERT and Bagging Ensemble Technique based on Plurality Voting [0.0]
We develop a system that automatically identifies whether an English Tweet related to the novel coronavirus (COVID-19) is informative or not.
Our final approach achieved an F1-score of 0.9037 and we were ranked sixth overall with F1-score as the evaluation criteria.
arXiv Detail & Related papers (2020-10-01T10:54:54Z) - Device-Robust Acoustic Scene Classification Based on Two-Stage
Categorization and Data Augmentation [63.98724740606457]
We present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge.
Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten different fine-grained classes.
Task 1b concerns the classification of data into three higher-level classes using low-complexity solutions.
arXiv Detail & Related papers (2020-07-16T15:07:14Z)