The HUAWEI Speaker Diarisation System for the VoxCeleb Speaker
Diarisation Challenge
- URL: http://arxiv.org/abs/2010.11657v2
- Date: Fri, 23 Oct 2020 07:45:47 GMT
- Authors: Renyu Wang, Ruilin Tong, Yu Ting Yeung, Xiao Chen
- Score: 6.6238321827660345
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper describes the system setup of our submission to the
speaker diarisation track (Track 4) of the VoxCeleb Speaker Recognition
Challenge 2020. Our diarisation system consists of a well-trained neural
network based speech enhancement model as a pre-processing front-end for the
input speech signals. We replace conventional energy-based voice activity
detection (VAD) with a neural network based VAD, which provides more accurate
annotation of speech segments containing only background music, noise, and
other interference, and is crucial to diarisation performance. We apply
agglomerative hierarchical clustering (AHC) of x-vectors and variational
Bayesian hidden Markov model (VB-HMM) based iterative clustering for speaker
clustering. Experimental results demonstrate that our proposed system achieves
substantial improvements over the baseline system, yielding a diarisation
error rate (DER) of 10.45% and a Jaccard error rate (JER) of 22.46% on the
evaluation set.
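The AHC stage of the pipeline can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy 4-dimensional x-vectors (real x-vectors are typically 512-dimensional network outputs), the average-linkage choice, and the 0.5 cosine-distance threshold are all assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical x-vectors: one embedding per speech segment.
xvectors = np.array([
    [1.0, 0.1, 0.0, 0.0],
    [0.9, 0.2, 0.1, 0.0],   # close to segment 0 -> likely same speaker
    [0.0, 0.1, 1.0, 0.9],
    [0.1, 0.0, 0.9, 1.0],   # close to segment 2 -> likely same speaker
])

# Condensed cosine-distance matrix between all segment pairs.
dists = pdist(xvectors, metric="cosine")

# Agglomerative hierarchical clustering with average linkage.
Z = linkage(dists, method="average")

# Cut the dendrogram at a tuned cosine-distance threshold to obtain
# per-segment speaker labels (0.5 here is illustrative only).
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)  # segments 0 and 1 share a label; so do segments 2 and 3
```

In a full system such as the one described above, these AHC labels would then serve as the initialisation for VB-HMM resegmentation rather than as the final output.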
Related papers
- Unsupervised Speaker Diarization in Distributed IoT Networks Using Federated Learning [2.3076690318595676]
This paper presents a computationally efficient and distributed speaker diarization framework for networked IoT-style audio devices.
A Federated Learning model can identify the participants in a conversation without the requirement of a large audio database for training.
An unsupervised online update mechanism is proposed for the Federated Learning model which depends on cosine similarity of speaker embeddings.
arXiv Detail & Related papers (2024-04-16T18:40:28Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- STC speaker recognition systems for the NIST SRE 2021 [56.05258832139496]
This paper presents a description of STC Ltd. systems submitted to the NIST 2021 Speaker Recognition Evaluation.
These systems consist of a number of diverse subsystems based on deep neural networks used as feature extractors.
For the video modality, we developed our best solution with the RetinaFace face detector and a deep ResNet face embedding extractor trained on large face image datasets.
arXiv Detail & Related papers (2021-11-03T15:31:01Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- Three-class Overlapped Speech Detection using a Convolutional Recurrent Neural Network [32.59704287230343]
The proposed approach classifies input audio into three classes: non-speech, single-speaker speech, and overlapped speech.
A convolutional recurrent neural network architecture is explored to benefit from both convolutional layer's capability to model local patterns and recurrent layer's ability to model sequential information.
The proposed overlapped speech detection model establishes a state-of-the-art performance with a precision of 0.6648 and a recall of 0.3222 on the DIHARD II evaluation set.
arXiv Detail & Related papers (2021-04-07T03:01:34Z)
- AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach for speaker recognition tasks, named AutoSpeech.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 back-bones, while enjoying lower model complexity.
arXiv Detail & Related papers (2020-05-07T02:53:47Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the system quality degradation on short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
- LEAP System for SRE19 CTS Challenge -- Improvements and Error Analysis [36.35711634925221]
We provide a detailed account of the LEAP SRE system submitted to the CTS challenge.
All the systems used the time-delay neural network (TDNN) based x-vector embeddings.
The system combination of generative and neural PLDA models resulted in significant improvements for the SRE evaluation dataset.
arXiv Detail & Related papers (2020-02-07T12:28:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.