DIHARD II is Still Hard: Experimental Results and Discussions from the
DKU-LENOVO Team
- URL: http://arxiv.org/abs/2002.12761v2
- Date: Tue, 5 May 2020 02:46:24 GMT
- Title: DIHARD II is Still Hard: Experimental Results and Discussions from the
DKU-LENOVO Team
- Authors: Qingjian Lin, Weicheng Cai, Lin Yang, Junjie Wang, Jun Zhang, Ming Li
- Abstract summary: We present the submitted system for the second DIHARD Speech Diarization Challenge from the DKU-LENOVO team.
Our diarization system includes multiple modules, namely voice activity detection (VAD), segmentation, speaker embedding extraction, similarity scoring, clustering, resegmentation and overlap detection.
Although our systems have reduced the DERs by 27.5% and 31.7% relative to the official baselines, we believe that the diarization task is still very difficult.
- Score: 22.657782236219933
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present the submitted system for the second DIHARD Speech
Diarization Challenge from the DKU-LENOVO team. Our diarization system includes
multiple modules, namely voice activity detection (VAD), segmentation, speaker
embedding extraction, similarity scoring, clustering, resegmentation and
overlap detection. For each module, we explore different techniques to enhance
performance. Our final submission employs the ResNet-LSTM based VAD, the Deep
ResNet based speaker embedding, the LSTM based similarity scoring and spectral
clustering. Variational Bayes (VB) diarization is applied in the resegmentation
stage and overlap detection also brings slight improvement. Our proposed system
achieves 18.84% DER in Track 1 and 27.90% DER in Track 2. Although our systems
have reduced the DERs by 27.5% and 31.7% relative to the official
baselines, we believe that the diarization task is still very difficult.
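To make the pipeline above concrete, here is a minimal sketch of a clustering-based diarization back-end: segment embeddings are scored pairwise and grouped by spectral clustering, mirroring the embedding extraction, similarity scoring, and clustering stages the abstract describes. It is illustrative only; the submitted system uses Deep ResNet embeddings and an LSTM-based similarity scorer, whereas this sketch substitutes plain cosine similarity, and the "segment embeddings" here are synthetic.

```python
# Minimal sketch of a clustering-based diarization back-end (not the paper's code).
import numpy as np
from sklearn.cluster import SpectralClustering


def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between L2-normalized segment embeddings."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ normed.T


def diarize(embeddings: np.ndarray, num_speakers: int) -> np.ndarray:
    """Assign each speech segment a speaker label via spectral clustering."""
    # Cosine similarity stands in for the paper's learned LSTM similarity scorer.
    affinity = np.clip(cosine_similarity_matrix(embeddings), 0.0, None)
    clusterer = SpectralClustering(
        n_clusters=num_speakers,
        affinity="precomputed",  # we supply the affinity matrix ourselves
        assign_labels="kmeans",
        random_state=0,
    )
    return clusterer.fit_predict(affinity)


if __name__ == "__main__":
    # Toy input: ten 256-dim "segment embeddings" drawn around two synthetic speakers.
    rng = np.random.default_rng(0)
    spk_a, spk_b = rng.normal(size=256), rng.normal(size=256)
    segments = np.stack(
        [spk_a + 0.1 * rng.normal(size=256) for _ in range(5)]
        + [spk_b + 0.1 * rng.normal(size=256) for _ in range(5)]
    )
    print(diarize(segments, num_speakers=2))  # e.g. [0 0 0 0 0 1 1 1 1 1]
```

In a real system the number of speakers is unknown and must be estimated (commonly from the eigenvalue spectrum of the affinity matrix) before the VB resegmentation and overlap-detection stages refine the output.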
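The reported relative reductions also let us back out the implied official baseline DERs. A quick check, assuming "relative" follows the usual convention of (baseline − system) / baseline:

```python
# Implied official baseline DERs from the reported relative reductions,
# assuming relative reduction = (baseline - system) / baseline.
for track, der, rel in [("Track 1", 18.84, 0.275), ("Track 2", 27.90, 0.317)]:
    baseline = der / (1.0 - rel)
    print(f"{track}: system {der:.2f}% DER -> implied baseline {baseline:.2f}% DER")
# Track 1: system 18.84% DER -> implied baseline 25.99% DER
# Track 2: system 27.90% DER -> implied baseline 40.85% DER
```

Even with roughly 30% relative improvement over those baselines, absolute DERs of 18.84% and 27.90% support the paper's point that DIHARD II "is still hard".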
Related papers
- TCG CREST System Description for the Second DISPLACE Challenge [19.387615374726444]
We describe the speaker diarization (SD) and language diarization (LD) systems developed by our team for the Second DISPLACE Challenge, 2024.
Our contributions were dedicated to Track 1 for SD and Track 2 for LD in multilingual and multi-speaker scenarios.
arXiv Detail & Related papers (2024-09-16T05:13:34Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes the representation of each modality by fusing them at different levels of the audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- TOLD: A Novel Two-Stage Overlap-Aware Framework for Speaker Diarization [54.41494515178297]
We reformulate speaker diarization as a single-label classification problem.
We propose the overlap-aware EEND (EEND-OLA) model, in which speaker overlaps and dependency can be modeled explicitly.
Compared with the original EEND, the proposed EEND-OLA achieves a 14.39% relative improvement in terms of diarization error rates.
arXiv Detail & Related papers (2023-03-08T05:05:26Z)
- Joint Speech Activity and Overlap Detection with Multi-Exit Architecture [5.4878772986187565]
Overlapped speech detection (OSD) is critical for speech applications in multi-party conversation scenarios.
This study investigates the joint VAD and OSD task from a new perspective.
In particular, we propose to extend the traditional classification network with a multi-exit architecture.
arXiv Detail & Related papers (2022-09-24T02:34:11Z)
- The USTC-Ximalaya system for the ICASSP 2022 multi-channel multi-party meeting transcription (M2MeT) challenge [43.262531688434215]
We propose two improvements to target-speaker voice activity detection (TS-VAD).
These techniques are designed to handle multi-speaker conversations in real-world meeting scenarios with high speaker-overlap ratios and under heavily reverberant and noisy conditions.
arXiv Detail & Related papers (2022-02-10T06:06:48Z)
- The Volcspeech system for the ICASSP 2022 multi-channel multi-party meeting transcription challenge [18.33054364289739]
This paper describes our submission to ICASSP 2022 Multi-channel Multi-party Meeting Transcription (M2MeT) Challenge.
For Track 1, we propose several approaches to empower the clustering-based speaker diarization system.
For Track 2, we develop our system using the Conformer model in a joint CTC-attention architecture.
arXiv Detail & Related papers (2022-02-09T03:38:39Z)
- STC speaker recognition systems for the NIST SRE 2021 [56.05258832139496]
This paper presents a description of STC Ltd. systems submitted to the NIST 2021 Speaker Recognition Evaluation.
These systems consist of a number of diverse subsystems that use deep neural networks as feature extractors.
For the video modality, we developed our best solution with the RetinaFace face detector and a deep ResNet face-embedding extractor trained on large face image datasets.
arXiv Detail & Related papers (2021-11-03T15:31:01Z)
- Disentangle Your Dense Object Detector [82.22771433419727]
Deep learning-based dense object detectors have achieved great success in the past few years and have been applied to numerous multimedia applications such as video understanding.
However, the current training pipeline for dense detectors relies on many conjunctions that may not hold.
We propose Disentangled Dense Object Detector (DDOD), in which simple and effective disentanglement mechanisms are designed and integrated into the current state-of-the-art detectors.
arXiv Detail & Related papers (2021-07-07T00:52:16Z)
- USTC-NELSLIP System Description for DIHARD-III Challenge [78.40959509760488]
The innovation of our system lies in the combination of various front-end techniques to solve the diarization problem.
Our best system achieved DERs of 11.30% in Track 1 and 16.78% in Track 2 on the evaluation set.
arXiv Detail & Related papers (2021-03-19T07:00:51Z)
- Effects of Word-frequency based Pre- and Post- Processings for Audio Captioning [49.41766997393417]
The system we used for Task 6 (Automated Audio Captioning) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge combines three elements, namely, data augmentation, multi-task learning, and post-processing, for audio captioning.
The system received the highest evaluation scores, but which of the individual elements most fully contributed to its performance has not yet been clarified.
arXiv Detail & Related papers (2020-09-24T01:07:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.