The Hitachi-JHU DIHARD III System: Competitive End-to-End Neural
Diarization and X-Vector Clustering Systems Combined by DOVER-Lap
- URL: http://arxiv.org/abs/2102.01363v1
- Date: Tue, 2 Feb 2021 07:30:44 GMT
- Authors: Shota Horiguchi, Nelson Yalta, Paola Garcia, Yuki Takashima, Yawen
Xue, Desh Raj, Zili Huang, Yusuke Fujita, Shinji Watanabe, Sanjeev Khudanpur
- Score: 67.395341302752
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper provides a detailed description of the Hitachi-JHU system that was
submitted to the Third DIHARD Speech Diarization Challenge. The system outputs
the ensemble results of the five subsystems: two x-vector-based subsystems, two
end-to-end neural diarization-based subsystems, and one hybrid subsystem. We
refined each subsystem so that all five became competitive and complementary.
After DOVER-Lap-based system combination, the system achieved diarization error
rates (DERs) of 11.58% and 14.09% on the Track 1 full and core sets, and 16.94%
and 20.01% on the Track 2 full and core sets, respectively. With these results,
we won second place in all the tasks of the challenge.
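For readers unfamiliar with the metric, the diarization error rate quoted in the abstract is conventionally the sum of missed speech, false-alarm speech, and speaker confusion, divided by the total reference speaker time. The following is a minimal frame-level sketch of that definition, not the challenge's scoring tool. The function name `frame_der` and the toy data are hypothetical, and the sketch assumes that hypothesis speaker labels have already been mapped onto reference labels and that every frame has equal duration.

```python
def frame_der(ref, hyp):
    """Frame-level DER for two equal-length lists of speaker-label sets.

    Assumes hypothesis labels are already mapped to reference labels
    (a real scorer finds the optimal mapping first).
    """
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must cover the same number of frames")
    miss = false_alarm = confusion = total = 0
    for r, h in zip(ref, hyp):
        n_ref, n_hyp = len(r), len(h)
        n_correct = len(r & h)                      # speakers labeled correctly
        total += n_ref                              # reference speaker time
        miss += max(0, n_ref - n_hyp)               # reference speakers with no output
        false_alarm += max(0, n_hyp - n_ref)        # extra hypothesized speakers
        confusion += min(n_ref, n_hyp) - n_correct  # wrong speaker assigned
    if total == 0:
        raise ValueError("reference contains no speech")
    return (miss + false_alarm + confusion) / total


# Toy example: five frames, one overlapped frame, one non-speech frame.
ref = [{"A"}, {"A"}, {"A", "B"}, {"B"}, set()]
hyp = [{"A"}, {"B"}, {"A"}, {"B"}, {"B"}]
print(f"DER = {frame_der(ref, hyp):.2%}")  # DER = 60.00%
```

In the toy example, one frame is a confusion, one overlapped frame misses a speaker, and one non-speech frame is a false alarm, giving 3 errors over 5 units of reference speaker time.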
Related papers
- Distilling System 2 into System 1 [35.194258450176534]
Large language models (LLMs) can spend extra compute during inference to generate intermediate thoughts.
We show that several such techniques can be successfully distilled, resulting in improved results compared to the original System 1 performance.
(2024-07-08T15:17:46Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
(2024-01-07T08:59:32Z)
- Two-pass Decoding and Cross-adaptation Based System Combination of End-to-end Conformer and Hybrid TDNN ASR Systems [61.90743116707422]
This paper investigates multi-pass rescoring and cross adaptation based system combination approaches for hybrid TDNN and Conformer E2E ASR systems.
The best combined system, obtained using multi-pass rescoring, produced statistically significant word error rate (WER) reductions of 2.5% to 3.9% absolute (22.5% to 28.9% relative) over the standalone Conformer system on the NIST Hub5'00, Rt03 and Rt02 evaluation data.
(2022-06-23T10:17:13Z)
- Investigations on Speech Recognition Systems for Low-Resource Dialectal Arabic-English Code-Switching Speech [32.426525641734344]
We present our work on code-switched Egyptian Arabic-English automatic speech recognition (ASR).
We build our ASR systems using DNN-based hybrid and Transformer-based end-to-end models.
We show that recognition can be improved by combining the outputs of both systems.
(2021-08-29T17:23:30Z)
- Joint System-Wise Optimization for Pipeline Goal-Oriented Dialog System [76.22810715401147]
We propose new joint system-wise optimization techniques for the pipeline dialog system.
First, we propose a new data augmentation approach which automates the labeling process for NLU training.
Second, we propose a novel policy parameterization with Poisson distribution that enables better exploration and offers a way to compute policy gradient.
(2021-06-09T06:44:57Z)
- USTC-NELSLIP System Description for DIHARD-III Challenge [78.40959509760488]
The innovation of our system lies in the combination of various front-end techniques to solve the diarization problem.
Our best system achieved DERs of 11.30% in track 1 and 16.78% in track 2 on the evaluation set.
(2021-03-19T07:00:51Z)
- Effects of Word-frequency based Pre- and Post- Processings for Audio Captioning [49.41766997393417]
The system we used for Task 6 (Automated Audio Captioning) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge combines three elements, namely, data augmentation, multi-task learning, and post-processing, for audio captioning.
The system received the highest evaluation scores, but which of the individual elements most fully contributed to its performance has not yet been clarified.
(2020-09-24T01:07:33Z)
- DIHARD II is Still Hard: Experimental Results and Discussions from the DKU-LENOVO Team [22.657782236219933]
We present the submitted system for the second DIHARD Speech Diarization Challenge from the DKULE team.
Our diarization system includes multiple modules, namely voice activity detection (VAD), segmentation, speaker embedding extraction, similarity scoring, clustering, resegmentation and overlap detection.
Although our systems achieved relative DER reductions of 27.5% and 31.7% over the official baselines, we believe that the diarization task is still very difficult.
(2020-02-23T11:50:32Z)