The Volcspeech system for the ICASSP 2022 multi-channel multi-party
meeting transcription challenge
- URL: http://arxiv.org/abs/2202.04261v2
- Date: Thu, 10 Feb 2022 02:58:07 GMT
- Title: The Volcspeech system for the ICASSP 2022 multi-channel multi-party
meeting transcription challenge
- Authors: Chen Shen, Yi Liu, Wenzhi Fan, Bin Wang, Shixue Wen, Yao Tian, Jun
Zhang, Jingsheng Yang, Zejun Ma
- Abstract summary: This paper describes our submission to the ICASSP 2022 Multi-channel Multi-party Meeting Transcription (M2MeT) Challenge.
For Track 1, we propose several approaches to empower the clustering-based speaker diarization system.
For Track 2, we develop our system using the Conformer model in a joint CTC-attention architecture.
- Score: 18.33054364289739
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper describes our submission to the ICASSP 2022 Multi-channel Multi-party
Meeting Transcription (M2MeT) Challenge. For Track 1, we propose several
approaches to empower the clustering-based speaker diarization system to handle
overlapped speech. Front-end dereverberation and the direction-of-arrival (DOA)
estimation are used to improve the accuracy of speaker diarization.
Multi-channel combination and overlap detection are applied to reduce the
missed speaker error. A modified DOVER-Lap is also proposed to fuse the results
of different systems. We achieve a final DER of 5.79% on the Eval set and
7.23% on the Test set. For Track 2, we develop our system using the Conformer
model in a joint CTC-attention architecture. Serialized output training is
adopted for multi-speaker overlapped speech recognition. We propose a neural
front-end module to model multi-channel audio and train the model end-to-end.
Various data augmentation methods are utilized to mitigate over-fitting in the
multi-channel multi-speaker E2E system. Transformer language model fusion is
developed to achieve better performance. The final CER is 19.2% on the Eval set
and 20.8% on the Test set.
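The Track 2 results above are reported as character error rate (CER). As a rough illustration of what that number measures, the sketch below assumes the common definition (character-level edit distance divided by the reference length). It is not the challenge's official scoring script, and the function names `edit_distance` and `cer` are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions and deletions
    needed to turn `ref` into `hyp`.
    """
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate under the assumed definition: edits / len(ref)."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition, a CER of 19.2% corresponds to roughly 19 character edits per 100 reference characters. The value can exceed 100% when the hypothesis contains many insertions.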
Related papers
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- End-to-end Multichannel Speaker-Attributed ASR: Speaker Guided Decoder and Input Feature Analysis [0.0]
We present an end-to-end multichannel speaker-attributed automatic speech recognition (MC-SA-ASR) system that combines a Conformer-based encoder with multi-frame crosschannel attention and a speaker-attributed Transformer-based decoder.
arXiv Detail & Related papers (2023-10-16T06:40:18Z)
- TOLD: A Novel Two-Stage Overlap-Aware Framework for Speaker Diarization [54.41494515178297]
We reformulate speaker diarization as a single-label classification problem.
We propose the overlap-aware EEND (EEND-OLA) model, in which speaker overlaps and dependency can be modeled explicitly.
Compared with the original EEND, the proposed EEND-OLA achieves a 14.39% relative improvement in terms of diarization error rates.
arXiv Detail & Related papers (2023-03-08T05:05:26Z)
- Two-pass Decoding and Cross-adaptation Based System Combination of End-to-end Conformer and Hybrid TDNN ASR Systems [61.90743116707422]
This paper investigates multi-pass rescoring and cross adaptation based system combination approaches for hybrid TDNN and Conformer E2E ASR systems.
The best combined system obtained using multi-pass rescoring produced statistically significant word error rate (WER) reductions of 2.5% to 3.9% absolute (22.5% to 28.9% relative) over the stand alone Conformer system on the NIST Hub5'00, Rt03 and Rt02 evaluation data.
arXiv Detail & Related papers (2022-06-23T10:17:13Z)
- The USTC-Ximalaya system for the ICASSP 2022 multi-channel multi-party meeting transcription (M2MeT) challenge [43.262531688434215]
We propose two improvements to target-speaker voice activity detection (TS-VAD).
These techniques are designed to handle multi-speaker conversations in real-world meeting scenarios with high speaker-overlap ratios and under heavily reverberant and noisy conditions.
arXiv Detail & Related papers (2022-02-10T06:06:48Z)
- Royalflush Speaker Diarization System for ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge [4.022057598291766]
This paper describes the Royalflush speaker diarization system submitted to the Multi-channel Multi-party Meeting Transcription Challenge.
The system comprises speech enhancement, overlapped speech detection, speaker embedding extraction, speaker clustering, speech separation and system fusion.
arXiv Detail & Related papers (2022-02-10T03:35:05Z)
- The RoyalFlush System of Speech Recognition for M2MeT Challenge [5.863625637354342]
This paper describes our RoyalFlush system for the track of multi-speaker automatic speech recognition (ASR) in the M2MeT challenge.
We adopted the serialized output training (SOT) based multi-speaker ASR system with large-scale simulation data.
Our system achieved a 12.22% absolute Character Error Rate (CER) reduction on the validation set and 12.11% on the test set.
arXiv Detail & Related papers (2022-02-03T14:38:26Z)
- Multi-turn RNN-T for streaming recognition of multi-party speech [2.899379040028688]
This work takes real-time applicability as the first priority in model design and addresses a few challenges in previous work on the multi-speaker recurrent neural network transducer (MS-RNN-T).
We introduce on-the-fly overlapping speech simulation during training, yielding a 14% relative word error rate (WER) improvement on the LibriSpeechMix test set.
We propose a novel multi-turn RNN-T (MT-RNN-T) model with an overlap-based target arrangement strategy that generalizes to an arbitrary number of speakers without changes in the model architecture.
arXiv Detail & Related papers (2021-12-19T17:22:58Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separation front-end and recognition back-end.
Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reduction on overlapped speech constructed using either simulation or replaying of the lipreading sentence 2 dataset respectively.
arXiv Detail & Related papers (2020-05-18T10:31:19Z)
- End-to-End Multi-speaker Speech Recognition with Transformer [88.22355110349933]
We replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture.
We also modify the self-attention component to be restricted to a segment rather than the whole sequence in order to reduce computation.
arXiv Detail & Related papers (2020-02-10T16:29:26Z)