Royalflush Speaker Diarization System for ICASSP 2022 Multi-channel
Multi-party Meeting Transcription Challenge
- URL: http://arxiv.org/abs/2202.04814v1
- Date: Thu, 10 Feb 2022 03:35:05 GMT
- Title: Royalflush Speaker Diarization System for ICASSP 2022 Multi-channel
Multi-party Meeting Transcription Challenge
- Authors: Jingguang Tian, Xinhui Hu, Xinkang Xu
- Abstract summary: The Royalflush speaker diarization system was submitted to the ICASSP 2022 Multi-channel Multi-party Meeting Transcription (M2MeT) Challenge.
The system comprises speech enhancement, overlapped speech detection, speaker embedding extraction, speaker clustering, speech separation, and system fusion.
- Score: 4.022057598291766
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper describes the Royalflush speaker diarization system
submitted to the ICASSP 2022 Multi-channel Multi-party Meeting Transcription
(M2MeT) Challenge. Our system comprises speech enhancement, overlapped speech
detection, speaker embedding extraction, speaker clustering, speech
separation, and system fusion. In this system, we make three contributions.
First, we propose an architecture that combines multi-channel and U-Net-based
models for far-field overlapped speech detection, aiming to exploit the
benefits of both individual architectures. Second, so that the overlapped
speech detection model can assist speaker diarization, we propose a
speech-separation-based overlapped speech handling approach in which speaker
verification is further applied. Third, we explore three speaker embedding
methods and obtain state-of-the-art performance on the CNCeleb-E test set.
With these proposals, our best individual system significantly reduces the
diarization error rate (DER) from 15.25% to 6.40%, and the fusion of four
systems achieves a DER of 6.30% on the far-field AliMeeting evaluation set.
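For reference, the DER figures quoted above follow the standard definition:
the total duration of false-alarm, missed, and speaker-confusion speech
divided by the total scored speech time. A minimal sketch in Python (the
error breakdown below is invented purely to illustrate the arithmetic):

```python
def der(false_alarm: float, missed: float, confusion: float,
        total_speech: float) -> float:
    """Diarization error rate over durations given in seconds."""
    return (false_alarm + missed + confusion) / total_speech

# Illustrative breakdown for 3600 s of scored speech time.
print(f"DER = {der(60.0, 90.0, 80.4, 3600.0):.2%}")  # -> DER = 6.40%
```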
Related papers
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes the representation of each modality by fusing them at different levels of the audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
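The fusion idea can be sketched generically: at each encoder level, one
modality's features query the other's through scaled dot-product
cross-attention. The shapes, the single layer, and the residual combination
below are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def cross_attention(q_feats: np.ndarray, kv_feats: np.ndarray) -> np.ndarray:
    """One modality (queries) attends over the other (keys/values)."""
    d = q_feats.shape[-1]
    scores = q_feats @ kv_feats.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ kv_feats

audio = np.random.randn(100, 64)   # 100 audio frames, 64-dim features
video = np.random.randn(25, 64)    # 25 video frames, same feature size
fused_audio = audio + cross_attention(audio, video)     # residual fusion
```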
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- In search of strong embedding extractors for speaker diarisation [49.7017388682077]
We tackle two key problems when adopting embedding extractors (EEs) for speaker diarisation.
First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation.
We show that better performance on widely adopted speaker verification evaluation protocols does not lead to better diarisation performance.
We propose two data augmentation techniques to alleviate the second problem, making embedding extractors aware of overlapped speech or speaker changes in the input.
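Both augmentations can be simulated directly on waveforms; a minimal sketch
(the mixing scheme and parameter choices are assumptions, not the paper's
recipe):

```python
import numpy as np

def mix_overlap(utt_a: np.ndarray, utt_b: np.ndarray,
                snr_db: float = 0.0) -> np.ndarray:
    """Simulate overlapped speech by adding a second speaker at a target SNR."""
    n = min(len(utt_a), len(utt_b))
    a, b = utt_a[:n], utt_b[:n]
    # Scale speaker B so that A's power exceeds B's by snr_db decibels.
    gain = np.sqrt((a ** 2).mean() / ((b ** 2).mean() * 10 ** (snr_db / 10) + 1e-8))
    return a + gain * b

def speaker_change(utt_a: np.ndarray, utt_b: np.ndarray) -> np.ndarray:
    """Simulate a speaker change by concatenating two speakers' utterances."""
    return np.concatenate([utt_a, utt_b])
```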
arXiv Detail & Related papers (2022-10-26T13:00:29Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z) - Multi-scale Speaker Diarization with Dynamic Scale Weighting [14.473173007997751]
We propose a more advanced multi-scale diarization system based on a multi-scale diarization decoder.
Our proposed system achieves state-of-the-art performance on the CALLHOME and AMI MixHeadset datasets, with diarization error rates of 3.92% and 1.05%, respectively.
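The core mechanism is to score speaker similarity at several temporal scales
and fuse the per-scale affinities with weights, which the decoder predicts
dynamically. A static-weight sketch (all values illustrative):

```python
import numpy as np

def cosine_affinity(emb: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between (N, D) segment embeddings."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return normed @ normed.T

n_segments, dim = 50, 32
# Embeddings for the same 50 base segments at three window scales.
scales = [np.random.randn(n_segments, dim) for _ in range(3)]
weights = np.array([0.5, 0.3, 0.2])   # a dynamic model would predict these

fused = sum(w * cosine_affinity(e) for w, e in zip(weights, scales))
# `fused` then feeds a clustering backend such as spectral clustering.
```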
arXiv Detail & Related papers (2022-03-30T01:26:31Z)
- The USTC-Ximalaya system for the ICASSP 2022 multi-channel multi-party meeting transcription (M2MeT) challenge [43.262531688434215]
We propose two improvements to target-speaker voice activity detection (TS-VAD).
These techniques are designed to handle multi-speaker conversations in real-world meeting scenarios with high speaker-overlap ratios and under heavily reverberant and noisy conditions.
arXiv Detail & Related papers (2022-02-10T06:06:48Z)
- The Volcspeech system for the ICASSP 2022 multi-channel multi-party meeting transcription challenge [18.33054364289739]
This paper describes our submission to ICASSP 2022 Multi-channel Multi-party Meeting Transcription (M2MeT) Challenge.
For Track 1, we propose several approaches to empower the clustering-based speaker diarization system.
For Track 2, we develop our system using the Conformer model in a joint CTC-attention architecture.
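In a joint CTC-attention architecture, the two training objectives are
interpolated with a weight; 0.3 is a common choice in the literature, assumed
here since the summary does not state the submission's setting:

```python
def joint_loss(ctc_loss: float, attention_loss: float,
               lam: float = 0.3) -> float:
    """Interpolated joint CTC-attention training objective."""
    return lam * ctc_loss + (1.0 - lam) * attention_loss
```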
arXiv Detail & Related papers (2022-02-09T03:38:39Z)
- A Real-time Speaker Diarization System Based on Spatial Spectrum [14.189768987932364]
We propose a novel systematic approach to tackle several long-standing challenges in speaker diarization tasks.
First, a differential directional microphone array-based approach is exploited to capture the target speakers' voices in far-field adverse environments.
Second, an online speaker-location joint clustering approach is proposed to keep track of speaker locations.
Third, an instant speaker number detector is developed to trigger the mechanism that separates overlapped speech.
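A toy version of online speaker-location joint clustering: a segment joins an
existing speaker only if both its embedding and its direction of arrival (DOA)
are close enough, otherwise a new speaker is opened. The thresholds and
distance measures below are invented for illustration.

```python
import numpy as np

def assign(segments, emb_thr: float = 0.4, doa_thr: float = 20.0):
    """segments: iterable of (embedding_vector, doa_in_degrees)."""
    speakers, labels = [], []            # speakers: list of (embedding, doa)
    for emb, doa in segments:
        best, best_dist = None, np.inf
        for i, (s_emb, s_doa) in enumerate(speakers):
            e_dist = 1.0 - emb @ s_emb / (np.linalg.norm(emb) * np.linalg.norm(s_emb))
            a_dist = abs((doa - s_doa + 180) % 360 - 180)   # wrapped angle
            if e_dist < emb_thr and a_dist < doa_thr and e_dist < best_dist:
                best, best_dist = i, e_dist
        if best is None:                 # no close speaker: open a new one
            speakers.append((emb, doa))
            labels.append(len(speakers) - 1)
        else:
            labels.append(best)
    return labels
```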
arXiv Detail & Related papers (2021-07-20T08:25:23Z)
- Adapting Speaker Embeddings for Speaker Diarisation [30.383712356205084]
The goal of this paper is to adapt speaker embeddings for solving the problem of speaker diarisation.
We propose three techniques that can be used to better adapt the speaker embeddings for diarisation: dimensionality reduction, attention-based embedding aggregation, and non-speech clustering.
The results demonstrate that all three techniques contribute positively to the performance of the diarisation system, achieving an average relative improvement of 25.07% in diarisation error rate over the baseline.
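Two of the three adaptations are easy to sketch: a PCA-style dimensionality
reduction of the embedding space, and an attention-weighted aggregation of
window embeddings into a single segment embedding. The exact mechanisms in
the paper may differ.

```python
import numpy as np

def reduce_dim(embs: np.ndarray, k: int) -> np.ndarray:
    """Project (N, D) embeddings onto their top-k principal components."""
    centered = embs - embs.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

def attentive_pool(embs: np.ndarray) -> np.ndarray:
    """Aggregate (N, D) embeddings, weighting by similarity to their mean."""
    scores = embs @ embs.mean(axis=0)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ embs
```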
arXiv Detail & Related papers (2021-04-07T03:04:47Z)
- Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech [54.75722224061665]
In this work, we investigate different speaker representations and propose integrating pretrained and learnable speaker representations.
The FastSpeech 2 model combined with both pretrained and learnable speaker representations shows strong generalization to few-shot speakers.
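One plausible form of the integration, assumed here purely for illustration:
a frozen embedding from a pretrained speaker encoder is concatenated with a
learnable per-speaker vector trained jointly with the TTS model, and the
result conditions the decoder.

```python
import numpy as np

pretrained = np.random.randn(256)   # frozen output of a speaker encoder
learnable = np.zeros(64)            # per-speaker vector, trained with the TTS
speaker_condition = np.concatenate([pretrained, learnable])
```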
arXiv Detail & Related papers (2021-03-06T10:14:33Z)
- Filling the Gap of Utterance-aware and Speaker-aware Representation for Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In existing retrieval-based multi-turn dialogue modeling, pre-trained language models (PrLMs) used as encoders represent the dialogues only coarsely.
We propose a novel model to fill such a gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
arXiv Detail & Related papers (2020-09-14T15:07:19Z)
- Identify Speakers in Cocktail Parties with End-to-End Attention [48.96655134462949]
This paper presents an end-to-end system that integrates speech source extraction and speaker identification.
We propose a new way to jointly optimize these two parts by max-pooling the speaker predictions along the channel dimension.
End-to-end training results in a system that recognizes one speaker in a two-speaker broadcast speech mixture with 99.9% accuracy and both speakers with 93.9% accuracy.
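The joint-optimization trick reduces to a single pooling step: per-channel
speaker posteriors from the separated outputs are max-pooled along the
channel dimension, so each speaker is credited to whichever output channel
captured them best. The shapes below are illustrative.

```python
import numpy as np

channel_logits = np.random.randn(2, 1000)    # 2 separated channels, 1000 speaker IDs
mixture_logits = channel_logits.max(axis=0)  # max-pool along the channel dimension
```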
arXiv Detail & Related papers (2020-05-22T22:15:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.