TOLD: A Novel Two-Stage Overlap-Aware Framework for Speaker Diarization
- URL: http://arxiv.org/abs/2303.05397v2
- Date: Wed, 13 Dec 2023 12:03:39 GMT
- Title: TOLD: A Novel Two-Stage Overlap-Aware Framework for Speaker Diarization
- Authors: Jiaming Wang, Zhihao Du, Shiliang Zhang
- Abstract summary: We reformulate speaker diarization as a single-label classification problem.
We propose the overlap-aware EEND (EEND-OLA) model, in which speaker overlaps and dependencies can be modeled explicitly.
Compared with the original EEND, the proposed EEND-OLA achieves a 14.39% relative improvement in terms of diarization error rate.
- Score: 54.41494515178297
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, end-to-end neural diarization (EEND) has been introduced and
achieves promising results in speaker-overlapped scenarios. In EEND, speaker
diarization is formulated as a multi-label prediction problem, where speaker
activities are estimated independently and their dependencies are not well
considered. To overcome these disadvantages, we employ power set encoding to
reformulate speaker diarization as a single-label classification problem and
propose the overlap-aware EEND (EEND-OLA) model, in which speaker overlaps and
dependencies can be modeled explicitly. Inspired by the success of two-stage hybrid systems,
we further propose a novel Two-stage OverLap-aware Diarization framework (TOLD)
by involving a speaker overlap-aware post-processing (SOAP) model to
iteratively refine the diarization results of EEND-OLA. Experimental results
show that, compared with the original EEND, the proposed EEND-OLA achieves a
14.39% relative improvement in terms of diarization error rate (DER), and
utilizing SOAP provides another 19.33% relative improvement. As a result, our
method TOLD achieves a DER of 10.14% on the CALLHOME dataset, which is a new
state-of-the-art result on this benchmark to the best of our knowledge.
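
To make the power set encoding (PSE) reformulation above concrete, the sketch below shows how a frame-level multi-label speaker-activity vector can be mapped to a single class over speaker subsets, so that overlap combinations become explicit classes. This is a minimal illustration under assumed settings (three speakers, at most two active at once), not the authors' implementation.

```python
# Minimal sketch of power set encoding (PSE) for speaker diarization.
# Illustrative only: the speaker count and the cap on simultaneously
# active speakers are assumptions made for this example, not values
# taken from the paper.
from itertools import combinations


def build_powerset_table(num_speakers: int, max_overlap: int):
    """Enumerate every subset of up to `max_overlap` active speakers.

    The index of each subset in the returned list is the single class
    label that replaces the per-speaker multi-label activity vector.
    """
    table = [()]  # class 0: silence (no active speaker)
    for k in range(1, max_overlap + 1):
        table.extend(combinations(range(num_speakers), k))
    return table


def encode(active_speakers, table):
    """Map the set of speakers active in one frame to its PSE class."""
    return table.index(tuple(sorted(active_speakers)))


def decode(label, table):
    """Map a PSE class back to the set of active speakers."""
    return set(table[label])


if __name__ == "__main__":
    table = build_powerset_table(num_speakers=3, max_overlap=2)
    # table -> [(), (0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]
    frame = {0, 2}                    # speakers 0 and 2 overlap
    label = encode(frame, table)      # -> 5, a single class label
    assert decode(label, table) == frame
    print(f"{len(table)} classes; overlapped frame maps to class {label}")
```

Because every overlap combination is its own class, a softmax over these subsets models speaker dependency jointly instead of predicting each speaker's activity independently, which is the property the abstract attributes to EEND-OLA.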
Related papers
- DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification [55.306583814017046]
We present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification.
DASA generates diversified training samples in speaker embedding space with negligible extra computing cost.
The best result achieves a 14.6% relative reduction in EER on the CN-Celeb evaluation set.
arXiv Detail & Related papers (2023-10-18T17:07:05Z)
- A Deliberation-based Joint Acoustic and Text Decoder [25.37972380217875]
We propose a new two-pass E2E speech recognition model that improves ASR performance by training on a combination of paired data and unpaired text data.
Our method, dubbed Deliberation-JATD, combines the spelling correcting abilities of deliberation with JATD's use of unpaired text data to further improve performance.
arXiv Detail & Related papers (2023-03-23T18:02:23Z)
- X-SepFormer: End-to-end Speaker Extraction Network with Explicit Optimization on Speaker Confusion [5.4878772986187565]
We present an end-to-end TSE model with proposed loss schemes and a backbone of SepFormer.
With SI-SDRi of 19.4 dB and PESQ of 3.81, our best system significantly outperforms the current SOTA systems.
arXiv Detail & Related papers (2023-03-09T04:00:29Z)
- A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings [53.120885867427305]
Three approaches are evaluated for speaker-attributed automatic speech recognition (SA-ASR) in a meeting scenario.
The WD-SOT approach achieves a 10.7% relative reduction in averaged speaker-dependent character error rate (SD-CER) compared with the FD-SOT approach.
The TS-ASR approach also outperforms the FD-SOT approach and brings a 16.5% relative reduction in average SD-CER.
arXiv Detail & Related papers (2022-03-31T06:39:14Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- The Volcspeech system for the ICASSP 2022 multi-channel multi-party meeting transcription challenge [18.33054364289739]
This paper describes our submission to ICASSP 2022 Multi-channel Multi-party Meeting Transcription (M2MeT) Challenge.
For Track 1, we propose several approaches to empower the clustering-based speaker diarization system.
For Track 2, we develop our system using the Conformer model in a joint CTC-attention architecture.
arXiv Detail & Related papers (2022-02-09T03:38:39Z)
- Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model.
The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses.
A bidirectional encoder is used to extract context information from first-pass hypotheses.
arXiv Detail & Related papers (2020-03-17T22:01:12Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.