Improving Speaker Assignment in Speaker-Attributed ASR for Real Meeting Applications
- URL: http://arxiv.org/abs/2403.06570v2
- Date: Thu, 5 Sep 2024 07:46:09 GMT
- Title: Improving Speaker Assignment in Speaker-Attributed ASR for Real Meeting Applications
- Authors: Can Cui, Imran Ahamad Sheikh, Mostafa Sadeghi, Emmanuel Vincent
- Abstract summary: We present a novel study aiming to optimize the use of a Speaker-Attributed ASR (SA-ASR) system in real-life scenarios.
We propose a pipeline tailored to real-life applications involving Voice Activity Detection (VAD), Speaker Diarization (SD), and SA-ASR.
- Score: 18.151884620928936
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Past studies on end-to-end meeting transcription have focused on model architecture and have mostly been evaluated on simulated meeting data. We present a novel study aiming to optimize the use of a Speaker-Attributed ASR (SA-ASR) system in real-life scenarios, such as the AMI meeting corpus, for improved speaker assignment of speech segments. First, we propose a pipeline tailored to real-life applications involving Voice Activity Detection (VAD), Speaker Diarization (SD), and SA-ASR. Second, we advocate using VAD output segments to fine-tune the SA-ASR model, considering that it is also applied to VAD segments during test, and show that this results in a relative reduction of Speaker Error Rate (SER) up to 28%. Finally, we explore strategies to enhance the extraction of the speaker embedding templates used as inputs by the SA-ASR system. We show that extracting them from SD output rather than annotated speaker segments results in a relative SER reduction up to 20%.
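As a rough illustration of the pipeline described in the abstract (not the authors' implementation), the stages can be sketched with toy stand-ins: an energy-based VAD yields speech segments, and speaker "templates" are averaged over diarized segments rather than annotated ones. All function bodies here are hypothetical simplifications.

```python
import numpy as np

def vad(audio, frame_len=160, thresh=0.01):
    """Toy energy-based VAD: return (start, end) sample ranges of active speech."""
    n = len(audio) // frame_len * frame_len
    frames = audio[:n].reshape(-1, frame_len)
    active = (frames ** 2).mean(axis=1) > thresh
    segs, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i * frame_len
        if not a and start is not None:
            segs.append((start, i * frame_len))
            start = None
    if start is not None:
        segs.append((start, n))
    return segs

def speaker_templates(audio, diarized_segs):
    """Average a toy 'embedding' (here just mean energy) over each speaker's
    diarized segments -- mirroring the idea of extracting templates from
    SD output rather than from annotated speaker segments."""
    tmpl = {}
    for spk, (s, e) in diarized_segs:
        tmpl.setdefault(spk, []).append(np.mean(audio[s:e] ** 2))
    return {spk: float(np.mean(v)) for spk, v in tmpl.items()}

# Toy 16 kHz signal: 0.2 s of "speech", 0.2 s of silence, 0.2 s of "speech".
rng = np.random.default_rng(0)
audio = np.concatenate([
    0.5 * rng.standard_normal(3200),
    np.zeros(3200),
    0.5 * rng.standard_normal(3200),
])
segs = vad(audio)
# Pretend SD labeled the first segment speaker A and the second speaker B.
diarized = [("A", segs[0]), ("B", segs[1])]
print(len(segs), sorted(speaker_templates(audio, diarized)))  # 2 ['A', 'B']
```

In the paper's setup, the SA-ASR model then transcribes each VAD segment and assigns speakers by comparing against these templates; fine-tuning the model on VAD segments matches its training and test conditions.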
Related papers
- Exploring the Integration of Speech Separation and Recognition with
Self-Supervised Learning Representation [83.36685075570232]
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
We explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z)
- Improving Target Speaker Extraction with Sparse LDA-transformed Speaker
Embeddings [5.4878772986187565]
We propose a simplified speaker cue with clear class separability for target speaker extraction.
Our proposal shows up to 9.9% relative improvement in SI-SDRi.
With SI-SDRi of 19.4 dB and PESQ of 3.78, our best TSE system significantly outperforms the current SOTA systems.
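SI-SDRi measures the improvement in scale-invariant signal-to-distortion ratio over the unprocessed mixture. A minimal implementation of the underlying SI-SDR metric (the standard definition, not code from the cited paper):

```python
import numpy as np

def si_sdr(est, ref):
    """Scale-invariant SDR in dB: project the estimate onto the reference,
    then compare the power of that target component to the residual."""
    ref = ref - ref.mean()
    est = est - est.mean()
    alpha = np.dot(est, ref) / np.dot(ref, ref)  # optimal scaling factor
    target = alpha * ref
    noise = est - target
    return 10 * np.log10(np.dot(target, target) / np.dot(noise, noise))

rng = np.random.default_rng(0)
ref = rng.standard_normal(1000)
est = ref + 0.01 * rng.standard_normal(1000)  # nearly clean estimate
print(f"{si_sdr(est, ref):.1f} dB")
```

Because the metric is scale-invariant, multiplying the estimate by any constant leaves the score unchanged, which is why it is preferred over plain SNR for separation systems with arbitrary output gain.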
arXiv Detail & Related papers (2023-01-16T06:30:48Z)
- An Experimental Study on Private Aggregation of Teacher Ensemble
Learning for End-to-End Speech Recognition [51.232523987916636]
Differential privacy (DP) is one data protection avenue to safeguard user information used for training deep models by imposing noisy distortion on private data.
In this work, we extend PATE learning to work with dynamic patterns, namely speech, and conduct a first experimental study on ASR aimed at avoiding acoustic data leakage.
arXiv Detail & Related papers (2022-10-11T16:55:54Z)
- A Comparative Study on Speaker-attributed Automatic Speech Recognition
in Multi-party Meetings [53.120885867427305]
Three approaches are evaluated for speaker-attributed automatic speech recognition (SA-ASR) in a meeting scenario.
The WD-SOT approach achieves a 10.7% relative reduction in averaged speaker-dependent character error rate (SD-CER).
The TS-ASR approach also outperforms the FD-SOT approach, bringing a 16.5% relative average SD-CER reduction.
arXiv Detail & Related papers (2022-03-31T06:39:14Z)
- Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number
of Speakers using End-to-End Speaker-Attributed ASR [44.181755224118696]
Transcribe-to-Diarize is a new approach for neural speaker diarization that uses an end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model.
The proposed method achieves a significantly lower diarization error rate than various existing speaker diarization methods when the number of speakers is unknown.
arXiv Detail & Related papers (2021-10-07T02:48:49Z)
- A Comparative Study of Modular and Joint Approaches for
Speaker-Attributed ASR on Monaural Long-Form Audio [45.04646762560459]
Speaker-attributed automatic speech recognition (SA-ASR) is a task to recognize "who spoke what" from multi-talker recordings.
Considering the joint optimization, an end-to-end (E2E) SA-ASR model has recently been proposed with promising results on simulated data.
We present our recent study on the comparison of such modular and joint approaches towards SA-ASR on real monaural recordings.
arXiv Detail & Related papers (2021-07-06T19:36:48Z)
- Self-supervised Text-independent Speaker Verification using Prototypical
Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z)
- Minimum Bayes Risk Training for End-to-End Speaker-Attributed ASR [39.36608236418025]
We propose a speaker-attributed minimum Bayes risk (SA-MBR) training method to minimize the speaker-attributed word error rate (SA-WER) over the training data.
Experiments using the LibriSpeech corpus show that the proposed SA-MBR training reduces the SA-WER by 9.0% relative compared with the SA-MMI-trained model.
arXiv Detail & Related papers (2020-11-03T22:28:57Z)
- Investigation of End-To-End Speaker-Attributed ASR for Continuous
Multi-Talker Recordings [40.99930744000231]
We extend the prior work by addressing the case where no speaker profile is available.
We perform speaker counting and clustering by using the internal speaker representations of the E2E SA-ASR model.
We also propose a simple modification to the reference labels of the E2E SA-ASR training which helps handle continuous multi-talker recordings well.
arXiv Detail & Related papers (2020-08-11T06:41:55Z)
- Target-Speaker Voice Activity Detection: a Novel Approach for
Multi-Speaker Diarization in a Dinner Party Scenario [51.50631198081903]
We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach.
TS-VAD directly predicts the activity of each speaker on each time frame.
Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results.
arXiv Detail & Related papers (2020-05-14T21:24:56Z)
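TS-VAD's per-frame, per-speaker formulation can be sketched as follows; the linear scorer and random "speaker profiles" here are placeholders for the real neural network and its i-vector inputs:

```python
import numpy as np

# Hypothetical sketch of the TS-VAD output formulation: given frame features
# and one embedding per target speaker, predict each speaker's activity on
# each time frame independently.
rng = np.random.default_rng(1)
T, D, S = 100, 16, 4                     # frames, feature dim, target speakers
frames = rng.standard_normal((T, D))     # stand-in for acoustic features
spk_emb = rng.standard_normal((S, D))    # stand-in for i-vector speaker profiles

# A linear scorer + sigmoid in place of the real network.
logits = frames @ spk_emb.T                   # (T, S): one score per frame per speaker
activity = 1.0 / (1.0 + np.exp(-logits))      # independent per-speaker probabilities
decisions = activity > 0.5                    # binary activity per speaker per frame

print(decisions.shape)  # (100, 4)
```

Unlike clustering-based diarization, this formulation handles overlapped speech naturally, since several speakers can be active on the same frame.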
This list is automatically generated from the titles and abstracts of the papers in this site.