Hypothesis Stitcher for End-to-End Speaker-attributed ASR on Long-form
Multi-talker Recordings
- URL: http://arxiv.org/abs/2101.01853v1
- Date: Wed, 6 Jan 2021 03:36:09 GMT
- Title: Hypothesis Stitcher for End-to-End Speaker-attributed ASR on Long-form
Multi-talker Recordings
- Authors: Xuankai Chang, Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng,
Takuya Yoshioka
- Abstract summary: An end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model was proposed recently to jointly perform speaker counting, speech recognition and speaker identification.
The model achieved a low speaker-attributed word error rate (SA-WER) for monaural overlapped speech comprising an unknown number of speakers.
It has yet to be investigated whether the E2E SA-ASR model works well for recordings that are much longer than samples seen during training.
- Score: 42.17790794610591
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: An end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR)
model was proposed recently to jointly perform speaker counting, speech
recognition and speaker identification. The model achieved a low
speaker-attributed word error rate (SA-WER) for monaural overlapped speech
comprising an unknown number of speakers. However, the E2E modeling approach is
susceptible to the mismatch between the training and testing conditions. It has
yet to be investigated whether the E2E SA-ASR model works well for recordings
that are much longer than samples seen during training. In this work, we first
apply a known decoding technique that was developed to perform single-speaker
ASR for long-form audio to our E2E SA-ASR task. Then, we propose a novel method
using a sequence-to-sequence model, called hypothesis stitcher. The model takes
multiple hypotheses obtained from short audio segments that are extracted from
the original long-form input, and it then outputs a fused single hypothesis. We
propose several architectural variations of the hypothesis stitcher model and
compare them with the conventional decoding methods. Experiments using
LibriSpeech and LibriCSS corpora show that the proposed method significantly
improves SA-WER, especially for long-form multi-talker recordings.
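Since the abstract pins down the stitcher's data flow (segment the long recording, decode each segment, join the per-segment hypotheses with a separator token, fuse them with a sequence-to-sequence model), a minimal sketch of that pipeline is given below. The window/hop lengths, the `<sep>` token, and the `sa_asr_decode`/`stitch` callables are illustrative assumptions, not the authors' published API.

```python
# Sketch of segment-then-stitch decoding for long-form SA-ASR.
from typing import Callable, List, Tuple

SEP = "<sep>"        # assumed separator between per-segment hypotheses
WINDOW_SEC = 20.0    # short segments, near the lengths seen in training
HOP_SEC = 15.0       # hop < window, so consecutive segments overlap

def split_into_windows(num_samples: int, sr: int) -> List[Tuple[int, int]]:
    """Cut a long recording into fixed-length, overlapping windows."""
    win, hop = int(WINDOW_SEC * sr), int(HOP_SEC * sr)
    if num_samples <= win:
        return [(0, num_samples)]
    return [(s, min(s + win, num_samples))
            for s in range(0, num_samples - win + hop, hop)]

def decode_long_form(audio, sr: int,
                     sa_asr_decode: Callable[[object], str],
                     stitch: Callable[[str], str]) -> str:
    """Decode overlapping segments, then fuse hypotheses into one output."""
    windows = split_into_windows(len(audio), sr)
    hypotheses = [sa_asr_decode(audio[a:b]) for a, b in windows]
    # The stitcher consumes every segment hypothesis, joined by a separator,
    # and emits a single fused, speaker-attributed transcription.
    return stitch(f" {SEP} ".join(hypotheses))
```

Overlapping the windows is deliberate: duplicated words at segment boundaries give the stitcher anchors for aligning and merging neighboring hypotheses.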
Related papers
- One model to rule them all? Towards End-to-End Joint Speaker Diarization and Speech Recognition [50.055765860343286]
This paper presents a novel framework for joint speaker diarization and automatic speech recognition.
The framework, named SLIDAR, can process arbitrary length inputs and can handle any number of speakers.
Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios.
arXiv Detail & Related papers (2023-10-02T23:03:30Z)
- TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture.
We show that our model achieves excellent separation performance, both with and without transcript conditioning.
We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
arXiv Detail & Related papers (2023-08-21T01:52:01Z)
- Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what".
Our model is based on token-level serialized output training (t-SOT), which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
arXiv Detail & Related papers (2022-03-30T21:42:00Z)
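As a rough illustration of the t-SOT idea named in the entry above: tokens from overlapping talkers are merged into a single stream in emission-time order, with a special channel-change token marking each switch between virtual output channels. The token name and the two-channel setup are assumptions for illustration.

```python
# Rough sketch of t-SOT-style serialization for overlapping speech.
from typing import List, Tuple

CC = "<cc>"  # assumed channel-change token

def serialize_tsot(tokens: List[Tuple[float, int, str]]) -> List[str]:
    """tokens: (time, virtual_channel, token), e.g. from forced alignment."""
    stream, prev_ch = [], None
    for _, ch, tok in sorted(tokens):
        if prev_ch is not None and ch != prev_ch:
            stream.append(CC)        # the talker on the other channel resumes
        stream.append(tok)
        prev_ch = ch
    return stream

# Two overlapping utterances collapse into one chronological stream:
mixed = [(0.0, 0, "hello"), (0.4, 0, "there"), (0.5, 1, "hi"), (0.9, 0, "friend")]
print(serialize_tsot(mixed))  # ['hello', 'there', '<cc>', 'hi', '<cc>', 'friend']
```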
- Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers using End-to-End Speaker-Attributed ASR [44.181755224118696]
Transcribe-to-Diarize is a new approach to neural speaker diarization that uses an end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model.
The proposed method achieves a significantly better diarization error rate than various existing speaker diarization methods when the number of speakers is unknown.
arXiv Detail & Related papers (2021-10-07T02:48:49Z)
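The conversion implied by this entry (speaker-attributed word hypotheses in, diarization segments out) can be sketched as follows; the word-timing input format and the merging gap are assumptions, since the actual method estimates word timings inside the E2E SA-ASR model.

```python
# Illustrative sketch: merge adjacent same-speaker word hypotheses
# (word, start, end, speaker) into diarization segments.
from typing import List, Tuple

Word = Tuple[str, float, float, str]   # (word, start, end, speaker)
Segment = Tuple[str, float, float]     # (speaker, start, end)

def words_to_segments(words: List[Word], max_gap: float = 0.5) -> List[Segment]:
    segments: List[Segment] = []
    for _, start, end, spk in sorted(words, key=lambda w: w[1]):
        if segments and segments[-1][0] == spk and start - segments[-1][2] <= max_gap:
            spk_, seg_start, _ = segments[-1]
            segments[-1] = (spk_, seg_start, end)   # extend the open segment
        else:
            segments.append((spk, start, end))      # start a new segment
    return segments
```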
- GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis [6.632254395574993]
GANSpeech is a high-fidelity multi-speaker TTS model that applies adversarial training to a non-autoregressive multi-speaker TTS model.
In subjective listening tests, GANSpeech significantly outperformed the baseline multi-speaker FastSpeech and FastSpeech2 models.
arXiv Detail & Related papers (2021-06-29T08:15:30Z)
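A hedged sketch of the kind of adversarial objective this entry describes: a discriminator judges generated versus ground-truth mel-spectrograms, and the generator's reconstruction loss gains an adversarial term. The least-squares GAN formulation and the loss weight below are common choices, not necessarily the exact ones in the paper.

```python
# Sketch of an adversarial objective for a non-autoregressive TTS model.
import torch
import torch.nn.functional as F

def lsgan_losses(disc, mel_real, mel_fake):
    """Least-squares GAN losses over mel-spectrograms."""
    d_real = disc(mel_real)
    d_fake = disc(mel_fake.detach())          # detach: don't update generator here
    d_loss = F.mse_loss(d_real, torch.ones_like(d_real)) + \
             F.mse_loss(d_fake, torch.zeros_like(d_fake))
    d_fake_for_g = disc(mel_fake)             # generator tries to fool the critic
    g_adv = F.mse_loss(d_fake_for_g, torch.ones_like(d_fake_for_g))
    return d_loss, g_adv

def generator_loss(mel_fake, mel_target, g_adv, adv_weight=1.0):
    # FastSpeech-style reconstruction loss plus the adversarial term
    return F.l1_loss(mel_fake, mel_target) + adv_weight * g_adv
```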
- Investigation of End-To-End Speaker-Attributed ASR for Continuous Multi-Talker Recordings [40.99930744000231]
We extend the prior work by addressing the case where no speaker profile is available.
We perform speaker counting and clustering by using the internal speaker representations of the E2E SA-ASR model.
We also propose a simple modification to the reference labels used in E2E SA-ASR training, which helps the model handle continuous multi-talker recordings well.
arXiv Detail & Related papers (2020-08-11T06:41:55Z)
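The speaker counting and clustering step lends itself to a short sketch: cluster the model's internal speaker representations and read the speaker count off the number of clusters. The agglomerative clustering and distance threshold below are illustrative choices, not the paper's exact recipe.

```python
# Sketch of speaker counting by clustering internal speaker embeddings.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def count_and_label_speakers(embeddings: np.ndarray, threshold: float = 1.0):
    """embeddings: (num_utterances, dim) internal speaker representations."""
    clust = AgglomerativeClustering(n_clusters=None,
                                    distance_threshold=threshold)
    labels = clust.fit_predict(embeddings)   # cluster id per utterance
    return labels.max() + 1, labels          # (estimated #speakers, labels)
```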
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of self-supervised speech representation models.
We show that Audio ALBERT achieves performance competitive with larger self-supervised models on downstream tasks.
In probing experiments, we find that the latent representations encode richer phoneme and speaker information than those of the last layer.
arXiv Detail & Related papers (2020-05-18T10:42:44Z)
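ALBERT-style models shrink their footprint mainly through cross-layer parameter sharing; the sketch below shows the idea of reusing one transformer encoder layer at every depth. The sizes and depth are placeholders, not Audio ALBERT's actual configuration.

```python
# Sketch of ALBERT-style cross-layer parameter sharing.
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12, depth: int = 12):
        super().__init__()
        # one layer instance, applied `depth` times: ~depth-fold fewer params
        self.layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                batch_first=True)
        self.depth = depth

    def forward(self, x):
        for _ in range(self.depth):
            x = self.layer(x)   # same parameters at every depth
        return x
```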
- Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model.
The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses.
A bidirectional encoder is used to extract context information from first-pass hypotheses.
arXiv Detail & Related papers (2020-03-17T22:01:12Z)
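A minimal sketch of the two-pass flow described above: the second pass scores each first-pass hypothesis while attending to the encoded acoustics, and the best-scoring hypothesis wins. The `second_pass_score` callable is a hypothetical stand-in for the LAS rescorer, which is not reproduced here.

```python
# Sketch of two-pass rescoring with an acoustics-aware second pass.
from typing import Callable, List

def rescore(hypotheses: List[str],
            acoustic_encoding,   # encoder output for the whole utterance
            second_pass_score: Callable[[str, object], float]) -> str:
    """Return the first-pass hypothesis preferred by the second pass."""
    # Unlike text-only correction models, the scorer also sees the acoustics.
    return max(hypotheses, key=lambda h: second_pass_score(h, acoustic_encoding))
```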