BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR
- URL: http://arxiv.org/abs/2305.13716v3
- Date: Thu, 5 Oct 2023 11:44:39 GMT
- Title: BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR
- Authors: Yuhao Liang, Fan Yu, Yangze Li, Pengcheng Guo, Shiliang Zhang, Qian
Chen, Lei Xie
- Abstract summary: Frequent speaker changes can make speaker change prediction difficult.
We propose boundary-aware serialized output training (BA-SOT).
Compared to original SOT, BA-SOT reduces CER/UD-CER by 5.1%/14.0%.
- Score: 54.23941663326509
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The recently proposed serialized output training (SOT) simplifies
multi-talker automatic speech recognition (ASR) by generating speaker
transcriptions separated by a special token. However, frequent speaker changes
can make speaker change prediction difficult. To address this, we propose
boundary-aware serialized output training (BA-SOT), which explicitly
incorporates boundary knowledge into the decoder via a speaker change detection
task and boundary constraint loss. We also introduce a two-stage connectionist
temporal classification (CTC) strategy that incorporates token-level SOT CTC to
restore temporal context information. Besides typical character error rate
(CER), we introduce utterance-dependent character error rate (UD-CER) to
further measure the precision of speaker change prediction. Compared to
original SOT, BA-SOT reduces CER/UD-CER by 5.1%/14.0%, and leveraging a
pre-trained ASR model for BA-SOT model initialization further reduces
CER/UD-CER by 8.4%/19.9%.
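The serialization scheme the abstract describes can be illustrated with a short sketch. The token name `<sc>` and the helper below are illustrative assumptions, not the paper's implementation: utterances are flattened into a single target sequence, with a special token inserted at each speaker change, i.e. the boundaries that BA-SOT's auxiliary detection task is trained to predict.

```python
# Illustrative sketch of SOT-style target serialization (all names here
# are assumptions for illustration, not BA-SOT's actual code).

SC_TOKEN = "<sc>"  # hypothetical speaker-change token

def serialize_sot(utterances):
    """Flatten (start_time, speaker, text) tuples into one SOT target
    string, inserting SC_TOKEN wherever the speaker changes."""
    ordered = sorted(utterances, key=lambda u: u[0])
    pieces, prev_speaker = [], None
    for _, speaker, text in ordered:
        if prev_speaker is not None and speaker != prev_speaker:
            pieces.append(SC_TOKEN)
        pieces.append(text)
        prev_speaker = speaker
    return " ".join(pieces)

mixture = [
    (0.0, "A", "hello there"),
    (1.2, "B", "hi"),
    (2.0, "A", "how are you"),
]
print(serialize_sot(mixture))  # hello there <sc> hi <sc> how are you
```

With frequent speaker turns, the target sequence contains many `<sc>` tokens in close succession, which is exactly the prediction difficulty the abstract motivates.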
Related papers
- SEAL: Speaker Error Correction using Acoustic-conditioned Large Language Models [15.098665255729507]
We introduce a novel acoustic conditioning approach to provide more fine-grained information from the acoustic diarizer to the LLM.
Our approach significantly reduces the speaker error rates by 24-43% across Fisher, Callhome, and RT03-CTS datasets.
arXiv Detail & Related papers (2025-01-14T20:24:12Z)
- Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition [18.50957174600796]
A common solution for automatic speech recognition (ASR) of overlapping speakers is to separate the speech and then perform ASR on the separated signals.
However, the separator currently produces artefacts which often degrade ASR performance.
This paper proposes a transcription-free method for joint training using only audio signals.
arXiv Detail & Related papers (2024-06-13T08:20:58Z)
- Augmenting Transformer-Transducer Based Speaker Change Detection With Token-Level Training Loss [15.304831835680847]
We propose a novel token-based training strategy that improves Transformer-Transducer (T-T) based speaker change detection (SCD) performance.
Due to the sparsity of the speaker changes in the training data, the conventional T-T based SCD model loss leads to sub-optimal detection accuracy.
arXiv Detail & Related papers (2022-11-11T21:09:58Z)
- Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what".
Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
arXiv Detail & Related papers (2022-03-30T21:42:00Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- End-to-end contextual ASR based on posterior distribution adaptation for hybrid CTC/attention system [61.148549738631814]
End-to-end (E2E) speech recognition architectures assemble all components of a traditional speech recognition system into a single model.
Although this simplifies the ASR system, it introduces a contextual ASR drawback: the E2E model performs worse on utterances containing infrequent proper nouns.
We propose adding a contextual bias attention (CBA) module to an attention-based encoder-decoder (AED) model to improve its ability to recognize contextual phrases.
arXiv Detail & Related papers (2022-02-18T03:26:02Z)
- ASR-Aware End-to-end Neural Diarization [15.172086811068962]
We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model.
Three modifications to the Conformer-based EEND architecture are proposed to incorporate the features.
Experiments on the two-speaker English conversations of Switchboard+SRE data sets show that multi-task learning with position-in-word information is the most effective way of utilizing ASR features.
arXiv Detail & Related papers (2022-02-02T21:17:14Z)
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z)
- Adapting End-to-End Speech Recognition for Readable Subtitles [15.525314212209562]
In some use cases such as subtitling, verbatim transcription would reduce output readability given limited screen size and reading time.
We first investigate a cascaded system, where an unsupervised compression model is used to post-edit the transcribed speech.
Experiments show that with limited data, far less than is needed to train a model from scratch, we can adapt a Transformer-based ASR model to incorporate both transcription and compression capabilities.
arXiv Detail & Related papers (2020-05-25T14:42:26Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
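Several of the entries above recast diarization or speaker-change detection as token- or frame-level classification. One common reading of SEND's "single-label prediction" reformulation is power-set encoding, where each frame's set of active speakers maps to exactly one class label; the sketch below is a generic illustration under that assumption, not SEND's actual implementation.

```python
# Generic power-set label encoding for overlapped diarization (an
# illustrative assumption about "single-label prediction", not SEND's code).
from itertools import combinations

def powerset_labels(num_speakers, max_overlap=2):
    """Enumerate speaker subsets (up to max_overlap simultaneous
    speakers) so each frame gets exactly one class label."""
    labels = [()]  # empty subset = silence
    for k in range(1, max_overlap + 1):
        labels.extend(combinations(range(num_speakers), k))
    return {subset: idx for idx, subset in enumerate(labels)}

label_map = powerset_labels(num_speakers=2)
# label_map: {(): 0, (0,): 1, (1,): 2, (0, 1): 3}

# Toy frame sequence: silence, speaker 0, overlap, speaker 1
frame_active = [(), (0,), (0, 1), (1,)]
frame_labels = [label_map[s] for s in frame_active]
print(frame_labels)  # [0, 1, 3, 2]
```

Overlapped frames become ordinary classes under this encoding, which is what turns a multi-label problem into single-label prediction.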
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences.