Disentangling Speakers in Multi-Talker Speech Recognition with Speaker-Aware CTC
- URL: http://arxiv.org/abs/2409.12388v1
- Date: Thu, 19 Sep 2024 01:26:33 GMT
- Title: Disentangling Speakers in Multi-Talker Speech Recognition with Speaker-Aware CTC
- Authors: Jiawen Kang, Lingwei Meng, Mingyu Cui, Yuejiao Wang, Xixin Wu, Xunying Liu, Helen Meng
- Abstract summary: Multi-talker speech recognition faces unique challenges in disentangling and transcribing overlapping speech.
This paper investigates the role of Connectionist Temporal Classification (CTC) in speaker disentanglement when incorporated with Serialized Output Training (SOT) for MTASR.
We propose a novel Speaker-Aware CTC (SACTC) training objective, based on the Bayes risk CTC framework.
- Score: 73.23245793460275
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-talker speech recognition (MTASR) faces unique challenges in disentangling and transcribing overlapping speech. To address these challenges, this paper investigates the role of Connectionist Temporal Classification (CTC) in speaker disentanglement when incorporated with Serialized Output Training (SOT) for MTASR. Our visualization reveals that CTC guides the encoder to represent different speakers in distinct temporal regions of the acoustic embeddings. Leveraging this insight, we propose a novel Speaker-Aware CTC (SACTC) training objective, based on the Bayes risk CTC framework. SACTC is a CTC variant tailored to multi-talker scenarios: it explicitly models speaker disentanglement by constraining the encoder to represent different speakers' tokens at specific time frames. When integrated with SOT, the SOT-SACTC model consistently outperforms standard SOT-CTC across various degrees of speech overlap. Specifically, we observe relative word error rate reductions of 10% overall and 15% on low-overlap speech. This work represents an initial exploration of CTC-based enhancements for MTASR tasks, offering a new perspective on speaker disentanglement in multi-talker speech recognition.
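The sketch below is a minimal, illustrative example of the two ideas the abstract describes: SOT-style serialization of overlapping transcripts with a speaker-change token, and a crude frame-region constraint that nudges each speaker's tokens toward a distinct temporal span before a standard CTC loss is applied. It is not the authors' implementation; the "<sc>" token, the toy vocabulary, the hard half/half frame split, and the masking penalty are all assumptions, and the actual SACTC objective enforces the constraint through Bayes risk CTC path weighting rather than logit masking.

```python
import torch
import torch.nn as nn

SC = "<sc>"  # hypothetical speaker-change token used for SOT serialization

def serialize_sot(transcripts):
    """Join per-speaker transcripts (ordered by speaking start time) with <sc>."""
    return f" {SC} ".join(transcripts)

# Toy token inventory; blank must be index 0 to match nn.CTCLoss(blank=0).
vocab = {"<blank>": 0, SC: 1}

def encode(text):
    ids = []
    for tok in text.split():
        vocab.setdefault(tok, len(vocab))
        ids.append(vocab[tok])
    return ids

# Two overlapping speakers; SOT orders them by who starts speaking first.
reference = serialize_sot(["hello world", "good morning"])
target = torch.tensor([encode(reference)])          # shape (1, label_len)

T, B, V = 50, 1, 32                                 # frames, batch, assumed vocab size
logits = torch.randn(T, B, V)                       # stand-in for encoder outputs

# Crude stand-in for the SACTC constraint: outside the temporal region assigned
# to a speaker, penalize that speaker's (non-blank) tokens so each speaker is
# represented in its own span of frames. The paper instead enforces this via
# Bayes risk CTC, which weights CTC alignment paths rather than masking logits.
spk1 = torch.tensor(encode("hello world"))
spk2 = torch.tensor(encode("good morning"))
penalty = torch.zeros(T, B, V)
penalty[T // 2:, :, spk1] = -5.0                    # speaker 1 -> early frames
penalty[:T // 2, :, spk2] = -5.0                    # speaker 2 -> late frames
log_probs = (logits + penalty).log_softmax(-1)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, target,
           torch.full((B,), T, dtype=torch.long),   # input lengths
           torch.tensor([target.size(1)]))          # target lengths
print(float(loss))
```

In practice the CTC branch would typically sit on the encoder of an attention-based SOT model and be interpolated with the attention loss; the masking above is only meant to convey the "different speakers in different temporal regions" intuition.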
Related papers
- DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding [51.32965203977845]
We propose the use of discrete speech units (DSU) instead of continuous-valued speech encoder outputs.
The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering.
Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
arXiv Detail & Related papers (2024-06-13T17:28:13Z) - End-to-End Single-Channel Speaker-Turn Aware Conversational Speech Translation [23.895122319920997]
We tackle single-channel multi-speaker conversational ST with an end-to-end and multi-task training model.
Speaker-Turn Aware Conversational Speech Translation combines automatic speech recognition, speech translation and speaker turn detection.
We show that our model outperforms the reference systems on the multi-speaker condition, while attaining comparable performance on the single-speaker condition.
arXiv Detail & Related papers (2023-11-01T17:55:09Z) - CTC Alignments Improve Autoregressive Translation [145.90587287444976]
We argue that CTC does in fact make sense for translation if applied in a joint CTC/attention framework.
Our proposed joint CTC/attention models outperform pure-attention baselines across six benchmark translation tasks.
arXiv Detail & Related papers (2022-10-11T07:13:50Z) - Improved Relation Networks for End-to-End Speaker Verification and Identification [0.0]
Speaker identification systems are tasked to identify a speaker amongst a set of enrolled speakers given just a few samples.
We propose improved relation networks for speaker verification and few-shot (unseen) speaker identification.
Inspired by the use of prototypical networks in speaker verification, we train the model to classify samples in the current episode amongst all speakers present in the training set.
arXiv Detail & Related papers (2022-03-31T17:44:04Z) - Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what".
Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
arXiv Detail & Related papers (2022-03-30T21:42:00Z) - Streaming Multi-talker Speech Recognition with Joint Speaker Identification [77.46617674133556]
SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification.
We validate our idea on LibriSpeechMix, a multi-talker dataset derived from LibriSpeech, and present encouraging results.
arXiv Detail & Related papers (2021-04-05T18:37:33Z) - Content-Aware Speaker Embeddings for Speaker Diarisation [3.6398652091809987]
The content-aware speaker embeddings (CASE) approach is proposed.
CASE factorises automatic speech recognition (ASR) from speaker recognition to focus on modelling speaker characteristics.
CASE achieved a 17.8% relative speaker error rate reduction over conventional methods.
arXiv Detail & Related papers (2021-02-12T12:02:03Z) - Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers [38.3469744871394]
We propose an end-to-end speaker-attributed automatic speech recognition model.
It unifies speaker counting, speech recognition, and speaker identification on overlapped speech.
arXiv Detail & Related papers (2020-06-19T02:05:18Z) - Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.