Improving Practical Aspects of End-to-End Multi-Talker Speech Recognition for Online and Offline Scenarios
- URL: http://arxiv.org/abs/2506.14204v1
- Date: Tue, 17 Jun 2025 05:46:38 GMT
- Title: Improving Practical Aspects of End-to-End Multi-Talker Speech Recognition for Online and Offline Scenarios
- Authors: Aswin Shanmugam Subramanian, Amit Das, Naoyuki Kanda, Jinyu Li, Xiaofei Wang, Yifan Gong
- Abstract summary: Serialized Output Training (SOT) addresses practical needs of both streaming and offline automatic speech recognition (ASR) applications. Our approach focuses on balancing latency and accuracy, catering to real-time captioning and summarization requirements.
- Score: 33.271537268488316
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We extend the frameworks of Serialized Output Training (SOT) to address practical needs of both streaming and offline automatic speech recognition (ASR) applications. Our approach focuses on balancing latency and accuracy, catering to real-time captioning and summarization requirements. We propose several key improvements: (1) Leveraging Continuous Speech Separation (CSS) single-channel front-end with end-to-end (E2E) systems for highly overlapping scenarios, challenging the conventional wisdom of E2E versus cascaded setups. The CSS framework improves the accuracy of the ASR system by separating overlapped speech from multiple speakers. (2) Implementing dual models -- Conformer Transducer for streaming and Sequence-to-Sequence for offline -- or alternatively, a two-pass model based on cascaded encoders. (3) Exploring segment-based SOT (segSOT) which is better suited for offline scenarios while also enhancing readability of multi-talker transcriptions.
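For context, a minimal sketch of how an SOT training target can be built: the transcripts of all speakers are concatenated first-in-first-out by utterance start time, separated by a speaker-change token. The `<sc>` token name and the tuple format are illustrative assumptions, not this paper's exact recipe (which additionally covers streaming t-SOT and segment-based segSOT variants).

```python
SC = "<sc>"  # speaker-change token (name is an assumption)

def serialize_sot(utterances):
    """utterances: list of (start_time, speaker_id, transcript) tuples."""
    ordered = sorted(utterances, key=lambda u: u[0])  # FIFO by start time
    return f" {SC} ".join(text for _, _, text in ordered)

mix = [
    (0.0, "A", "hello there"),
    (1.2, "B", "hi how are you"),
    (3.5, "A", "pretty good"),
]
print(serialize_sot(mix))
# -> "hello there <sc> hi how are you <sc> pretty good"
```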
Related papers
- Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems [31.911085541071028]
We propose a low-latency architecture that enables listen-while-thinking and speak-while-thinking. Experiments on two spoken dialogue benchmarks demonstrate that DDTSR reduces response latency by 19%-51%.
arXiv Detail & Related papers (2026-02-26T17:39:56Z) - Optimizing Conversational Quality in Spoken Dialogue Systems with Reinforcement Learning from AI Feedback [82.70507055599093]
We present the first systematic study of preference learning for improving SDS quality in both multi-turn Chain-of-Thought and blockwise duplex models. Experiments show that single-reward RLAIF selectively improves its targeted metric, while joint multi-reward training yields consistent gains across semantic quality and audio naturalness.
arXiv Detail & Related papers (2026-01-27T00:55:14Z) - TagSpeech: End-to-End Multi-Speaker ASR and Diarization with Fine-Grained Temporal Grounding [15.908533215017059]
We present TagSpeech, a unified framework that utilizes Temporal Anchor Grounding for joint multi-speaker ASR and diarization. The framework is built on two key designs: (1) decoupled semantic and speaker streams fine-tuned via Serialized Output Training (SOT) to learn turn-taking dynamics; and (2) an interleaved time anchor mechanism that acts as a synchronization signal between semantic understanding and speaker tracking.
arXiv Detail & Related papers (2026-01-11T12:40:07Z) - Chain-of-Thought Reasoning in Streaming Full-Duplex End-to-End Spoken Dialogue Systems [82.70507055599093]
We propose a Streaming Chain-of-Thought (CoT) framework for Duplex SDS. We create intermediate targets (aligned user transcripts and system responses) for each block. Experiments show that our approach produces more coherent and interpretable responses than existing duplex methods.
arXiv Detail & Related papers (2025-10-02T14:33:05Z) - CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching [79.0241611035794]
CoVoMix2 is a framework for zero-shot multi-talker dialogue generation. It predicts mel-spectrograms from multi-stream transcriptions using a flow-matching-based generative model. Our approach achieves state-of-the-art performance, outperforming strong baselines like MoonCast and Sesame in speech quality, speaker consistency, and inference speed.
arXiv Detail & Related papers (2025-06-01T07:51:45Z) - Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The Cross-Speaker Encoding network addresses limitations of SIMO models by aggregating cross-speaker representations.
The network is integrated with SOT to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z) - Combining TF-GridNet and Mixture Encoder for Continuous Speech Separation for Meeting Transcription [31.774032625780414]
TF-GridNet has shown impressive performance in speech separation in real reverberant conditions. We extend the mixture encoder from a static two-speaker scenario to a natural meeting context. Experiments result in a new state-of-the-art performance on LibriCSS using a single microphone.
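The continuous speech separation (CSS) idea behind this entry and the main paper's front-end can be sketched as sliding-window separation with stitching: each window is separated into a fixed number of output channels, and adjacent windows are aligned by correlating their overlapped region before overlap-add. This is a minimal numpy sketch under stated assumptions: the window/hop sizes, the toy correlation-based permutation alignment, the two-channel limit, and the `separate` callback are all illustrative, not any paper's implementation.

```python
import numpy as np

def css_stitch(audio, separate, win=400, hop=200, n_out=2):
    """Sliding-window CSS stitcher (illustrative; assumes n_out == 2).
    `separate(chunk) -> array of shape (n_out, len(chunk))` is a placeholder
    for a separation network."""
    out = np.zeros((n_out, len(audio)))
    norm = np.zeros(len(audio))
    prev = None
    for start in range(0, max(1, len(audio) - win + 1), hop):
        seg = separate(audio[start:start + win])
        if prev is not None:
            # Align channel permutation with the previous window via
            # correlation over the overlapped region.
            ov = win - hop
            a, b = prev[:, -ov:], seg[:, :ov]
            straight = sum(np.dot(a[c], b[c]) for c in range(n_out))
            swapped = sum(np.dot(a[c], b[n_out - 1 - c]) for c in range(n_out))
            if swapped > straight:
                seg = seg[::-1]  # swap the two channels
        out[:, start:start + win] += seg
        norm[start:start + win] += 1.0
        prev = seg
    return out / np.maximum(norm, 1e-8)  # average the overlapped regions
```

A real system would replace `separate` with a trained separator and feed each stitched channel to the ASR back-end.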
arXiv Detail & Related papers (2023-09-15T14:57:28Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - VarArray Meets t-SOT: Advancing the State of the Art of Streaming Distant Conversational Speech Recognition [36.580955189182404]
This paper presents a novel streaming automatic speech recognition (ASR) framework for multi-talker overlapping speech captured by a distant microphone array with an arbitrary geometry.
Our framework, named t-SOT-VA, capitalizes on two independently developed recent technologies: array-geometry-agnostic continuous speech separation (VarArray) and streaming multi-talker ASR based on token-level serialized output training (t-SOT).
Our system achieves state-of-the-art word error rates of 13.7% and 15.5% on the AMI development and evaluation sets, respectively, in the multiple-distant microphone setting.
arXiv Detail & Related papers (2022-09-12T01:22:04Z) - TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech achieves a significant improvement in inference latency, with up to 21.4x speedup over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z) - Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z) - Streaming Multi-Talker ASR with Token-Level Serialized Output Training [53.11450530896623]
t-SOT is a novel framework for streaming multi-talker automatic speech recognition.
The t-SOT model has the advantages of lower inference cost and a simpler model architecture.
For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost.
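The core t-SOT idea is to serialize the tokens of up to two overlapping speakers chronologically, by emission time, into a single stream, inserting a special token whenever the "virtual channel" switches. A minimal sketch, assuming two virtual channels and an illustrative `<cc>` token name (not the paper's exact notation):

```python
CC = "<cc>"  # virtual-channel-change token (name is an assumption)

def serialize_tsot(tokens):
    """tokens: list of (emission_time, channel, token), channel in {0, 1}."""
    stream, prev_ch = [], 0  # decoding is assumed to start on channel 0
    for _, ch, tok in sorted(tokens, key=lambda t: t[0]):
        if ch != prev_ch:
            stream.append(CC)
            prev_ch = ch
        stream.append(tok)
    return stream

# Two overlapping speakers mapped to virtual channels 0 and 1:
overlapped = [
    (0.0, 0, "hi"), (0.4, 0, "there"),      # speaker on channel 0
    (0.2, 1, "good"), (0.5, 1, "morning"),  # speaker on channel 1
]
print(serialize_tsot(overlapped))
# -> ['hi', '<cc>', 'good', '<cc>', 'there', '<cc>', 'morning']
```

Note that for non-overlapping speech no `<cc>` tokens are emitted, which is why the stream degenerates to ordinary single-talker ASR targets.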
arXiv Detail & Related papers (2022-02-02T01:27:21Z) - Self-Attention Channel Combinator Frontend for End-to-End Multichannel Far-field Speech Recognition [1.0276024900942875]
When sufficiently large far-field training data is available, jointly optimizing a multichannel frontend and an end-to-end (E2E) Automatic Speech Recognition (ASR) backend shows promising results.
Recent literature has shown that traditional beamformer designs, such as MVDR (Minimum Variance Distortionless Response) or fixed beamformers, can be successfully integrated into an E2E ASR system with learnable parameters.
We propose the self-attention channel combinator (SACC) ASR frontend, which leverages the self-attention mechanism to combine multichannel audio signals in the magnitude spectral domain.
arXiv Detail & Related papers (2021-09-10T11:03:43Z) - Two-Stage Augmentation and Adaptive CTC Fusion for Improved Robustness of Multi-Stream End-to-End ASR [35.7018440502825]
In a multi-stream paradigm, improving robustness requires handling a variety of unseen single-stream conditions as well as inter-stream dynamics.
We introduce a two-stage augmentation scheme focusing on mismatch scenarios.
Compared with the previous training strategy, substantial improvements are reported with relative word error rate reductions of 29.7-59.3%.
arXiv Detail & Related papers (2021-02-05T08:36:58Z) - Streaming end-to-end multi-talker speech recognition [34.76106500736099]
We propose the Streaming Unmixing and Recognition Transducer (SURT) for end-to-end multi-talker speech recognition.
Our model employs the Recurrent Neural Network Transducer (RNN-T) as its backbone, which can meet various latency constraints.
Based on experiments on the publicly available LibriSpeechMix dataset, we show that HEAT (Heuristic Error Assignment Training) can achieve better accuracy compared with PIT (Permutation Invariant Training).
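The PIT criterion that this entry compares against can be stated compactly: score every model output against every reference, then take the minimum total loss over all output-to-reference assignments. A small sketch under stated assumptions (the brute-force search over permutations is illustrative and only practical for a handful of speakers):

```python
from itertools import permutations

def pit_loss(pairwise):
    """pairwise[i][j]: loss of model output i scored against reference j.
    Returns the minimum total loss over all output-to-reference
    assignments, i.e. the permutation-invariant training criterion."""
    n = len(pairwise)
    return min(sum(pairwise[i][perm[i]] for i in range(n))
               for perm in permutations(range(n)))

# Two outputs vs. two references: the identity assignment costs 1.0 + 2.0,
# the swapped assignment costs 5.0 + 6.0, so PIT picks the former.
print(pit_loss([[1.0, 5.0], [6.0, 2.0]]))  # -> 3.0
```

HEAT, by contrast, fixes the assignment heuristically (e.g. by utterance start time) instead of searching over permutations, which is what makes it cheaper during training.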
arXiv Detail & Related papers (2020-11-26T06:28:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.