CMT-LLM: Contextual Multi-Talker ASR Utilizing Large Language Models

URL: http://arxiv.org/abs/2506.12059v1
Date: Sat, 31 May 2025 07:26:44 GMT
Title: CMT-LLM: Contextual Multi-Talker ASR Utilizing Large Language Models
Authors: Jiajun He, Naoki Sawada, Koichi Miyazaki, Tomoki Toda
Abstract summary: We propose a unified framework that combines multi-talker overlapping speech recognition and contextual biasing into a single task. Our approach outperforms traditional contextual biasing methods, achieving a WER of 7.9% on LibriMix and 32.9% on AMI SDM.
Score: 23.278483193586887
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In real-world applications, automatic speech recognition (ASR) systems must handle overlapping speech from multiple speakers and recognize rare words like technical terms. Traditional methods address multi-talker ASR and contextual biasing separately, limiting performance in complex scenarios. We propose a unified framework that combines multi-talker overlapping speech recognition and contextual biasing into a single task. Our ASR method integrates pretrained speech encoders and large language models (LLMs), using optimized finetuning strategies. We also introduce a two-stage filtering algorithm to efficiently identify relevant rare words from large biasing lists and incorporate them into the LLM's prompt input, enhancing rare word recognition. Experiments show that our approach outperforms traditional contextual biasing methods, achieving a WER of 7.9% on LibriMix and 32.9% on AMI SDM when the biasing size is 1,000, demonstrating its effectiveness in complex speech scenarios.
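The results above are reported as word error rate (WER). As a reminder of what that figure measures, here is a minimal sketch of the standard single-stream WER: word-level edit distance divided by the number of reference words. The function name word_error_rate is an illustrative choice, not something taken from the paper, and the paper's actual scoring setup is not described in the abstract. Multi-talker evaluation usually needs extra handling of speaker order, which this sketch does not attempt.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative sketch only: whitespace tokenization, no text
    normalization, and a single reference stream are all assumptions.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("turn on the lights", "turn on the light"))
```

Because insertions count as errors, WER computed this way can exceed 100% when the hypothesis is much longer than the reference.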

Related papers

SEAL: Speech Embedding Alignment Learning for Speech Large Language Model with Retrieval-Augmented Generation [10.828717295018123]
We propose a unified embedding framework that eliminates the need for intermediate text representations. Our model reduces pipeline latency by 50% while achieving higher retrieval accuracy compared to traditional two-stage methods.
arXiv Detail & Related papers (2025-01-26T15:04:02Z)
MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models [59.80042864360884]
Speaker-attributed automatic speech recognition (SA-ASR) aims to transcribe speech while assigning transcripts to the corresponding speakers accurately. This paper introduces a novel approach, leveraging a frozen multilingual ASR model to incorporate speaker attribution into the transcriptions.
arXiv Detail & Related papers (2024-11-27T09:01:08Z)
Large Language Models are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities. We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities. We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.79% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
Advancing Multi-talker ASR Performance with Large Language Models [48.52252970956368]
Recognizing overlapping speech from multiple speakers in conversational scenarios is one of the most challenging problems for automatic speech recognition (ASR). In this paper, we propose an LLM-based SOT approach for multi-talker ASR, leveraging a pre-trained speech encoder and LLM. Our approach surpasses traditional AED-based methods on the simulated dataset LibriMix and achieves state-of-the-art performance on the evaluation set of the real-world dataset AMI.
arXiv Detail & Related papers (2024-08-30T17:29:25Z)
Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System [73.34663391495616]
We propose a pioneering approach to tackle joint multi-talker and target-talker speech recognition tasks. Specifically, we freeze Whisper and plug a Sidecar separator into its encoder to separate mixed embeddings for multiple talkers. We deliver acceptable zero-shot performance on multi-talker ASR on the AishellMix Mandarin dataset.
arXiv Detail & Related papers (2024-07-13T09:28:24Z)
OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment [57.15449072423539]
We propose a training system, Open-modality Speech Recognition (OpenSR). OpenSR enables modality transfer from one to any in three different settings. It achieves highly competitive zero-shot performance compared to the existing few-shot and full-shot lip-reading methods.
arXiv Detail & Related papers (2023-06-10T11:04:10Z)
Unified Modeling of Multi-Talker Overlapped Speech Recognition and Diarization with a Sidecar Separator [42.8787280791491]
Multi-talker overlapped speech poses a significant challenge for speech recognition and diarization. We propose a cost-effective method to convert a single-talker automatic speech recognition system into a multi-talker one. We incorporate a diarization branch into the Sidecar, allowing for unified modeling of both ASR and diarization with a negligible overhead of only 768 parameters.
arXiv Detail & Related papers (2023-05-25T17:18:37Z)
Simulating realistic speech overlaps improves multi-talker ASR [36.39193360559079]
We propose an improved technique to simulate multi-talker overlapping speech with realistic speech overlaps. With this representation, speech overlapping patterns can be learned from real conversations based on a statistical language model, such as N-gram. In our experiments, multi-talker ASR models trained with the proposed method show consistent improvement on the word error rates across multiple datasets.
arXiv Detail & Related papers (2022-10-27T18:29:39Z)
Multi-task Language Modeling for Improving Speech Recognition of Rare Words [14.745696312889763]
We propose a second-pass system with multi-task learning, utilizing semantic targets (such as intent and slot prediction) to improve speech recognition performance. Our best ASR system with the multi-task LM shows a 4.6% word error rate reduction (WERR) compared with an RNN-Transducer-only ASR baseline for rare word recognition.
arXiv Detail & Related papers (2020-11-23T20:40:44Z)
Multi-talker ASR for an unknown number of sources: Joint training of source counting, separation and ASR [91.87500543591945]
We develop an end-to-end multi-talker automatic speech recognition system for an unknown number of active speakers. Our experiments show very promising performance in counting accuracy, source separation and speech recognition. Our system generalizes well to a larger number of speakers than it ever saw during training.
arXiv Detail & Related papers (2020-06-04T11:25:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.