Serialized Output Prompting for Large Language Model-based Multi-Talker Speech Recognition
- URL: http://arxiv.org/abs/2509.04488v1
- Date: Mon, 01 Sep 2025 03:10:14 GMT
- Title: Serialized Output Prompting for Large Language Model-based Multi-Talker Speech Recognition
- Authors: Hao Shi, Yusuke Fujita, Tomoya Mizumoto, Lianbo Liu, Atsushi Kojima, Yui Sudo
- Abstract summary: Existing multi-talker (MT) automatic speech recognition (ASR) systems either omit prompts or rely on simple task-definition prompts. We propose extracting serialized output prompts (SOP) and explicitly guiding the LLM using structured prompts to improve system performance. The proposed SOP approach significantly improved performance under both two- and three-talker conditions.
- Score: 27.16462631523899
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prompts are crucial for task definition and for improving the performance of large language model (LLM)-based systems. However, existing LLM-based multi-talker (MT) automatic speech recognition (ASR) systems either omit prompts or rely on simple task-definition prompts, and no prior work has explored prompt design for enhancing performance. In this paper, we propose extracting serialized output prompts (SOP) and explicitly guiding the LLM with structured prompts to improve system performance (SOP-MT-ASR). Separator and serialized Connectionist Temporal Classification (CTC) layers are inserted after the speech encoder to separate and extract MT content from the mixed speech encoding in a first-speaking-first-out manner. The SOP, which serves as a prompt for the LLM, is then obtained by decoding the serialized CTC outputs using greedy search. To train the model effectively, we design a three-stage training strategy consisting of serialized output training (SOT) fine-tuning, serialized speech information extraction, and SOP-based adaptation. Experimental results on the LibriMix dataset show that, although the LLM-based SOT model performs well in the two-talker scenario, it fails to fully leverage the LLM under more complex conditions, such as the three-talker scenario. The proposed SOP approach significantly improved performance under both two- and three-talker conditions.
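The prompt-extraction step the abstract describes rests on standard greedy CTC decoding (take the per-frame argmax, collapse consecutive repeats, drop blanks). The sketch below illustrates that rule and how a decoded token stream with a speaker-change marker could be rendered as a serialized, first-speaking-first-out text prompt. The vocabulary, the `<sc>` marker, and the rendering with `|` are illustrative assumptions, not the paper's actual design.

```python
# Hedged sketch: greedy decoding of serialized CTC outputs into a text
# prompt (SOP). Token IDs, the blank symbol, and the <sc> speaker-change
# marker are illustrative assumptions, not the paper's vocabulary.

BLANK = 0
VOCAB = {1: "hello", 2: "world", 3: "<sc>", 4: "good", 5: "morning"}

def ctc_greedy_decode(frame_ids):
    """Collapse consecutive repeats, then drop blanks (standard CTC rule)."""
    tokens = []
    prev = None
    for tid in frame_ids:
        if tid != prev and tid != BLANK:
            tokens.append(tid)
        prev = tid
    return tokens

def to_serialized_prompt(frame_ids):
    """Render decoded tokens first-speaking-first-out, splitting at <sc>."""
    words = [VOCAB[t] for t in ctc_greedy_decode(frame_ids)]
    return " ".join(words).replace(" <sc> ", " | ")  # '|' marks a speaker change

# Per-frame argmax IDs for a hypothetical two-talker mixture:
frames = [0, 1, 1, 0, 2, 0, 3, 3, 0, 4, 0, 5, 5]
print(to_serialized_prompt(frames))  # → "hello world | good morning"
```

In the paper's pipeline the decoded text would be inserted into the LLM's prompt rather than printed, but the collapse-and-drop rule is the same.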
Related papers
- DICE: Structured Reasoning in LLMs through SLM-Guided Chain-of-Thought Correction [29.22321321753093]
Large language models (LLMs) often prioritize reasoning over adherence to detailed instructions. Fine-tuning LLMs on supervised datasets to address this is impractical due to high computational costs and limited parameter access. We propose DICE, a lightweight framework that guides small language models (SLMs) to refine LLMs' outputs through chain-of-thought (CoT) correction.
arXiv Detail & Related papers (2025-10-10T09:45:35Z)
- MLLM-Guided VLM Fine-Tuning with Joint Inference for Zero-Shot Composed Image Retrieval [50.062817677022586]
Zero-Shot Composed Image Retrieval (ZS-CIR) methods typically train adapters that convert reference images into pseudo-text tokens. We propose MLLM-Guided VLM Fine-Tuning with Joint Inference (MVFT-JI) to construct two complementary training tasks using only unlabeled images.
arXiv Detail & Related papers (2025-05-26T08:56:59Z)
- LegoSLM: Connecting LLM with Speech Encoder using CTC Posteriors [22.845623101142483]
We propose a new paradigm, LegoSLM, that bridges speech encoders and Large Language Models (LLMs). Using the well-performing USM and Gemma models as an example, we demonstrate that our proposed LegoSLM method yields good performance on both ASR and speech translation tasks.
arXiv Detail & Related papers (2025-05-16T15:15:19Z)
- Retrieval-Enhanced Few-Shot Prompting for Speech Event Extraction [0.0]
Speech Event Extraction (SpeechEE) is a challenging task that lies at the intersection of Automatic Speech Recognition (ASR) and Natural Language Processing (NLP). We present a modular, pipeline-based SpeechEE framework that integrates high-performance ASR with semantic search-enhanced prompting of Large Language Models (LLMs). Our results demonstrate that pipeline approaches, when empowered by retrieval-augmented LLMs, can rival or exceed end-to-end systems.
arXiv Detail & Related papers (2025-04-30T07:10:10Z)
- LLM-AutoDiff: Auto-Differentiate Any LLM Workflow [58.56731133392544]
We introduce LLM-AutoDiff, a novel framework for Automatic Prompt Engineering (APE). LLM-AutoDiff treats each textual input as a trainable parameter and uses a frozen backward engine to generate feedback akin to textual gradients. It consistently outperforms existing textual-gradient baselines in both accuracy and training cost.
arXiv Detail & Related papers (2025-01-28T03:18:48Z)
- Advancing Multi-talker ASR Performance with Large Language Models [48.52252970956368]
Recognizing overlapping speech from multiple speakers in conversational scenarios is one of the most challenging problems for automatic speech recognition (ASR).
In this paper, we propose an LLM-based SOT approach for multi-talker ASR, leveraging a pre-trained speech encoder and an LLM.
Our approach surpasses traditional AED-based methods on the simulated dataset LibriMix and achieves state-of-the-art performance on the evaluation set of the real-world dataset AMI.
arXiv Detail & Related papers (2024-08-30T17:29:25Z)
- LARR: Large Language Model Aided Real-time Scene Recommendation with Semantic Understanding [19.510385758079966]
This paper introduces Large Language Model Aided Real-time Scene Recommendation (LARR).
arXiv Detail & Related papers (2024-08-21T10:56:26Z)
- Prompting Large Language Models with Audio for General-Purpose Speech Summarization [13.415189715216354]
We introduce a framework for speech summarization that leverages the processing and reasoning capabilities of large language models (LLMs).
We propose an end-to-end system that combines an instruction-tuned LLM with an audio encoder that converts speech into token representations that the LLM can interpret.
arXiv Detail & Related papers (2024-06-10T02:04:28Z)
- Speech Translation with Large Language Models: An Industrial Practice [64.5419534101104]
We introduce LLM-ST, a novel and effective speech translation model constructed upon a pre-trained large language model (LLM).
By integrating the large language model (LLM) with a speech encoder and employing multi-task instruction tuning, LLM-ST can produce accurate timestamped transcriptions and translations.
Through rigorous experimentation on English and Chinese datasets, we showcase the exceptional performance of LLM-ST.
arXiv Detail & Related papers (2023-12-21T05:32:49Z)
- Harnessing the Zero-Shot Power of Instruction-Tuned Large Language Model in End-to-End Speech Recognition [23.172469312225694]
We propose to utilize an instruction-tuned large language model (LLM) for guiding the text generation process in automatic speech recognition (ASR). The proposed model is built on the joint CTC and attention architecture, with the LLM serving as a front-end feature extractor for the decoder. Experimental results show that the proposed LLM-guided model achieves a relative gain of approximately 13% in word error rates across major benchmarks.
arXiv Detail & Related papers (2023-09-19T11:10:50Z)
- LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
- ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction [56.790794611002106]
Large language models (LLMs) have demonstrated remarkable results in various natural language processing (NLP) tasks with in-context learning.
We propose a simple but effective in-context learning framework called ICL-D3IE.
Specifically, we extract the most difficult and distinct segments from hard training documents as hard demonstrations.
arXiv Detail & Related papers (2023-03-09T06:24:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.