Boosting CTC-Based ASR Using LLM-Based Intermediate Loss Regularization
- URL: http://arxiv.org/abs/2506.22846v1
- Date: Sat, 28 Jun 2025 10:59:42 GMT
- Title: Boosting CTC-Based ASR Using LLM-Based Intermediate Loss Regularization
- Authors: Duygu Altinok
- Abstract summary: We propose a novel auxiliary loss framework called Language-Aware Intermediate Loss (LAIL) to enhance CTC-based automatic speech recognition systems. LAIL maps outputs to the embedding space of large language models (LLMs) and computes a causal language modeling loss during training. We demonstrate significant improvements in Word Error Rate (WER) on the LibriSpeech, TEDLIUM2, and WSJ corpora, achieving state-of-the-art performance for CTC-based ASR with minimal computational overhead.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: End-to-end (E2E) automatic speech recognition (ASR) systems have revolutionized the field by integrating all components into a single neural network, with attention-based encoder-decoder models achieving state-of-the-art performance. However, their autoregressive decoding process limits inference speed, making them unsuitable for real-time applications. In contrast, CTC-based models offer faster, non-autoregressive decoding but struggle to model linguistic dependencies effectively. Addressing this challenge, we propose a novel auxiliary loss framework called Language-Aware Intermediate Loss (LAIL) to enhance CTC-based ASR using the linguistic knowledge of large language models (LLMs). By attaching connector layers to intermediate encoder layers, LAIL maps outputs to the embedding space of an LLM and computes a causal language modeling loss during training. This approach enhances linguistic modeling while preserving the computational efficiency of CTC decoding. Using the Conformer architecture and various LLaMA models, we demonstrate significant improvements in Word Error Rate (WER) on the LibriSpeech, TEDLIUM2, and WSJ corpora, achieving state-of-the-art performance for CTC-based ASR with minimal computational overhead.
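The abstract describes the mechanism concretely enough for a rough illustration. Below is a minimal PyTorch sketch of the idea, assuming a HuggingFace-style causal LM interface (`get_input_embeddings`, `inputs_embeds`); the connector architecture, the acoustic-prefix conditioning, and the loss weighting are illustrative assumptions, not details taken from the paper.

```python
# Sketch of the LAIL idea: a connector maps intermediate Conformer encoder
# states into an LLM's embedding space, and a causal LM loss over the
# transcript regularizes training alongside the usual CTC loss.
import torch
import torch.nn as nn

class LAILConnector(nn.Module):
    """Projects encoder states (d_enc) into the LLM embedding space (d_llm).
    The two-layer shape is an assumption, not the paper's design."""
    def __init__(self, d_enc: int, d_llm: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_enc, d_llm),
            nn.GELU(),
            nn.Linear(d_llm, d_llm),
        )

    def forward(self, enc_states: torch.Tensor) -> torch.Tensor:
        return self.proj(enc_states)  # (B, T, d_llm)

def lail_loss(llm, connector, enc_states, token_ids):
    """Causal LM loss: prepend projected acoustic states to the transcript
    embeddings and score the transcript tokens with the (frozen) LLM."""
    prefix = connector(enc_states)                   # (B, T, d_llm)
    tok_emb = llm.get_input_embeddings()(token_ids)  # (B, L, d_llm)
    inputs = torch.cat([prefix, tok_emb], dim=1)
    out = llm(inputs_embeds=inputs).logits           # (B, T+L, V)
    # Positions T-1 .. T+L-2 predict tokens 0 .. L-1 of the transcript;
    # the acoustic prefix itself is not scored.
    logits = out[:, prefix.size(1) - 1 : -1, :]
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), token_ids.reshape(-1)
    )

# Hypothetical training combination over chosen intermediate layers:
# loss = ctc_loss + lail_weight * sum(lail_loss(llm, conn[i], enc[i], y)
#                                     for i in intermediate_layers)
```

At inference time only the CTC branch is used, so the connector and LLM add no decoding cost; this is consistent with the abstract's claim of preserving non-autoregressive efficiency.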
Related papers
- Large-Scale Model Enabled Semantic Communication Based on Robust Knowledge Distillation [53.16213723669751]
Large-scale models (LSMs) can be an effective framework for semantic representation and understanding. However, their direct deployment is often hindered by high computational complexity and resource requirements. This paper proposes a novel knowledge distillation based semantic communication framework.
arXiv Detail & Related papers (2025-08-04T07:47:18Z)
- R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning [60.37610817226533]
Chain-of-thought (CoT) reasoning encourages step-by-step intermediate reasoning during inference. CoT introduces substantial computational overhead due to its reliance on autoregressive decoding over long token sequences. We present R-Stitch, a token-level, confidence-based hybrid decoding framework that accelerates CoT inference.
arXiv Detail & Related papers (2025-07-23T08:14:36Z)
- NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding [54.88765757043535]
This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference. Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types with less than 7% computational overhead. The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search in out-of-domain scenarios while avoiding the significant slowdown caused by beam search.
arXiv Detail & Related papers (2025-05-28T20:43:10Z)
- Efficient Knowledge Feeding to Language Models: A Novel Integrated Encoder-Decoder Architecture [0.0]
ICV recasts in-context learning by using latent embeddings of language models. ICV directly integrates information into the model, enabling it to process this information more effectively.
arXiv Detail & Related papers (2025-02-07T04:24:07Z)
- CTC-Assisted LLM-Based Contextual ASR [40.6542391788212]
We propose a CTC-Assisted LLM-Based Contextual ASR model with an efficient filtering algorithm.
Our model attains WER/B-WER of 1.27%/3.67% and 2.72%/8.02% on the LibriSpeech test-clean and test-other sets, targeting the recognition of rare long-tail words.
arXiv Detail & Related papers (2024-11-10T11:47:50Z)
- E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning [20.660297311025417]
We introduce E2LLM (Encoder Elongated Large Language Models), a novel approach that effectively navigates the "impossible triangle" of long-context performance, efficiency, and compatibility with pretrained models.
The method involves splitting long contexts into chunks, compressing each into embedding vectors via a pretrained text encoder, and utilizing an adapter to align these representations with a decoder-only LLM.
Experimental results demonstrate that E2LLM achieves superior performance in long-context scenarios while balancing efficiency, performance, and compatibility with pretrained models.
arXiv Detail & Related papers (2024-09-10T17:44:35Z)
- PRoDeliberation: Parallel Robust Deliberation for End-to-End Spoken Language Understanding [44.77985942208969]
PRoDeliberation is a novel method leveraging a Connectionist Temporal Classification-based decoding strategy as well as a denoising objective to train robust non-autoregressive deliberation models.
We show that PRoDeliberation achieves the latency reduction of parallel decoding (2-10x improvement over autoregressive models) while retaining the ability to correct Automatic Speech Recognition (ASR) mistranscriptions.
arXiv Detail & Related papers (2024-06-12T02:46:17Z)
- Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs [57.27982780697922]
Large language models have demonstrated exceptional capability in natural language understanding and generation.
However, their generation speed is limited by the inherently sequential nature of their decoding process.
This paper introduces Lexical Unit Decoding, a novel decoding methodology implemented in a data-driven manner.
arXiv Detail & Related papers (2024-05-24T04:35:13Z)
- Speculative Contrastive Decoding [55.378200871224074]
Large language models (LLMs) exhibit exceptional performance in language tasks, yet their auto-regressive inference is limited by high computational requirements and is sub-optimal due to exposure bias.
Inspired by speculative decoding and contrastive decoding, we introduce Speculative Contrastive Decoding (SCD), a straightforward yet powerful decoding approach.
arXiv Detail & Related papers (2023-11-15T14:15:30Z)
- A Lexical-aware Non-autoregressive Transformer-based ASR Model [9.500518278458905]
We propose a lexical-aware non-autoregressive Transformer-based (LA-NAT) ASR framework, which consists of an acoustic encoder, a speech-text shared encoder, and a speech-text shared decoder.
LA-NAT aims to make the ASR model aware of lexical information, so the resulting model is expected to achieve better results by leveraging the learned linguistic knowledge.
arXiv Detail & Related papers (2023-05-18T09:50:47Z)
- Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates [59.678108707409606]
We propose Fast-MD, a fast MD model that generates hidden intermediates (HI) by non-autoregressive decoding based on connectionist temporal classification (CTC) outputs, followed by an ASR decoder.
Fast-MD achieved about 2x and 4x faster decoding speed than the naïve MD model on GPU and CPU, respectively, with comparable translation quality.
arXiv Detail & Related papers (2021-09-27T05:21:30Z)
- Non-autoregressive End-to-end Speech Translation with Parallel Autoregressive Rescoring [83.32560748324667]
This article describes an efficient end-to-end speech translation (E2E-ST) framework based on non-autoregressive (NAR) models.
We propose a unified NAR E2E-ST framework called Orthros, which has an NAR decoder and an auxiliary shallow AR decoder on top of the shared encoder.
arXiv Detail & Related papers (2021-09-09T16:50:16Z)
- Intermediate Loss Regularization for CTC-based Speech Recognition [58.33721897180646]
We present a simple and efficient auxiliary loss function for automatic speech recognition (ASR) based on the connectionist temporal classification (CTC) objective; a minimal sketch of the idea follows this list.
We evaluate the proposed method on various corpora, reaching a word error rate (WER) of 9.9% on the WSJ corpus and a character error rate (CER) of 5.2% on the AISHELL-1 corpus.
arXiv Detail & Related papers (2021-02-05T15:01:03Z)
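The last entry above is the direct precursor of LAIL: the standard CTC loss is additionally applied to the output of an intermediate encoder layer, and the two losses are mixed. A minimal PyTorch sketch follows; the layer choice and the mixing weight `alpha` are illustrative assumptions rather than values taken from the paper.

```python
# Sketch of intermediate CTC loss regularization: apply CTC to both the
# final encoder output and an intermediate layer's output, then mix.
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def intermediate_ctc_loss(final_logits, inter_logits, targets,
                          input_lens, target_lens, alpha=0.3):
    """Mix CTC over the final layer with CTC over an intermediate layer.
    Logits are (T, B, V) and are log-softmaxed before nn.CTCLoss."""
    loss_final = ctc(final_logits.log_softmax(-1), targets,
                     input_lens, target_lens)
    loss_inter = ctc(inter_logits.log_softmax(-1), targets,
                     input_lens, target_lens)
    return (1 - alpha) * loss_final + alpha * loss_inter
```

LAIL keeps this attach-a-loss-to-intermediate-layers structure but replaces the auxiliary CTC objective with an LLM-based causal language modeling loss, which is where the linguistic knowledge enters.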
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.