MTLM: Incorporating Bidirectional Text Information to Enhance Language Model Training in Speech Recognition Systems
- URL: http://arxiv.org/abs/2502.10058v2
- Date: Sat, 14 Jun 2025 12:43:36 GMT
- Title: MTLM: Incorporating Bidirectional Text Information to Enhance Language Model Training in Speech Recognition Systems
- Authors: Qingliang Meng, Pengju Ren, Tian Li, Changsong Dai, Huizhi Liang
- Abstract summary: MTLM is a novel training paradigm that unifies unidirectional and bidirectional manners through three training objectives. It supports multiple decoding strategies, including shallow fusion and unidirectional/bidirectional n-best rescoring. Experiments on the LibriSpeech dataset show that MTLM consistently outperforms unidirectional training across multiple decoding strategies.
- Score: 8.971049629873185
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic speech recognition (ASR) systems normally consist of an acoustic model (AM) and a language model (LM). The acoustic model estimates the probability distribution of text given the input speech, while the language model calibrates this distribution toward a specific knowledge domain to produce the final transcription. Traditional ASR-specific LMs are typically trained in a unidirectional (left-to-right) manner to align with autoregressive decoding. However, this restricts the model from leveraging the right-side context during training, limiting its representational capacity. In this work, we propose MTLM, a novel training paradigm that unifies unidirectional and bidirectional manners through three training objectives: ULM, BMLM, and UMLM. This approach enhances the LM's ability to capture richer linguistic patterns from both left and right contexts while preserving compatibility with standard ASR autoregressive decoding methods. As a result, the MTLM model not only enhances the ASR system's performance but also supports multiple decoding strategies, including shallow fusion and unidirectional/bidirectional n-best rescoring. Experiments on the LibriSpeech dataset show that MTLM consistently outperforms unidirectional training across multiple decoding strategies, highlighting its effectiveness and flexibility in ASR applications.
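The abstract names shallow fusion and unidirectional/bidirectional n-best rescoring as the decoding strategies the trained LM plugs into. Below is a minimal illustrative sketch of how these two strategies are typically wired up; the function names, the toy unigram LM, and the interpolation weight are assumptions made for illustration, not the paper's implementation.

```python
# Minimal sketch of shallow fusion and n-best rescoring with an external LM.
# The toy unigram "LM", the weight values, and all names here are illustrative
# assumptions, not code from the MTLM paper.
import math
from typing import Dict, List, Sequence, Tuple


def shallow_fusion_step(am_logprobs: Dict[str, float],
                        lm_logprobs: Dict[str, float],
                        lm_weight: float = 0.3) -> Dict[str, float]:
    """One beam-search step: combine per-token AM and LM log-probabilities,
    score(token) = log P_AM(token | speech) + lm_weight * log P_LM(token | history)."""
    return {tok: am + lm_weight * lm_logprobs.get(tok, -math.inf)
            for tok, am in am_logprobs.items()}


def nbest_rescore(nbest: List[Tuple[List[str], float]],
                  lm_score,
                  lm_weight: float = 0.3) -> List[Tuple[List[str], float]]:
    """Re-rank an n-best list of (hypothesis, AM score) pairs with an LM.
    A unidirectional LM scores each hypothesis left to right; a bidirectional
    (masked) LM can instead return a pseudo-log-likelihood computed from both
    contexts. Either way, the interface below is the same."""
    rescored = [(hyp, am + lm_weight * lm_score(hyp)) for hyp, am in nbest]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# A toy unigram "LM" so the example runs end to end.
UNIGRAM = {"the": -1.0, "cat": -2.0, "sat": -2.3, "hat": -3.5}


def toy_lm_score(hyp: Sequence[str]) -> float:
    return sum(UNIGRAM.get(tok, -6.0) for tok in hyp)


if __name__ == "__main__":
    nbest = [(["the", "hat", "sat"], -3.9), (["the", "cat", "sat"], -4.1)]
    for hyp, score in nbest_rescore(nbest, toy_lm_score):
        print(" ".join(hyp), round(score, 2))
```

In an actual system the toy unigram would be replaced by the trained MTLM, scored in either its unidirectional or bidirectional mode.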
Related papers
- LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization [8.365515332927444]
Recent speech tokenization approaches aim to isolate semantic information from low-level acoustics to better align with language models. We propose LM-SPT, a speech tokenization method that introduces a novel semantic distillation. We show that LM-SPT achieves superior reconstruction fidelity compared to baselines.
arXiv Detail & Related papers (2025-06-20T04:15:14Z) - From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z) - Transferring Textual Preferences to Vision-Language Understanding through Model Merging [65.41765072566287]
This paper explores a training-free alternative by merging text-based reward models (RMs) with large vision-language models (LVLMs). Our approach shows that integrating these models leads to improved performance over LVLMs' scoring and text-based RMs.
arXiv Detail & Related papers (2025-02-19T07:20:07Z) - Transducer-Llama: Integrating LLMs into Streamable Transducer-based Speech Recognition [26.79555533538622]
This paper proposes a novel model architecture, Transducer-Llama, that integrates large language models (LLMs) into a Factorized Transducer (FT) model. The proposed streaming Transducer-Llama approach gave a 17% relative WER reduction (WERR) over a strong FT baseline and a 32% WERR over an RNN-T baseline.
arXiv Detail & Related papers (2024-12-21T03:35:49Z) - DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs). We present a simple yet effective automatic process for creating speech-text pair data. Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z) - Large Language Models are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities. We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities. We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.79% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z) - Advancing Multi-talker ASR Performance with Large Language Models [48.52252970956368]
Recognizing overlapping speech from multiple speakers in conversational scenarios is one of the most challenging problems for automatic speech recognition (ASR).
In this paper, we propose an LLM-based serialized output training (SOT) approach for multi-talker ASR, leveraging a pre-trained speech encoder and an LLM.
Our approach surpasses traditional AED-based methods on the simulated dataset LibriMix and achieves state-of-the-art performance on the evaluation set of the real-world dataset AMI.
arXiv Detail & Related papers (2024-08-30T17:29:25Z) - Investigating Decoder-only Large Language Models for Speech-to-text Translation [39.17113782374464]
Large language models (LLMs) are known for their exceptional reasoning capabilities, generalizability, and fluency across diverse domains.
We propose a decoder-only architecture that enables the LLM to directly consume the encoded speech representation and generate the text translation.
Our model achieves state-of-the-art performance on CoVoST 2 and FLEURS among models trained without proprietary data.
arXiv Detail & Related papers (2024-07-03T14:42:49Z) - TasTe: Teaching Large Language Models to Translate through Self-Reflection [82.83958470745381]
Large language models (LLMs) have exhibited remarkable performance in various natural language processing tasks.
We propose the TasTe framework, which stands for translating through self-reflection.
The evaluation results in four language directions on the WMT22 benchmark reveal the effectiveness of our approach compared to existing methods.
arXiv Detail & Related papers (2024-06-12T17:21:21Z) - DeMPT: Decoding-enhanced Multi-phase Prompt Tuning for Making LLMs Be Better Context-aware Translators [26.665489056201725]
We propose an adaptation approach, named Decoding-enhanced Multi-phase Prompt Tuning (DeMPT).
During each phase, different continuous prompts are introduced to make LLMs discriminately model various information.
Experiments show that our approach significantly outperforms the concatenation method.
arXiv Detail & Related papers (2024-02-23T09:01:00Z) - An Embarrassingly Simple Approach for LLM with Strong ASR Capacity [56.30595787061546]
We focus on solving one of the most important tasks in the field of speech processing with speech foundation encoders and large language models (LLMs).
Recent works have complex designs such as compressing the output temporally for the speech encoder, tackling modal alignment for the projector, and utilizing parameter-efficient fine-tuning for the LLM.
We found that delicate designs are not necessary: an embarrassingly simple composition of an off-the-shelf speech encoder, an LLM, and a trainable linear projector (the only trained component) is competent for the ASR task.
arXiv Detail & Related papers (2024-02-13T23:25:04Z) - Vocabulary-Defined Semantics: Latent Space Clustering for Improving In-Context Learning [32.178931149612644]
In-context learning enables language models to adapt to downstream data or tasks by using a few samples as demonstrations within the prompts.
However, the performance of in-context learning can be unstable depending on the quality, format, or order of demonstrations.
We propose a novel approach, "vocabulary-defined semantics."
arXiv Detail & Related papers (2024-01-29T14:29:48Z) - Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks [68.79880423713597]
We introduce a method that utilizes the ASR system's lattice output instead of relying solely on the top hypothesis.
Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM's resilience to noisy speech transcripts.
arXiv Detail & Related papers (2024-01-05T17:58:10Z) - FLIP: Fine-grained Alignment between ID-based Models and Pretrained Language Models for CTR Prediction [49.510163437116645]
Click-through rate (CTR) prediction serves as a core function module in personalized online services.
Traditional ID-based models for CTR prediction take as inputs the one-hot encoded ID features of tabular modality.
Pretrained Language Models (PLMs) have given rise to another paradigm, which takes as inputs the sentences of textual modality.
We propose to conduct Fine-grained feature-level ALignment between ID-based Models and Pretrained Language Models (FLIP) for CTR prediction.
arXiv Detail & Related papers (2023-10-30T11:25:03Z) - Exploring In-Context Learning of Textless Speech Language Model for Speech Classification Tasks [98.5311231450689]
In-context learning (ICL) has played an essential role in utilizing large language models (LLMs).
This study is the first work exploring ICL for speech classification tasks with textless speech LM.
arXiv Detail & Related papers (2023-10-19T05:31:45Z) - Interpreting Learned Feedback Patterns in Large Language Models [11.601799960959214]
We train probes to estimate the feedback signal implicit in the activations of a fine-tuned language model.
We compare these estimates to the true feedback, measuring how accurately the learned feedback patterns (LFPs) match the fine-tuning feedback.
We validate our probes by comparing the neural features they correlate with positive feedback inputs against the features GPT-4 describes and classifies as related to LFPs.
arXiv Detail & Related papers (2023-10-12T09:36:03Z) - An Empirical Study of Language Model Integration for Transducer based Speech Recognition [23.759084092602517]
Methods such as density ratio (DR) and internal language model (ILM) estimation (ILME) have been developed, outperforming the classic shallow fusion (SF) method.
We propose a low-order density ratio method (LODR) by training a low-order weak ILM for DR.
arXiv Detail & Related papers (2022-03-31T03:33:50Z) - On Language Model Integration for RNN Transducer based Speech Recognition [49.84285563767935]
We study various ILM correction-based LM integration methods formulated in a common RNN-T framework.
We provide a decoding interpretation on two major reasons for performance improvement with ILM correction.
We also propose an exact-ILM training framework by extending the proof given in the hybrid autoregressive transducer.
arXiv Detail & Related papers (2021-10-13T16:30:46Z) - Language Model Prior for Low-Resource Neural Machine Translation [85.55729693003829]
We propose a novel approach to incorporate an LM as a prior in a neural translation model (TM).
We add a regularization term, which pushes the output distributions of the TM to be probable under the LM prior (a minimal illustrative sketch of such a regularizer appears after this list).
Results on two low-resource machine translation datasets show clear improvements even with limited monolingual data.
arXiv Detail & Related papers (2020-04-30T16:29:56Z)
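For the last entry above, the regularization term can be made concrete with a short sketch. The following is a hedged sketch assuming a KL-style penalty that pulls the translation model's per-position output distribution toward a frozen LM prior; the exact formulation, the weight, and the names are illustrative assumptions, not that paper's objective.

```python
# Hedged sketch of an LM-prior regularizer for a translation model (TM):
# standard cross-entropy on the reference tokens plus a KL term that keeps the
# TM's per-position output distribution close to a frozen LM prior. The KL
# formulation and the weight are assumptions made for illustration.
import torch
import torch.nn.functional as F


def lm_prior_loss(tm_logits: torch.Tensor,    # (batch, seq, vocab) from the TM
                  lm_logits: torch.Tensor,    # (batch, seq, vocab) from the frozen LM
                  targets: torch.Tensor,      # (batch, seq) reference token ids
                  prior_weight: float = 0.5) -> torch.Tensor:
    # Usual translation loss on the reference tokens.
    ce = F.cross_entropy(tm_logits.transpose(1, 2), targets)
    # Penalty for drifting away from the LM prior's distribution.
    kl = F.kl_div(F.log_softmax(tm_logits, dim=-1),
                  F.softmax(lm_logits.detach(), dim=-1),
                  reduction="batchmean")
    return ce + prior_weight * kl
```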