Related papers: LSTM-LM with Long-Term History for First-Pass Decoding in Conversational Speech Recognition

URL: http://arxiv.org/abs/2010.11349v1
Date: Wed, 21 Oct 2020 23:40:26 GMT
Title: LSTM-LM with Long-Term History for First-Pass Decoding in Conversational Speech Recognition
Authors: Xie Chen, Sarangarajan Parthasarathy, William Gale, Shuangyu Chang, Michael Zeng
Abstract summary: LSTM language models (LSTM-LMs) have been proven to be powerful and yielded significant performance improvements over count based n-gram LMs in modern speech recognition systems. Recent work shows that it is feasible and computationally affordable to adopt the LSTM-LMs in the first-pass decoding within a dynamic (or tree based) decoder framework.
Score: 27.639919625398
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: LSTM language models (LSTM-LMs) have been proven to be powerful and yielded significant performance improvements over count based n-gram LMs in modern speech recognition systems. Due to their infinite history states and computational load, most previous studies focus on applying LSTM-LMs in the second pass for rescoring purposes. Recent work shows that it is feasible and computationally affordable to adopt the LSTM-LMs in the first-pass decoding within a dynamic (or tree based) decoder framework. In this work, the LSTM-LM is composed with a WFST decoder on-the-fly for the first-pass decoding. Furthermore, motivated by the long-term history nature of LSTM-LMs, the use of context beyond the current utterance is explored for the first-pass decoding in conversational speech recognition. The context information is captured by the hidden states of LSTM-LMs across utterances and can be used to guide the first-pass search effectively. The experimental results on our internal meeting transcription system show that significant performance improvements can be obtained by incorporating the contextual information with LSTM-LMs in the first-pass decoding, compared to applying the contextual information in the second-pass rescoring.

Related papers

SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation [17.56310064245171]
SALMONN-omni is the first single standalone full-duplex speech LLM that operates without codec injection. It features a novel dynamic thinking mechanism within the LLM backbone, enabling the model to learn when to transition between speaking and listening. SALMONN-omni demonstrates strong performance in complex conversational scenarios, including turn-taking, backchanneling, echo cancellation, and context-dependent barge-in.
arXiv Detail & Related papers (2025-05-17T08:13:59Z)
LegoSLM: Connecting LLM with Speech Encoder using CTC Posteriors [22.845623101142483]
We propose a new paradigm, LegoSLM, that bridges speech encoders and Large Language Models (LLMs). Using the well-performing USM and Gemma models as an example, we demonstrate that our proposed LegoSLM method yields good performance on both ASR and speech translation tasks.
arXiv Detail & Related papers (2025-05-16T15:15:19Z)
Blending LLMs into Cascaded Speech Translation: KIT's Offline Speech Translation System for IWSLT 2024 [61.189875635090225]
Large Language Models (LLMs) are currently under exploration for various tasks, including Automatic Speech Recognition (ASR), Machine Translation (MT), and even End-to-End Speech Translation (ST).
arXiv Detail & Related papers (2024-06-24T16:38:17Z)
What do MLLMs hear? Examining reasoning with text and sound components in Multimodal Large Language Models [6.313516199029267]
We demonstrate through a captioning/classification experiment that an audio MLLM cannot fully leverage its LLM's text-based reasoning when generating audio captions. We also consider how this may be due to MLLMs separately representing auditory and textual information, such that it severs the reasoning pathway from the LLM to the audio encoder.
arXiv Detail & Related papers (2024-06-07T03:55:00Z)
ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation. How to effectively encode and understand videos in video-based dialogue systems remains to be solved. We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z)
An Embarrassingly Simple Approach for LLM with Strong ASR Capacity [56.30595787061546]
We focus on solving one of the most important tasks in the field of speech processing, with speech foundation encoders and large language models (LLMs). Recent works have complex designs such as compressing the output temporally for the speech encoder, tackling modal alignment for the projector, and utilizing parameter-efficient fine-tuning for the LLM. We found that delicate designs are not necessary, while an embarrassingly simple composition of an off-the-shelf speech encoder, an LLM, and a linear projector as the only trainable component is competent for the ASR task.
arXiv Detail & Related papers (2024-02-13T23:25:04Z)
TransLLaMa: LLM-based Simultaneous Translation System [18.27477980076409]
We show that a decoder-only large language model (LLM) can control input segmentation directly by generating a special "wait" token. This obviates the need for a separate policy and enables the LLM to perform English-German and English-Russian SiMT tasks. We also evaluated closed-source models such as GPT-4, which displayed encouraging results in performing the SiMT task without prior training.
arXiv Detail & Related papers (2024-02-07T07:39:27Z)
Speech Translation with Large Language Models: An Industrial Practice [64.5419534101104]
We introduce LLM-ST, a novel and effective speech translation model constructed upon a pre-trained large language model (LLM). By integrating the LLM with a speech encoder and employing multi-task instruction tuning, LLM-ST can produce accurate timestamped transcriptions and translations. Through rigorous experimentation on English and Chinese datasets, we showcase the exceptional performance of LLM-ST.
arXiv Detail & Related papers (2023-12-21T05:32:49Z)
Harnessing the Zero-Shot Power of Instruction-Tuned Large Language Model in End-to-End Speech Recognition [23.172469312225694]
We propose to utilize an instruction-tuned large language model (LLM) for guiding the text generation process in automatic speech recognition (ASR). The proposed model is built on the joint CTC and attention architecture, with the LLM serving as a front-end feature extractor for the decoder. Experimental results show that the proposed LLM-guided model achieves a relative gain of approximately 13% in word error rates across major benchmarks.
arXiv Detail & Related papers (2023-09-19T11:10:50Z)
Exploring the Integration of Large Language Models into Automatic Speech Recognition Systems: An Empirical Study [0.0]
This paper explores the integration of Large Language Models (LLMs) into Automatic Speech Recognition (ASR) systems. Our primary focus is to investigate the potential of using an LLM's in-context learning capabilities to enhance the performance of ASR systems.
arXiv Detail & Related papers (2023-07-13T02:31:55Z)
Inference with Reference: Lossless Acceleration of Large Language Models [97.04200102556551]
LLMA is an accelerator to speed up Large Language Model (LLM) inference with references. It is motivated by the observation that there are abundant identical text spans between the decoding result by an LLM and the reference that is available in many real world scenarios.
arXiv Detail & Related papers (2023-04-10T09:55:14Z)
Future Vector Enhanced LSTM Language Model for LVCSR [67.03726018635174]
This paper proposes a novel enhanced long short-term memory (LSTM) LM using the future vector. Experiments show that the proposed new LSTM LM gets a better result on BLEU scores for long-term sequence prediction. Rescoring using both the new and conventional LSTM LMs can achieve a very large improvement on the word error rate.
arXiv Detail & Related papers (2020-07-31T08:38:56Z)

This list is automatically generated from the titles and abstracts of the papers on this site.