Comparing Discrete and Continuous Space LLMs for Speech Recognition
- URL: http://arxiv.org/abs/2409.00800v1
- Date: Sun, 1 Sep 2024 18:29:45 GMT
- Title: Comparing Discrete and Continuous Space LLMs for Speech Recognition
- Authors: Yaoxun Xu, Shi-Xiong Zhang, Jianwei Yu, Zhiyong Wu, Dong Yu
- Abstract summary: This paper investigates discrete and continuous speech representations in Large Language Model (LLM)-based Automatic Speech Recognition (ASR).
We classify LLMs by their input and autoregressive feedback into continuous- and discrete-space models.
We open-source a system that achieves a state-of-the-art Word Error Rate (WER) of 1.69% on LibriSpeech using a HuBERT encoder.
- Score: 46.70297458685438
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper investigates discrete and continuous speech representations in Large Language Model (LLM)-based Automatic Speech Recognition (ASR), organizing them by feature continuity and training approach into four categories: supervised and unsupervised for both discrete and continuous types. We further classify LLMs based on their input and autoregressive feedback into continuous and discrete-space models. Using specialized encoders and comparative analysis with a Joint-Training-From-Scratch Language Model (JTFS LM) and pre-trained LLaMA2-7b, we provide a detailed examination of their effectiveness. Our work marks the first extensive comparison of speech representations in LLM-based ASR and explores various modeling techniques. We open-source our implementation, which achieves a state-of-the-art Word Error Rate (WER) of 1.69% on LibriSpeech using a HuBERT encoder, offering valuable insights for advancing ASR and natural language processing (NLP) research.
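As a concrete picture of the abstract's central distinction, here is a minimal, hypothetical PyTorch sketch (module names, dimensions, and the k-means quantizer are illustrative assumptions, not the paper's released code): a continuous-space model projects encoder frames directly into the LLM's embedding space, while a discrete-space model first quantizes frames to unit IDs and embeds those.

```python
import torch
import torch.nn as nn

class ContinuousSpeechAdapter(nn.Module):
    """Continuous-space input: project encoder frames straight into the
    LLM's embedding space, so the LLM consumes real-valued vectors."""
    def __init__(self, encoder_dim=768, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(encoder_dim, llm_dim)

    def forward(self, feats):                  # feats: (batch, frames, encoder_dim)
        return self.proj(feats)                # (batch, frames, llm_dim)

class DiscreteSpeechAdapter(nn.Module):
    """Discrete-space input: quantize each frame to its nearest unit
    (e.g., k-means centroids over HuBERT features), then embed the IDs."""
    def __init__(self, centroids, llm_dim=4096):
        super().__init__()
        self.register_buffer("centroids", centroids)   # (num_units, encoder_dim)
        self.embed = nn.Embedding(centroids.size(0), llm_dim)

    def forward(self, feats):                  # feats: (batch, frames, encoder_dim)
        batch = feats.size(0)
        dists = torch.cdist(feats, self.centroids.expand(batch, -1, -1))
        units = dists.argmin(dim=-1)           # (batch, frames) integer unit IDs
        return self.embed(units)
```

In the discrete case the autoregressive feedback is likewise a stream of token IDs, which is what lets a pre-trained text LLM such as LLaMA2-7b treat speech as just another token sequence.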
Related papers
- Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback [50.84142264245052]
This work introduces the Align-SLM framework to enhance the semantic understanding of textless Spoken Language Models (SLMs).
Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO).
We evaluate the framework using ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT-4o score and human evaluation.
arXiv Detail & Related papers (2024-11-04T06:07:53Z)
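A rough sketch of the Align-SLM preference-data recipe described in the entry above, where `generate_continuations` and `semantic_score` are hypothetical stand-ins for the paper's actual components:

```python
def build_dpo_pairs(generate_continuations, semantic_score, prompts, n_samples=4):
    """Build (prompt, chosen, rejected) preference triples for DPO.

    For each spoken prompt, sample several continuations from the SLM and
    rank them with a semantic metric; the best and worst form one pair.
    Both callables are hypothetical stand-ins, not the paper's API.
    """
    pairs = []
    for prompt in prompts:
        continuations = generate_continuations(prompt, n=n_samples)
        ranked = sorted(continuations, key=lambda c: semantic_score(prompt, c))
        pairs.append({"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]})
    return pairs
```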
- Large Language Models Are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities.
We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities.
We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.81% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
- MooER: LLM-based Speech Recognition and Translation Models from Moore Threads [13.02816167879662]
MooER is a large-scale automatic speech recognition (ASR) / automatic speech translation (AST) model from Moore Threads.
A 5,000-hour pseudo-labeled dataset containing open-source and self-collected speech data is used for training.
Experiments conducted on the CoVoST 2 Zh2en test set suggest that our model outperforms other open-source Speech LLMs.
arXiv Detail & Related papers (2024-08-09T14:43:56Z)
- Investigating Decoder-only Large Language Models for Speech-to-text Translation [39.17113782374464]
Large language models (LLMs) are known for their exceptional reasoning capabilities, generalizability, and fluency across diverse domains.
We propose a decoder-only architecture that enables the LLM to directly consume the encoded speech representation and generate the text translation.
Our model achieves state-of-the-art performance on CoVoST 2 and FLEURS among models trained without proprietary data.
arXiv Detail & Related papers (2024-07-03T14:42:49Z)
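The decoder-only recipe in the entry above amounts to prepending projected speech frames to the text sequence. A minimal, hypothetical sketch assuming a HuggingFace-style causal LM that accepts `inputs_embeds`; the projector and shapes are illustrative:

```python
import torch

def decoder_only_step(llm, projector, speech_feats, target_ids):
    """Decoder-only speech-to-text: projected speech frames are prepended
    to the target-text embeddings and the causal LM predicts text tokens;
    there is no separate encoder-decoder cross-attention."""
    speech_embeds = projector(speech_feats)               # (B, T_s, llm_dim)
    text_embeds = llm.get_input_embeddings()(target_ids)  # (B, T_t, llm_dim)
    inputs = torch.cat([speech_embeds, text_embeds], dim=1)
    logits = llm(inputs_embeds=inputs).logits             # (B, T_s + T_t, vocab)
    # The usual one-position shift applies when computing the cross-entropy
    # loss over the text span during training.
    return logits
```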
- A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision [74.972172804514]
We introduce a multi-task Transformer model, CSLR2, that ingests a signing sequence and outputs embeddings in a joint space shared between signed language and spoken language text.
New annotations provide continuous sign-level labels for six hours of test videos and will be made publicly available.
Our model significantly outperforms the previous state of the art on both tasks.
arXiv Detail & Related papers (2024-05-16T17:19:06Z)
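The joint embedding space in the entry above can be pictured as a standard dual-encoder setup; a hypothetical sketch in which both encoders are placeholders, not the paper's architecture:

```python
import torch.nn.functional as F

def joint_space_similarity(sign_encoder, text_encoder, signing_clip, sentence):
    """Dual-encoder view of a joint embedding space: embed a signing
    sequence and a spoken-language sentence into the same space, then
    compare by cosine similarity for retrieval in either direction."""
    v = F.normalize(sign_encoder(signing_clip), dim=-1)  # (D,) video embedding
    t = F.normalize(text_encoder(sentence), dim=-1)      # (D,) text embedding
    return (v * t).sum(-1)                               # cosine similarity
```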
- Speech Translation with Large Language Models: An Industrial Practice [64.5419534101104]
We introduce LLM-ST, a novel and effective speech translation model constructed upon a pre-trained large language model (LLM).
By integrating the large language model (LLM) with a speech encoder and employing multi-task instruction tuning, LLM-ST can produce accurate timestamped transcriptions and translations.
Through rigorous experimentation on English and Chinese datasets, we showcase the exceptional performance of LLM-ST.
arXiv Detail & Related papers (2023-12-21T05:32:49Z)
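The timestamped, multi-task output that LLM-ST is described as producing might be instruction-tuned with data shaped roughly like this; the prompt wording and `<t.tt>` timestamp tags are assumptions, not the paper's exact format:

```python
# Hypothetical multi-task instruction-tuning examples for a speech LLM.
# "<speech>" marks where projected audio features are spliced into the input.
examples = [
    {
        "instruction": "<speech> Transcribe the audio with timestamps.",
        "output": "<0.00> hello everyone <1.20> welcome to the show <2.75>",
    },
    {
        "instruction": "<speech> Translate the audio into Chinese with timestamps.",
        "output": "<0.00> 大家好 <1.20> 欢迎收听本节目 <2.75>",
    },
]
```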
- Exploring the Integration of Large Language Models into Automatic Speech Recognition Systems: An Empirical Study [0.0]
This paper explores the integration of Large Language Models (LLMs) into Automatic Speech Recognition (ASR) systems.
Our primary focus is to investigate the potential of using an LLM's in-context learning capabilities to enhance the performance of ASR systems.
arXiv Detail & Related papers (2023-07-13T02:31:55Z)
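One concrete form such in-context learning can take is few-shot hypothesis correction; a hypothetical sketch where `chat` stands in for any LLM completion call:

```python
def correct_hypothesis(chat, demonstrations, hypothesis):
    """Few-shot ASR error correction via in-context learning: show the LLM
    a handful of (noisy hypothesis, reference) pairs, then ask it to fix a
    new hypothesis. `chat` is a hypothetical LLM completion callable."""
    prompt = "Correct the errors in these ASR transcriptions.\n"
    for hyp, ref in demonstrations:
        prompt += f"ASR: {hyp}\nCorrected: {ref}\n"
    prompt += f"ASR: {hypothesis}\nCorrected:"
    return chat(prompt).strip()
```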
- W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training [49.47516627019855]
w2v-BERT is a framework that combines contrastive learning and masked language modeling for self-supervised speech pre-training.
Our experiments show that w2v-BERT achieves competitive results compared to current state-of-the-art pre-trained models.
arXiv Detail & Related papers (2021-08-07T06:29:36Z)
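Schematically, the combination named in the w2v-BERT title sums a contrastive loss and a masked-prediction loss computed on the same masked utterance; a sketch with assumed module interfaces, not the actual implementation:

```python
def w2v_bert_objective(contrastive_module, mlm_module, masked_batch, mlm_weight=1.0):
    """Schematic w2v-BERT objective: a wav2vec 2.0-style contrastive module
    discretizes speech into target IDs, and a masked-prediction (MLM) module
    learns to predict those IDs at masked positions; the losses are summed.
    Module interfaces here are assumptions for illustration."""
    contrastive_loss, target_ids = contrastive_module(masked_batch)
    mlm_loss = mlm_module(masked_batch, target_ids)
    return contrastive_loss + mlm_weight * mlm_loss
```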
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.