An Embarrassingly Simple Approach for LLM with Strong ASR Capacity
- URL: http://arxiv.org/abs/2402.08846v1
- Date: Tue, 13 Feb 2024 23:25:04 GMT
- Title: An Embarrassingly Simple Approach for LLM with Strong ASR Capacity
- Authors: Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao
Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen
- Abstract summary: We focus on solving one of the most important tasks in the field of speech processing, automatic speech recognition (ASR), with speech foundation encoders and large language models (LLMs).
Recent works have complex designs such as compressing the output temporally for the speech encoder, tackling modal alignment for the projector, and utilizing parameter-efficient fine-tuning for the LLM.
We found that delicate designs are not necessary; an embarrassingly simple composition of an off-the-shelf speech encoder, an LLM, and a single trainable linear projector is competent for the ASR task.
- Score: 56.30595787061546
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we focus on solving one of the most important tasks in the
field of speech processing, i.e., automatic speech recognition (ASR), with
speech foundation encoders and large language models (LLM). Recent works have
complex designs such as compressing the output temporally for the speech
encoder, tackling modal alignment for the projector, and utilizing
parameter-efficient fine-tuning for the LLM. We found that delicate designs are
not necessary; rather, an embarrassingly simple composition of an off-the-shelf
speech encoder, an LLM, and a single trainable linear projector is competent for
the ASR task. To be more specific, we benchmark and explore various
combinations of LLMs and speech encoders, leading to the optimal LLM-based ASR
system, which we call SLAM-ASR. The proposed SLAM-ASR provides a clean setup
and little task-specific design, where only the linear projector is trained. To
the best of our knowledge, SLAM-ASR achieves the best performance on the
Librispeech benchmark among LLM-based ASR models and even outperforms the
latest LLM-based audio-universal model trained on massive paired data. Finally,
we explore the emergence of capability in LLM-based ASR over the course of
modal alignment. We hope that our study can facilitate research on extending
LLMs with cross-modal capacity and shed light on the LLM-based ASR community.
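The recipe the abstract describes, a frozen off-the-shelf speech encoder, a frozen LLM, and a single trainable linear projector bridging them, can be sketched as below. The `DummySpeechEncoder` and `DummyLLM` classes are hypothetical stand-ins (the paper composes real pretrained models, whose APIs are not specified here); only the wiring and the freezing pattern reflect the described setup.

```python
import torch
import torch.nn as nn

class DummySpeechEncoder(nn.Module):
    """Hypothetical stand-in for an off-the-shelf speech foundation encoder."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.layer = nn.Linear(80, feat_dim)   # placeholder for real encoder layers

    def forward(self, mel):                     # mel: (batch, frames, 80)
        return self.layer(mel)                  # (batch, frames, feat_dim)

class DummyLLM(nn.Module):
    """Hypothetical stand-in for a decoder-only LLM consuming input embeddings."""
    def __init__(self, hidden=1024, vocab=100):
        super().__init__()
        self.backbone = nn.Linear(hidden, hidden)
        self.lm_head = nn.Linear(hidden, vocab)

    def forward(self, embeds):                  # embeds: (batch, seq, hidden)
        return self.lm_head(self.backbone(embeds))

encoder, llm = DummySpeechEncoder(), DummyLLM()
projector = nn.Linear(512, 1024)                # the only trainable module

# Freeze the encoder and the LLM; only the linear projector is left trainable.
for p in encoder.parameters():
    p.requires_grad = False
for p in llm.parameters():
    p.requires_grad = False

mel = torch.randn(2, 50, 80)                    # fake log-mel features
speech_embeds = projector(encoder(mel))         # map speech features into LLM embedding space
logits = llm(speech_embeds)                     # (batch, frames, vocab)

trainable = sorted(n for n, p in list(encoder.named_parameters())
                   + list(projector.named_parameters())
                   + list(llm.named_parameters()) if p.requires_grad)
print(trainable)                                # only the projector's weight and bias
```

In a real system the projector would be trained with the usual next-token cross-entropy loss on transcript tokens while the frozen models supply the representations; the point of the sketch is simply that the optimizer ever sees only the projector's parameters.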
Related papers
- Prior Knowledge Integration via LLM Encoding and Pseudo Event Regulation for Video Moment Retrieval [23.94611751368491]
We investigate the feasibility of leveraging large language models (LLMs) for integrating general knowledge and incorporating pseudo-events as priors for temporal content distribution.
To overcome these limitations, we propose utilizing LLM encoders instead of decoders.
We present a general framework for integrating LLM encoders into existing VMR architectures, specifically within the fusion module.
arXiv Detail & Related papers (2024-07-21T04:39:06Z)
- Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
LLMs are prone to producing errors, hallucinations, and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding the decoding process of LLMs with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z) - New Solutions on LLM Acceleration, Optimization, and Application [14.995654657013741]
Large Language Models (LLMs) have become extremely potent instruments with exceptional capacities for comprehending and producing human-like text in a range of applications.
However, the increasing size and complexity of LLMs present significant challenges in both training and deployment.
We provide a review of recent advancements and research directions aimed at addressing these challenges.
arXiv Detail & Related papers (2024-06-16T11:56:50Z) - Efficient Prompting for LLM-based Generative Internet of Things [88.84327500311464]
Large language models (LLMs) have demonstrated remarkable capabilities on various tasks.
We propose a text-based generative IoT (GIoT) system deployed in the local network setting.
arXiv Detail & Related papers (2024-06-14T19:24:00Z) - One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models [67.49462724595445]
Retrieval-augmented generation (RAG) is a promising way to improve large language models (LLMs).
We propose a novel method that involves learning scalable and pluggable virtual tokens for RAG.
arXiv Detail & Related papers (2024-05-30T03:44:54Z) - ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation.
How to effectively encode and understand videos in video-based dialogue systems remains to be solved.
We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z) - Boosting Large Language Model for Speech Synthesis: An Empirical Study [86.89548753080432]
Large language models (LLMs) have made significant advancements in natural language processing and are concurrently extending the language ability to other modalities, such as speech and vision.
We conduct a comprehensive empirical exploration of boosting LLMs with the ability to generate speech, by combining pre-trained LLM LLaMA/OPT and text-to-speech synthesis model VALL-E.
We compare three integration methods between LLMs and speech models: directly fine-tuned LLMs, superposed layers of LLMs and VALL-E, and coupled LLMs and VALL-E using LLMs as a powerful text encoder.
arXiv Detail & Related papers (2023-12-30T14:20:04Z) - Harnessing the Zero-Shot Power of Instruction-Tuned Large Language Model
in End-to-End Speech Recognition [26.043533280932603]
We present a novel integration of an instruction-tuned large language model (LLM) and end-to-end automatic speech recognition (ASR).
We explore using this zero-shot capability of LLMs to extract linguistic information that can contribute to improving ASR performance.
arXiv Detail & Related papers (2023-09-19T11:10:50Z) - Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM
Inference Pipeline [22.08897444328099]
Large language models (LLMs) have revolutionized the field of AI, demonstrating unprecedented capacity across various tasks.
In this paper, we propose an efficient LLM inference pipeline that harnesses the power of LLMs.
arXiv Detail & Related papers (2023-05-22T15:36:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.