An Embarrassingly Simple Approach for LLM with Strong ASR Capacity
- URL: http://arxiv.org/abs/2402.08846v1
- Date: Tue, 13 Feb 2024 23:25:04 GMT
- Title: An Embarrassingly Simple Approach for LLM with Strong ASR Capacity
- Authors: Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao
Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen
- Abstract summary: We focus on solving one of the most important tasks in the field of speech processing, automatic speech recognition (ASR), with speech foundation encoders and large language models (LLMs).
Recent works have complex designs such as compressing the output temporally for the speech encoder, tackling modal alignment for the projector, and utilizing parameter-efficient fine-tuning for the LLM.
We found that delicate designs are not necessary; an embarrassingly simple composition of an off-the-shelf speech encoder, an LLM, and a single trainable linear projector is competent for the ASR task.
- Score: 56.30595787061546
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we focus on solving one of the most important tasks in the
field of speech processing, i.e., automatic speech recognition (ASR), with
speech foundation encoders and large language models (LLM). Recent works have
complex designs such as compressing the output temporally for the speech
encoder, tackling modal alignment for the projector, and utilizing
parameter-efficient fine-tuning for the LLM. We found that delicate designs are
not necessary; rather, an embarrassingly simple composition of an off-the-shelf
speech encoder, an LLM, and a single trainable linear projector is competent for
the ASR task. To be more specific, we benchmark and explore various
combinations of LLMs and speech encoders, leading to the optimal LLM-based ASR
system, which we call SLAM-ASR. The proposed SLAM-ASR provides a clean setup
and little task-specific design, where only the linear projector is trained. To
the best of our knowledge, SLAM-ASR achieves the best performance on the
Librispeech benchmark among LLM-based ASR models and even outperforms the
latest LLM-based audio-universal model trained on massive paired data. Finally,
we explore the emergence of capability in LLM-based ASR over the course of
modal alignment. We hope that our study can facilitate research on extending
LLMs with cross-modal capacity and shed light on the LLM-based ASR community.
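The recipe the abstract describes, a frozen off-the-shelf speech encoder, a frozen LLM, and a single trainable linear projector bridging them, can be sketched as below. The `DummySpeechEncoder` and `DummyLLM` classes are hypothetical stand-ins (the paper composes real pretrained models, whose APIs are not specified here); only the wiring and the freezing pattern reflect the described setup.

```python
import torch
import torch.nn as nn

class DummySpeechEncoder(nn.Module):
    """Hypothetical stand-in for an off-the-shelf speech foundation encoder."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.layer = nn.Linear(80, feat_dim)   # placeholder for real encoder layers

    def forward(self, mel):                     # mel: (batch, frames, 80)
        return self.layer(mel)                  # (batch, frames, feat_dim)

class DummyLLM(nn.Module):
    """Hypothetical stand-in for a decoder-only LLM consuming input embeddings."""
    def __init__(self, hidden=1024, vocab=100):
        super().__init__()
        self.backbone = nn.Linear(hidden, hidden)
        self.lm_head = nn.Linear(hidden, vocab)

    def forward(self, embeds):                  # embeds: (batch, seq, hidden)
        return self.lm_head(self.backbone(embeds))

encoder, llm = DummySpeechEncoder(), DummyLLM()
projector = nn.Linear(512, 1024)                # the only trainable module

# Freeze the encoder and the LLM; only the linear projector is left trainable.
for p in encoder.parameters():
    p.requires_grad = False
for p in llm.parameters():
    p.requires_grad = False

mel = torch.randn(2, 50, 80)                    # fake log-mel features
speech_embeds = projector(encoder(mel))         # map speech features into LLM embedding space
logits = llm(speech_embeds)                     # (batch, frames, vocab)

trainable = sorted(n for n, p in list(encoder.named_parameters())
                   + list(projector.named_parameters())
                   + list(llm.named_parameters()) if p.requires_grad)
print(trainable)                                # only the projector's weight and bias
```

In a real system the projector would be trained with the usual next-token cross-entropy loss on transcript tokens while the frozen models supply the representations; the point of the sketch is simply that the optimizer ever sees only the projector's parameters.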
Related papers
- Prior Knowledge Integration via LLM Encoding and Pseudo Event Regulation for Video Moment Retrieval [23.94611751368491]
We investigate the feasibility of leveraging large language models (LLMs) for integrating general knowledge and incorporating pseudo-events as priors for temporal content distribution.
To overcome these limitations, we propose utilizing LLM encoders instead of decoders.
We present a general framework for integrating LLM encoders into existing VMR architectures, specifically within the fusion module.
arXiv Detail & Related papers (2024-07-21T04:39:06Z)
- Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
LLMs are prone to producing errors, hallucinations, and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding the decoding process of LLMs with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z) - New Solutions on LLM Acceleration, Optimization, and Application [14.995654657013741]
Large Language Models (LLMs) have become extremely potent instruments with exceptional capacities for comprehending and producing human-like text in a range of applications.
However, the increasing size and complexity of LLMs present significant challenges in both training and deployment.
We provide a review of recent advancements and research directions aimed at addressing these challenges.
arXiv Detail & Related papers (2024-06-16T11:56:50Z) - Efficient Prompting for LLM-based Generative Internet of Things [88.84327500311464]
Large language models (LLMs) have demonstrated remarkable capabilities on various tasks.
We propose a text-based generative IoT (GIoT) system deployed in the local network setting.
arXiv Detail & Related papers (2024-06-14T19:24:00Z) - One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models [67.49462724595445]
Retrieval-augmented generation (RAG) is a promising way to improve large language models (LLMs).
We propose a novel method that involves learning scalable and pluggable virtual tokens for RAG.
arXiv Detail & Related papers (2024-05-30T03:44:54Z) - ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation.
How to effectively encode and understand videos in video-based dialogue systems remains to be solved.
We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z) - Boosting Large Language Model for Speech Synthesis: An Empirical Study [86.89548753080432]
Large language models (LLMs) have made significant advancements in natural language processing and are concurrently extending the language ability to other modalities, such as speech and vision.
We conduct a comprehensive empirical exploration of boosting LLMs with the ability to generate speech, by combining pre-trained LLM LLaMA/OPT and text-to-speech synthesis model VALL-E.
We compare three integration methods between LLMs and speech models: directly fine-tuned LLMs, superposed layers of LLMs and VALL-E, and coupled LLMs and VALL-E using LLMs as a powerful text encoder.
arXiv Detail & Related papers (2023-12-30T14:20:04Z) - Harnessing the Zero-Shot Power of Instruction-Tuned Large Language Model
in End-to-End Speech Recognition [26.043533280932603]
We present a novel integration of an instruction-tuned large language model (LLM) and end-to-end automatic speech recognition (ASR).
We explore using this zero-shot capability of LLMs to extract linguistic information that can contribute to improving ASR performance.
arXiv Detail & Related papers (2023-09-19T11:10:50Z) - Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM
Inference Pipeline [22.08897444328099]
Large language models (LLMs) have revolutionized the field of AI, demonstrating unprecedented capacity across various tasks.
In this paper, we propose an efficient LLM inference pipeline that harnesses the power of LLMs.
arXiv Detail & Related papers (2023-05-22T15:36:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.