An Embarrassingly Simple Approach for LLM with Strong ASR Capacity
- URL: http://arxiv.org/abs/2402.08846v1
- Date: Tue, 13 Feb 2024 23:25:04 GMT
- Title: An Embarrassingly Simple Approach for LLM with Strong ASR Capacity
- Authors: Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao
Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen
- Abstract summary: We focus on solving one of the most important tasks in the field of speech processing, automatic speech recognition (ASR), with speech foundation encoders and large language models (LLMs).
Recent works adopt complex designs, such as temporally compressing the speech encoder output, tackling modal alignment in the projector, and applying parameter-efficient fine-tuning to the LLM.
We find that such delicate designs are not necessary: an embarrassingly simple composition of an off-the-shelf speech encoder, an LLM, and a single trainable linear projector is competent for the ASR task.
- Score: 56.30595787061546
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we focus on solving one of the most important
tasks in the field of speech processing, i.e., automatic speech recognition
(ASR), with speech foundation encoders and large language models (LLMs).
Recent works adopt complex designs, such as temporally compressing the
speech encoder output, tackling modal alignment in the projector, and
applying parameter-efficient fine-tuning to the LLM. We find that such
delicate designs are not necessary: an embarrassingly simple composition of
an off-the-shelf speech encoder, an LLM, and a single trainable linear
projector is competent for the ASR task. More specifically, we benchmark
various combinations of LLMs and speech encoders, arriving at an optimal
LLM-based ASR system, which we call SLAM-ASR. SLAM-ASR provides a clean
setup with little task-specific design: only the linear projector is
trained. To the best of our knowledge, SLAM-ASR achieves the best
performance on the LibriSpeech benchmark among LLM-based ASR models and
even outperforms the latest LLM-based audio-universal model trained on
massive paired data. Finally, we explore how capability emerges in
LLM-based ASR during modal alignment. We hope that our study can facilitate
research on extending LLMs with cross-modality capacity and benefit the
LLM-based ASR community.
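The recipe described in the abstract is simple enough to sketch directly.
Below is a minimal PyTorch sketch of a SLAM-ASR-style composition: a frozen
off-the-shelf speech encoder, a frozen LLM, and a single trainable linear
projector that maps encoder frames into the LLM embedding space. The class
name, the Hugging Face-style inputs_embeds/labels interface, and the
dimensions are illustrative assumptions, not the authors' released code.

import torch
import torch.nn as nn


class SimpleLLMASR(nn.Module):
    """Frozen speech encoder + frozen LLM + one trainable linear projector."""

    def __init__(self, speech_encoder: nn.Module, llm: nn.Module,
                 encoder_dim: int, llm_dim: int):
        super().__init__()
        self.speech_encoder = speech_encoder
        self.llm = llm
        # The only trainable component in this setup.
        self.projector = nn.Linear(encoder_dim, llm_dim)
        # Freeze the off-the-shelf encoder and LLM.
        for module in (self.speech_encoder, self.llm):
            for p in module.parameters():
                p.requires_grad = False

    def forward(self, waveform: torch.Tensor, prompt_embeds: torch.Tensor,
                labels: torch.Tensor):
        # (B, T, encoder_dim) acoustic frames from the frozen encoder.
        with torch.no_grad():
            feats = self.speech_encoder(waveform)
        # Project each frame into the LLM embedding space: (B, T, llm_dim).
        speech_embeds = self.projector(feats)
        # Prepend the projected speech to the text-prompt embeddings; the
        # transcript tokens are supervised with ordinary next-token
        # cross-entropy, and -100 masks the speech positions out of a
        # Hugging Face-style causal-LM loss.
        inputs = torch.cat([speech_embeds, prompt_embeds], dim=1)
        ignore = torch.full(speech_embeds.shape[:2], -100,
                            dtype=labels.dtype, device=labels.device)
        return self.llm(inputs_embeds=inputs,
                        labels=torch.cat([ignore, labels], dim=1))


# Usage sketch (identifiers are placeholders, not the paper's exact choices):
#   encoder = ...  # e.g. a WavLM/Whisper-style encoder returning frame features
#   llm = AutoModelForCausalLM.from_pretrained("<some-causal-lm>")
#   model = SimpleLLMASR(encoder, llm, encoder_dim=1024, llm_dim=4096)
#   optim = torch.optim.AdamW(model.projector.parameters(), lr=1e-4)

Because only the projector's parameters require gradients, the optimizer
sees a tiny parameter set while the encoder and LLM stay fixed, which is
exactly the clean, minimally task-specific setup the abstract describes.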
Related papers
- Large Language Models Are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities.
We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities.
We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, achieving new state-of-the-art results for ASR and AVSR with WERs of 0.81% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
- CoMMIT: Coordinated Instruction Tuning for Multimodal Large Language Models [68.64605538559312]
In this paper, we analyze the MLLM instruction tuning from both theoretical and empirical perspectives.
Inspired by our findings, we propose a measurement to quantitatively evaluate the learning balance.
In addition, we introduce an auxiliary loss regularization method to promote updating of the generation distribution of MLLMs.
arXiv Detail & Related papers (2024-07-29T23:18:55Z)
- Prior Knowledge Integration via LLM Encoding and Pseudo Event Regulation for Video Moment Retrieval [23.94611751368491]
We investigate the feasibility of leveraging large language models (LLMs) for integrating general knowledge and incorporating pseudo-events as priors for temporal content distribution.
To overcome these limitations, we propose utilizing LLM encoders instead of decoders.
We present a general framework for integrating LLM encoders into existing VMR architectures, specifically within the fusion module.
arXiv Detail & Related papers (2024-07-21T04:39:06Z)
- New Solutions on LLM Acceleration, Optimization, and Application [14.995654657013741]
Large Language Models (LLMs) have become extremely powerful tools, with exceptional capabilities for comprehending and producing human-like text in a range of applications.
However, the increasing size and complexity of LLMs present significant challenges in both training and deployment.
We provide a review of recent advancements and research directions aimed at addressing these challenges.
arXiv Detail & Related papers (2024-06-16T11:56:50Z)
- Efficient Prompting for LLM-based Generative Internet of Things [88.84327500311464]
Large language models (LLMs) have demonstrated remarkable capacities on various tasks, and integrating the capacities of LLMs into the Internet of Things (IoT) applications has drawn much research attention recently.
Due to security concerns, many institutions avoid accessing state-of-the-art commercial LLM services, requiring the deployment and utilization of open-source LLMs in a local network setting.
In this study, we propose an LLM-based Generative IoT (GIoT) system deployed in a local network setting.
arXiv Detail & Related papers (2024-06-14T19:24:00Z)
- ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation.
How to effectively encode and understand videos in video-based dialogue systems remains to be solved.
We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z)
- Boosting Large Language Model for Speech Synthesis: An Empirical Study [86.89548753080432]
Large language models (LLMs) have made significant advancements in natural language processing and are concurrently extending the language ability to other modalities, such as speech and vision.
We conduct a comprehensive empirical exploration of boosting LLMs with the ability to generate speech, by combining pre-trained LLM LLaMA/OPT and text-to-speech synthesis model VALL-E.
We compare three integration methods between LLMs and speech models: directly fine-tuned LLMs, superposed layers of LLMs and VALL-E, and coupled LLMs and VALL-E using the LLM as a powerful text encoder.
arXiv Detail & Related papers (2023-12-30T14:20:04Z)
- Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline [22.08897444328099]
Large language models (LLMs) have revolutionized the field of AI, demonstrating unprecedented capacity across various tasks.
In this paper, we propose an efficient LLM inference pipeline that uses LLMs themselves to perceive response lengths and schedule sequences accordingly.
arXiv Detail & Related papers (2023-05-22T15:36:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.