Connecting Speech Encoder and Large Language Model for ASR
- URL: http://arxiv.org/abs/2309.13963v2
- Date: Tue, 26 Sep 2023 11:09:25 GMT
- Title: Connecting Speech Encoder and Large Language Model for ASR
- Authors: Wenyi Yu, Changli Tang, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang
- Abstract summary: The impressive capability and versatility of large language models (LLMs) have attracted increasing attention in automatic speech recognition (ASR).
This paper presents a comparative study of three commonly used connector structures: fully connected layers, multi-head cross-attention, and Q-Former.
Experiments were performed on the commonly used LibriSpeech, Common Voice, and GigaSpeech datasets.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The impressive capability and versatility of large language models (LLMs)
have attracted increasing attention in automatic speech recognition (ASR), with
several pioneering studies attempting to build integrated ASR models by
connecting a speech encoder with an LLM. This paper presents a comparative
study of three commonly used connector structures: fully connected layers,
multi-head cross-attention, and Q-Former. Speech encoders
from the Whisper model series as well as LLMs from the Vicuna model series with
different model sizes were studied. Experiments were performed on the commonly
used LibriSpeech, Common Voice, and GigaSpeech datasets, where the LLMs with
Q-Formers demonstrated consistent and considerable word error rate (WER)
reductions over LLMs with other connector structures. Q-Former-based LLMs can
generalise well to out-of-domain datasets, where 12% relative WER reductions
over the Whisper baseline ASR model were achieved on the Eval2000 test set
without using any in-domain training data from Switchboard. Moreover, a novel
segment-level Q-Former is proposed to enable LLMs to recognise speech segments
with a duration exceeding the limitation of the encoders, which results in 17%
relative WER reductions over other connector structures on 90-second-long
speech data.
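To make the connector comparison above concrete, the following is a minimal sketch of a Q-Former-style connector: a fixed set of learnable query vectors cross-attends to the speech encoder's output frames and produces a short, fixed-length sequence of embeddings for the LLM. The layer count, query count, and the Whisper/Vicuna dimensions used here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class QFormerConnector(nn.Module):
    """Sketch of a single-block Q-Former connector (illustrative, not the paper's code)."""
    def __init__(self, enc_dim=1280, llm_dim=4096, num_queries=64, num_heads=8):
        super().__init__()
        # Learnable queries: one per output "slot" handed to the LLM.
        self.queries = nn.Parameter(torch.randn(num_queries, enc_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(enc_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(enc_dim, enc_dim), nn.GELU(),
                                 nn.Linear(enc_dim, enc_dim))
        self.norm1 = nn.LayerNorm(enc_dim)
        self.norm2 = nn.LayerNorm(enc_dim)
        # Project into the LLM's embedding space (e.g. a Vicuna hidden size).
        self.to_llm = nn.Linear(enc_dim, llm_dim)

    def forward(self, enc_out):  # enc_out: (batch, frames, enc_dim) from the speech encoder
        q = self.queries.unsqueeze(0).expand(enc_out.size(0), -1, -1)
        x, _ = self.cross_attn(q, enc_out, enc_out)  # queries attend to speech frames
        x = self.norm1(x + q)
        x = self.norm2(x + self.ffn(x))
        return self.to_llm(x)  # (batch, num_queries, llm_dim), prepended to the text prompt
```

For the segment-level Q-Former mentioned above, one plausible reading is that long audio is split into encoder-sized windows, the connector is applied per window, and the resulting query outputs are concatenated before being fed to the LLM; the exact recipe is given in the paper.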
Related papers
- Large Language Models Are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities.
We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities.
We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.81% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z) - Advancing Multi-talker ASR Performance with Large Language Models [48.52252970956368]
Recognizing overlapping speech from multiple speakers in conversational scenarios is one of the most challenging problems for automatic speech recognition (ASR).
In this paper, we propose an LLM-based SOT approach for multi-talker ASR, leveraging a pre-trained speech encoder and an LLM.
Our approach surpasses traditional AED-based methods on the simulated dataset LibriMix and achieves state-of-the-art performance on the evaluation set of the real-world dataset AMI.
arXiv Detail & Related papers (2024-08-30T17:29:25Z) - DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding [51.32965203977845]
We propose the use of discrete speech units (DSU) instead of continuous-valued speech encoder outputs.
The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering.
Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
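Since the summary above hinges on replacing continuous encoder outputs with discrete speech units, here is a minimal sketch of the common DSU recipe (k-means quantisation of encoder features); this shows the general technique under that assumption, not necessarily the paper's exact pipeline.

```python
import torch

def features_to_units(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map continuous encoder features to discrete speech unit IDs.

    features: (frames, dim) encoder outputs; codebook: (K, dim) pre-trained k-means centroids.
    """
    dists = torch.cdist(features, codebook)   # (frames, K) Euclidean distances to centroids
    units = dists.argmin(dim=-1)              # (frames,) nearest-centroid unit IDs
    # Collapse consecutive repeats, a common step before feeding units to an LM.
    keep = torch.ones_like(units, dtype=torch.bool)
    keep[1:] = units[1:] != units[:-1]
    return units[keep]
```

The de-duplicated unit IDs can then be embedded (or verbalised as tokens) and passed to the LLM in place of dense speech features.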
arXiv Detail & Related papers (2024-06-13T17:28:13Z) - An Embarrassingly Simple Approach for LLM with Strong ASR Capacity [56.30595787061546]
We focus on solving one of the most important tasks in the field of speech processing, namely automatic speech recognition, with speech foundation encoders and large language models (LLMs).
Recent works have complex designs such as temporally compressing the speech encoder output, tackling modal alignment for the projector, and utilizing parameter-efficient fine-tuning for the LLM.
We found that such delicate designs are not necessary: an embarrassingly simple composition of an off-the-shelf speech encoder, an LLM, and a single trainable linear projector is competent for the ASR task.
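As a rough illustration of that composition, the sketch below freezes both backbones and trains only a linear projector that maps frame-stacked speech features into the LLM's embedding space. The class name, dimensions, and stacking factor are placeholders, not values from the paper.

```python
import torch
import torch.nn as nn

class LinearProjectorASR(nn.Module):
    """Sketch: frozen speech encoder + frozen LLM + one trainable linear projector."""
    def __init__(self, speech_encoder: nn.Module, llm: nn.Module,
                 enc_dim=1280, llm_dim=4096, stack=4):
        super().__init__()
        self.encoder, self.llm, self.stack = speech_encoder, llm, stack
        for p in self.encoder.parameters():   # both backbones stay frozen
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False
        # The only trainable component: concatenate `stack` frames, then project.
        self.proj = nn.Linear(enc_dim * stack, llm_dim)

    def speech_embeddings(self, audio_feats):       # audio_feats: input to the speech encoder
        h = self.encoder(audio_feats)               # (batch, frames, enc_dim)
        t = (h.size(1) // self.stack) * self.stack  # drop trailing frames
        h = h[:, :t].reshape(h.size(0), t // self.stack, -1)
        return self.proj(h)                         # shorter sequence in the LLM's space

# In use, these embeddings would be concatenated with the prompt's token embeddings
# and decoded by the frozen LLM; only `proj` receives gradient updates.
```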
arXiv Detail & Related papers (2024-02-13T23:25:04Z) - Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The cross-speaker encoding (CSE) network addresses the limitations of SIMO models by aggregating cross-speaker representations.
The CSE network is further integrated with SOT to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z) - Prompting Large Language Models with Speech Recognition Abilities [31.77576008965215]
We extend the capabilities of large language models by directly attaching a small audio encoder, allowing them to perform speech recognition.
Experiments on Multilingual LibriSpeech show that incorporating a conformer encoder into the open-sourced LLaMA-7B allows it to outperform monolingual baselines by 18%.
arXiv Detail & Related papers (2023-07-21T08:39:15Z) - Streaming Multi-Talker ASR with Token-Level Serialized Output Training [53.11450530896623]
t-SOT is a novel framework for streaming multi-talker automatic speech recognition.
The t-SOT model has the advantages of less inference cost and a simpler model architecture.
For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2022-02-02T01:27:21Z) - Cross-Utterance Language Models with Acoustic Error Sampling [1.376408511310322]
Cross-utterance LM (CULM) is proposed to augment the input to a standard long short-term memory (LSTM) LM.
An acoustic error sampling technique is proposed to reduce the mismatch between training and test-time.
Experiments performed on both AMI and Switchboard datasets show that CULMs outperform the LSTM LM baseline in terms of WER.
arXiv Detail & Related papers (2020-08-19T17:40:11Z)