SALM: Speech-augmented Language Model with In-context Learning for
Speech Recognition and Translation
- URL: http://arxiv.org/abs/2310.09424v1
- Date: Fri, 13 Oct 2023 22:07:33 GMT
- Title: SALM: Speech-augmented Language Model with In-context Learning for
Speech Recognition and Translation
- Authors: Zhehuai Chen, He Huang, Andrei Andrusenko, Oleksii Hrinchuk, Krishna
C. Puvvada, Jason Li, Subhankar Ghosh, Jagadeesh Balam, Boris Ginsburg
- Abstract summary: We present a novel Speech Augmented Language Model (SALM) with multitask and in-context learning capabilities.
The unified SALM achieves performance on par with task-specific Conformer baselines for Automatic Speech Recognition (ASR) and Speech Translation (AST).
- Score: 26.778332992311043
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a novel Speech Augmented Language Model (SALM) with
multitask and in-context learning capabilities. SALM comprises a frozen
text LLM, an audio encoder, a modality adapter module, and LoRA layers to
accommodate speech input and associated task instructions. The unified SALM
not only achieves performance on par with task-specific Conformer baselines
for Automatic Speech Recognition (ASR) and Speech Translation (AST), but also
exhibits zero-shot in-context learning capabilities, demonstrated through a
keyword-boosting task for ASR and AST. Moreover, speech supervised in-context
training is proposed to bridge the gap between LLM training and downstream
speech tasks, which further boosts the in-context learning ability of
speech-to-text models. The proposed model is open-sourced via the NeMo toolkit.
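
The architecture in the abstract maps onto a small amount of model code. Below is a minimal, illustrative PyTorch sketch of the four described components: a frozen text LLM, an audio encoder, a modality adapter, and LoRA layers. The GRU encoder, the two-layer transformer standing in for the LLM, and all dimensions are placeholder assumptions for illustration only; the actual implementation is the one released in the NeMo toolkit.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank (LoRA) update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pretrained weight frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # start as a zero (identity) update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

class SALMSketch(nn.Module):
    """Toy stand-in for the SALM layout; not the NeMo implementation."""
    def __init__(self, d_audio=80, d_model=512, vocab=32000):
        super().__init__()
        # Audio encoder (the paper uses a Conformer; a GRU is a placeholder).
        self.audio_encoder = nn.GRU(d_audio, d_model, batch_first=True)
        # Modality adapter: maps encoder features into the LLM embedding space.
        self.adapter = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                     nn.Linear(d_model, d_model))
        # Frozen "LLM": a tiny transformer stands in for a pretrained text LLM.
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)
        for p in list(self.embed.parameters()) + list(self.llm.parameters()):
            p.requires_grad = False
        # LoRA on the output head lets the frozen LLM adapt to speech input.
        self.lm_head = LoRALinear(nn.Linear(d_model, vocab))

    def forward(self, audio_feats, prompt_ids):
        speech, _ = self.audio_encoder(audio_feats)  # (B, T, d_model)
        speech = self.adapter(speech)                # project into LLM space
        prompt = self.embed(prompt_ids)              # task instruction tokens
        x = torch.cat([prompt, speech], dim=1)       # prompt followed by speech
        return self.lm_head(self.llm(x))             # next-token logits

In this sketch the task instruction (e.g. transcribe vs. translate) arrives as prompt_ids, so the keyword-boosting behavior described in the abstract would amount to prepending biasing keywords to that same prompt, exercising in-context learning rather than any retrained component.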
Related papers
- Self-Powered LLM Modality Expansion for Large Speech-Text Models [62.27700381806554]
Large language models (LLMs) exhibit remarkable performance across diverse tasks.
This study aims to refine the use of speech datasets for LSM training by addressing the limitations of vanilla instruction tuning.
We introduce a self-powered LSM that leverages augmented automatic speech recognition data generated by the model itself for more effective instruction tuning.
arXiv Detail & Related papers (2024-10-04T04:34:24Z)
- Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs).
We present a simple yet effective automatic process for creating speech-text pair data.
Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z)
- Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions [68.98811048970963]
We present a pioneering effort to investigate the capability of large language models (LLMs) in transcribing speech in multi-talker environments.
Our approach utilizes WavLM and Whisper encoders to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context.
Comprehensive experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail party scenarios.
arXiv Detail & Related papers (2024-09-13T07:28:28Z)
- SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks [94.10497337235083]
We are the first to explore the potential of prompting speech LMs in the domain of speech processing.
We reformulate speech processing tasks into speech-to-unit generation tasks.
We show that the prompting method can achieve competitive performance compared to the strong fine-tuning method.
arXiv Detail & Related papers (2024-08-23T13:00:10Z)
- An End-to-End Speech Summarization Using Large Language Model [7.562198375754054]
Speech Summarization (SSum) aims to generate human-like text summaries from spoken content.
Research on large language models (LLMs) and multimodal information fusion has provided new insights.
We propose an end-to-end SSum model that utilizes Q-Former as a connector for the audio-text modality.
arXiv Detail & Related papers (2024-07-02T07:22:57Z)
- DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding [51.32965203977845]
We propose the use of discrete speech units (DSU) instead of continuous-valued speech encoder outputs.
The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering.
Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
arXiv Detail & Related papers (2024-06-13T17:28:13Z)
- Prompting Large Language Models with Audio for General-Purpose Speech Summarization [13.415189715216354]
We introduce a framework for speech summarization that leverages the processing and reasoning capabilities of large language models (LLMs).
We propose an end-to-end system that combines an instruction-tuned LLM with an audio encoder that converts speech into token representations that the LLM can interpret.
arXiv Detail & Related papers (2024-06-10T02:04:28Z)
- Instruction-Following Speech Recognition [21.591086644665197]
We introduce instruction-following speech recognition, training a Listen-Attend-Spell model to understand and execute a diverse set of free-form text instructions.
Remarkably, our model, trained from scratch on LibriSpeech, interprets and executes simple instructions without requiring large language models or pre-trained speech modules.
arXiv Detail & Related papers (2023-09-18T14:59:10Z)
- SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts [108.04306136086807]
We present research that explores the application of prompt tuning to stimulate speech LMs for various generation tasks, within a unified framework called SpeechGen.
The proposed unified framework holds great promise for efficiency and effectiveness, particularly with the imminent arrival of advanced speech LMs.
arXiv Detail & Related papers (2023-06-03T22:35:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.