Instruction-Following Speech Recognition
- URL: http://arxiv.org/abs/2309.09843v1
- Date: Mon, 18 Sep 2023 14:59:10 GMT
- Title: Instruction-Following Speech Recognition
- Authors: Cheng-I Jeff Lai, Zhiyun Lu, Liangliang Cao, Ruoming Pang
- Abstract summary: We introduce instruction-following speech recognition, training a Listen-Attend-Spell model to understand and execute a diverse set of free-form text instructions.
Remarkably, our model, trained from scratch on Librispeech, interprets and executes simple instructions without requiring Large Language Models or pre-trained speech modules.
- Score: 21.591086644665197
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conventional end-to-end Automatic Speech Recognition (ASR) models primarily
focus on exact transcription tasks, lacking flexibility for nuanced user
interactions. With the advent of Large Language Models (LLMs) in speech
processing, more organic, text-prompt-based interactions have become possible.
However, the mechanisms behind these models' speech understanding and
"reasoning" capabilities remain underexplored. To study this question from the
data perspective, we introduce instruction-following speech recognition,
training a Listen-Attend-Spell model to understand and execute a diverse set of
free-form text instructions. This enables a multitude of speech recognition
tasks -- ranging from transcript manipulation to summarization -- without
relying on predefined command sets. Remarkably, our model, trained from scratch
on Librispeech, interprets and executes simple instructions without requiring
LLMs or pre-trained speech modules. It also offers selective transcription
options based on instructions like "transcribe first half and then turn off
listening," providing an additional layer of privacy and safety compared to
existing LLMs. Our findings highlight the significant potential of
instruction-following training to advance speech foundation models.
Related papers
- Self-Powered LLM Modality Expansion for Large Speech-Text Models [62.27700381806554]
Large language models (LLMs) exhibit remarkable performance across diverse tasks.
This study aims to refine the use of speech datasets for LSM training by addressing the limitations of vanilla instruction tuning.
We introduce a self-powered LSM that leverages augmented automatic speech recognition data generated by the model itself for more effective instruction tuning.
arXiv Detail & Related papers (2024-10-04T04:34:24Z) - Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs)
We present a simple yet effective automatic process for creating speech-text pair data.
Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z) - Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions [68.98811048970963]
We present a pioneering effort to investigate the capability of large language models (LLMs) in transcribing speech in multi-talker environments.
Our approach utilizes WavLM and Whisper encoder to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context.
Comprehensive experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail party scenarios.
arXiv Detail & Related papers (2024-09-13T07:28:28Z) - SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks [94.10497337235083]
We are first to explore the potential of prompting speech LMs in the domain of speech processing.
We reformulate speech processing tasks into speech-to-unit generation tasks.
We show that the prompting method can achieve competitive performance compared to the strong fine-tuning method.
arXiv Detail & Related papers (2024-08-23T13:00:10Z) - DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment [82.86363991170546]
We propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities.
Our model demonstrates superior performance on the Dynamic-SUPERB benchmark, particularly in generalizing to unseen tasks.
These findings highlight the potential to reshape instruction-following SLMs by incorporating descriptive rich, speech captions.
arXiv Detail & Related papers (2024-06-27T03:52:35Z) - SALM: Speech-augmented Language Model with In-context Learning for
Speech Recognition and Translation [26.778332992311043]
We present a novel Speech Augmented Language Model (SALM) with em multitask and em in-context learning capabilities.
The unified SALM achieves performance on par with task-specific Conformer baselines for Automatic Speech Recognition (ASR) and Speech Translation (AST)
arXiv Detail & Related papers (2023-10-13T22:07:33Z) - BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing [35.31866559807704]
modality alignment between speech and text remains an open problem.
We propose the BLSP approach that bootstraps Language-Speech Pre-training via behavior alignment of continuation writing.
We demonstrate that this straightforward process can extend the capabilities of LLMs to speech, enabling speech recognition, speech translation, spoken language understanding, and speech conversation, even in zero-shot cross-lingual scenarios.
arXiv Detail & Related papers (2023-09-02T11:46:05Z) - VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for
Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework VATLM (Visual-Audio-Text Language Model)
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.