Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot
Task Generalization
- URL: http://arxiv.org/abs/2305.11095v3
- Date: Wed, 16 Aug 2023 00:57:34 GMT
- Title: Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot
Task Generalization
- Authors: Puyuan Peng, Brian Yan, Shinji Watanabe, David Harwath
- Abstract summary: We investigate the emergent abilities of the recently proposed web-scale speech model Whisper, by adapting it to unseen tasks with prompt engineering.
We design task-specific prompts, by either leveraging another large-scale model, or simply manipulating the special tokens in the default prompts.
Experiments show that, compared to the default prompts, our proposed prompts improve performance by 10% to 45% on the three zero-shot tasks, and even outperform SotA supervised models on some datasets.
- Score: 61.60501633397704
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate the emergent abilities of the recently proposed web-scale
speech model Whisper, by adapting it to unseen tasks with prompt engineering.
We selected three tasks: audio-visual speech recognition (AVSR), code-switched
speech recognition (CS-ASR), and speech translation (ST) on unseen language
pairs. We design task-specific prompts, by either leveraging another
large-scale model, or simply manipulating the special tokens in the default
prompts. Experiments show that compared to the default prompts, our proposed
prompts improve performance by 10% to 45% on the three zero-shot tasks, and
even outperform SotA supervised models on some datasets. In addition, our
experiments reveal many interesting properties of Whisper, including its
robustness to prompts, bias on accents, and the multilingual understanding in
its latent space. Code is available at
https://github.com/jasonppy/PromptingWhisper
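The abstract does not spell out what "manipulating the special tokens in the default prompts" looks like in practice. The sketch below is one possible illustration, assuming the special-token strings used by the open-source Whisper tokenizer (`<|startoftranscript|>`, language tokens such as `<|en|>`, `<|transcribe|>`, `<|notimestamps|>`). The helper names are hypothetical, and the two edits shown (two language tokens for code-switched speech, a target-language token for an unseen translation pair) are assumptions for illustration rather than a confirmed description of the authors' code.

```python
# Illustrative sketch only. Token strings follow the open-source Whisper
# tokenizer; helper names and the specific prompt edits are assumptions.

def default_prompt(language: str, task: str = "transcribe") -> str:
    """Build a Whisper-style default prompt as a plain string."""
    return f"<|startoftranscript|><|{language}|><|{task}|><|notimestamps|>"


def code_switch_prompt(lang_a: str, lang_b: str) -> str:
    """Assumed edit for CS-ASR: place two language tokens instead of one."""
    return (
        f"<|startoftranscript|><|{lang_a}|><|{lang_b}|>"
        "<|transcribe|><|notimestamps|>"
    )


def unseen_pair_translation_prompt(target_language: str) -> str:
    """Assumed edit for ST on an unseen pair: keep the transcribe task
    token but set the language token to the desired output language."""
    return default_prompt(target_language, task="transcribe")


if __name__ == "__main__":
    print(default_prompt("en"))
    print(code_switch_prompt("zh", "en"))
    print(unseen_pair_translation_prompt("de"))
```

These helpers only compose prompt strings; turning them into token IDs and passing them to a decoder is left to whichever Whisper implementation is in use.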
Related papers
- VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning [64.56272011710735]
We propose a novel single-stage joint speech-text SFT approach on the low-rank adaptation (LoRA) of the large language models (LLMs) backbone.
Compared to previous SpeechLMs with 7B or 13B parameters, our 3B model demonstrates superior performance across various speech benchmarks.
arXiv Detail & Related papers (2024-10-23T00:36:06Z)
- SpeechCaps: Advancing Instruction-Based Universal Speech Models with Multi-Talker Speaking Style Captioning [43.71388370559826]
This paper introduces a multi-talker speaking style captioning task to enhance the understanding of speaker and prosodic information.
We used large language models to generate descriptions for multi-talker speech.
We trained our model with pre-training on this captioning task followed by instruction tuning.
arXiv Detail & Related papers (2024-08-25T17:05:26Z)
- SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks [94.10497337235083]
We are first to explore the potential of prompting speech LMs in the domain of speech processing.
We reformulate speech processing tasks into speech-to-unit generation tasks.
We show that the prompting method can achieve competitive performance compared to the strong fine-tuning method.
arXiv Detail & Related papers (2024-08-23T13:00:10Z)
- SpeechVerse: A Large-scale Generalizable Audio Language Model [38.67969337605572]
SpeechVerse is a robust multi-task training and curriculum learning framework.
It combines pre-trained speech and text foundation models via a small set of learnable parameters.
Our empirical experiments reveal that our multi-task SpeechVerse model is even superior to conventional task-specific baselines on 9 out of the 11 tasks.
arXiv Detail & Related papers (2024-05-14T03:33:31Z)
- SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition [67.08798754009153]
Speech language models typically utilize task-dependent prompt tokens to unify various speech tasks in a single model.
We propose a novel decoder-only speech language model, SpeechComposer, that can unify common speech tasks by composing a fixed set of prompt tokens.
arXiv Detail & Related papers (2024-01-31T18:06:29Z)
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
- WAVPROMPT: Towards Few-Shot Spoken Language Understanding with Frozen Language Models [57.557319372969495]
Large-scale auto-regressive language models pretrained on massive text have demonstrated their impressive ability to perform new natural language tasks.
Recent studies further show that such a few-shot learning ability can be extended to the text-image setting by training an encoder to encode the images into embeddings.
We propose a novel speech understanding framework, WavPrompt, where we finetune a wav2vec model to generate a sequence of audio embeddings understood by the language model.
arXiv Detail & Related papers (2022-03-29T19:08:55Z)
- What shall we do with an hour of data? Speech recognition for the un- and under-served languages of Common Voice [0.20774268785384567]
This report describes the methods and results of a three-week sprint to produce deployable speech recognition models for 31 under-served languages of the Common Voice project.
arXiv Detail & Related papers (2021-05-10T21:16:28Z)