WAVPROMPT: Towards Few-Shot Spoken Language Understanding with Frozen
Language Models
- URL: http://arxiv.org/abs/2203.15863v1
- Date: Tue, 29 Mar 2022 19:08:55 GMT
- Title: WAVPROMPT: Towards Few-Shot Spoken Language Understanding with Frozen
Language Models
- Authors: Heting Gao, Junrui Ni, Kaizhi Qian, Yang Zhang, Shiyu Chang, Mark
Hasegawa-Johnson
- Abstract summary: Large-scale auto-regressive language models pretrained on massive text have demonstrated their impressive ability to perform new natural language tasks.
Recent studies further show that such a few-shot learning ability can be extended to the text-image setting by training an encoder to encode the images into embeddings.
We propose a novel speech understanding framework, WavPrompt, where we finetune a wav2vec model to generate a sequence of audio embeddings understood by the language model.
- Score: 57.557319372969495
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale auto-regressive language models pretrained on massive text have
demonstrated their impressive ability to perform new natural language tasks
with only a few text examples, without the need for fine-tuning. Recent studies
further show that such a few-shot learning ability can be extended to the
text-image setting by training an encoder to encode the images into embeddings
functioning like the text embeddings of the language model. Interested in
exploring the possibility of transferring the few-shot learning ability to the
audio-text setting, we propose a novel speech understanding framework,
WavPrompt, where we finetune a wav2vec model to generate a sequence of audio
embeddings understood by the language model. We show that WavPrompt is a
few-shot learner that can perform speech understanding tasks better than a
naive text baseline. We conduct detailed ablation studies on different
components and hyperparameters to empirically identify the best model
configuration. In addition, we conduct a non-speech understanding experiment to
show WavPrompt can extract more information than just the transcriptions.
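
To make the setup concrete, the following is a minimal sketch of the idea described in the abstract, not the authors' implementation: it assumes off-the-shelf HuggingFace checkpoints (facebook/wav2vec2-base and gpt2) and a simple linear projector with naive stride-based downsampling in place of whatever adapter and downsampling scheme the paper actually uses.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, Wav2Vec2Model


class WavPromptSketch(nn.Module):
    """Audio embeddings from a (finetuned) wav2vec 2.0 encoder are prepended,
    prompt-style, to the input embeddings of a frozen GPT-2 language model."""

    def __init__(self, downsample: int = 4):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")  # finetuned
        self.lm = GPT2LMHeadModel.from_pretrained("gpt2")                        # kept frozen
        for p in self.lm.parameters():
            p.requires_grad = False
        # Hypothetical adapter: a plain linear map from the encoder dimension
        # to the LM embedding dimension (the paper's adapter may differ).
        self.proj = nn.Linear(self.encoder.config.hidden_size, self.lm.config.n_embd)
        self.downsample = downsample

    def audio_prompt(self, waveform: torch.Tensor) -> torch.Tensor:
        """waveform: (batch, samples) of 16 kHz audio -> (batch, frames, n_embd)."""
        feats = self.encoder(waveform).last_hidden_state
        feats = feats[:, :: self.downsample]          # crude sequence-length reduction
        return self.proj(feats)

    def forward(self, waveform: torch.Tensor, prompt_ids: torch.Tensor) -> torch.Tensor:
        """Next-token logits for the text prompt conditioned on the audio prefix."""
        audio_emb = self.audio_prompt(waveform)
        text_emb = self.lm.transformer.wte(prompt_ids)   # frozen token embeddings
        inputs_embeds = torch.cat([audio_emb, text_emb], dim=1)
        return self.lm(inputs_embeds=inputs_embeds).logits
```

In the few-shot setting, demonstration examples and the query would be packed into prompt_ids, and the frozen LM's logits over candidate answer tokens give the prediction; per the abstract, only the wav2vec encoder (and here the projector) is finetuned, while the language model stays frozen.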
Related papers
- Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis [13.702423348269155]
We propose a new task -- generating speech from videos of people and their transcripts (VTTS) -- to motivate new techniques for multimodal speech generation.
We present a decoder-only multimodal model for this task, which we call Visatronic.
It embeds vision, text and speech directly into the common subspace of a transformer model and uses an autoregressive loss to learn a generative model of discretized mel-spectrograms conditioned on speaker videos and transcripts of their speech.
arXiv Detail & Related papers (2024-11-26T18:57:29Z)
- Seal: Advancing Speech Language Models to be Few-Shot Learners [17.03216447533895]
This paper introduces the Seal model, an abbreviation for speech language model.
It incorporates a novel alignment method in which a Kullback-Leibler divergence loss is used to train a projector that bridges a frozen speech encoder with a frozen language model decoder.
The resulting Seal model exhibits robust performance as a few-shot learner on two speech understanding tasks (a sketch of this projector setup appears after this list).
arXiv Detail & Related papers (2024-07-20T13:28:12Z)
- Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy: pre-training with discretized visual speech units.
We set new state-of-the-art multilingual VSR performance, with a single trained model achieving results comparable to previous language-specific VSR models.
arXiv Detail & Related papers (2024-01-18T08:46:02Z)
- Teach me with a Whisper: Enhancing Large Language Models for Analyzing Spoken Transcripts using Speech Embeddings [8.660203441911554]
We propose a methodology for training language models that leverages spoken language audio data.
This yields an improved language model for analyzing spoken transcripts while avoiding audio processing overhead at test time.
In our experiments, the resulting student model achieves consistent improvements over traditional language models on tasks analyzing spoken transcripts.
arXiv Detail & Related papers (2023-11-13T01:53:12Z)
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
- TextMI: Textualize Multimodal Information for Integrating Non-verbal Cues in Pre-trained Language Models [5.668457303716451]
We propose TextMI as a general, competitive baseline for multimodal behavioral analysis tasks.
Our approach significantly reduces model complexity, adds interpretability to the model's decisions, and can be applied to a diverse set of tasks.
arXiv Detail & Related papers (2023-03-27T17:54:32Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
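
The Seal entry above says only that a Kullback-Leibler divergence loss trains a projector between a frozen speech encoder and a frozen language model decoder, without specifying which distributions the loss compares. The sketch below is one plausible reading, assumed rather than taken from that paper: the projector is trained so that the frozen LM's next-token distributions when prompted with projected speech features match those obtained when it is prompted with the paired transcript.

```python
import torch
import torch.nn.functional as F


def kl_alignment_loss(lm, projector, speech_feats, transcript_ids):
    """One assumed variant of projector training against a frozen LM.

    lm             -- a frozen HuggingFace causal LM (e.g. GPT2LMHeadModel)
    projector      -- the only trainable module: speech dim -> LM embedding dim
    speech_feats   -- (batch, T, d_speech) from a frozen speech encoder
    transcript_ids -- (batch, L) token ids of the paired transcript
    """
    with torch.no_grad():
        text_emb = lm.get_input_embeddings()(transcript_ids)
        teacher_logits = lm(inputs_embeds=text_emb).logits           # text-prompted "teacher"
    student_logits = lm(inputs_embeds=projector(speech_feats)).logits  # speech-prompted "student"

    # The mismatch between speech-frame and text-token lengths is handled here by
    # simple truncation; a real system would align or resample more carefully.
    n = min(teacher_logits.size(1), student_logits.size(1))
    teacher = F.log_softmax(teacher_logits[:, :n], dim=-1)
    student = F.log_softmax(student_logits[:, :n], dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean", log_target=True)
```

Because the language model's parameters are frozen, only the projector receives gradient updates from this loss; the exact alignment objective and length handling in the Seal paper may differ from this sketch.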