CIF-PT: Bridging Speech and Text Representations for Spoken Language
Understanding via Continuous Integrate-and-Fire Pre-Training
- URL: http://arxiv.org/abs/2305.17499v1
- Date: Sat, 27 May 2023 15:39:13 GMT
- Title: CIF-PT: Bridging Speech and Text Representations for Spoken Language
Understanding via Continuous Integrate-and-Fire Pre-Training
- Authors: Linhao Dong, Zhecheng An, Peihao Wu, Jun Zhang, Lu Lu, Zejun Ma
- Abstract summary: We propose a novel pre-training paradigm termed Continuous Integrate-and-Fire Pre-Training (CIF-PT).
It relies on a simple but effective frame-to-token alignment, continuous integrate-and-fire (CIF), to bridge the representations between speech and text.
CIF-PT outperforms the state-of-the-art model by 1.94% in accuracy and 2.71% in SLU-F1 on the tasks of intent classification and slot filling, respectively.
- Score: 16.361505093510665
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech and text representations generated by pre-trained models contain
modality-specific information that can be combined to benefit spoken language
understanding (SLU) tasks. In this work, we propose a novel pre-training
paradigm termed Continuous Integrate-and-Fire Pre-Training (CIF-PT). It relies
on a simple but effective frame-to-token alignment, continuous
integrate-and-fire (CIF), to bridge the representations between speech and
text. It jointly performs speech-to-text training and language model
distillation through CIF as the pre-training (PT). Evaluated on the SLU
benchmark SLURP dataset, CIF-PT outperforms the state-of-the-art model by 1.94%
in accuracy and 2.71% in SLU-F1 on the tasks of intent classification and slot
filling, respectively. We also observe that the cross-modal representation
extracted by CIF-PT achieves better performance than other neural interfaces
for SLU tasks, including the dominant speech representation learned from
self-supervised pre-training.
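Note: the code below is a minimal, illustrative sketch of the generic continuous integrate-and-fire (CIF) alignment that the abstract refers to, not the authors' implementation; the function and variable names (cif, alphas, threshold) are made up for illustration. CIF predicts a scalar weight for each speech frame, accumulates these weights along time, and "fires" an integrated token-level representation whenever the accumulated weight crosses a threshold, which is how frame-level speech features get aligned to token-level text for the joint speech-to-text and language-model-distillation training described above. Details such as scaling the weights to the target token count during training and tail handling at the end of an utterance are omitted here.

    import torch

    def cif(encoder_out: torch.Tensor, alphas: torch.Tensor, threshold: float = 1.0) -> torch.Tensor:
        """Integrate frame-level features into token-level features (single utterance).

        encoder_out: (T, D) frame-level speech representations.
        alphas:      (T,)   per-frame weights in (0, 1), e.g. from a small
                            projection + sigmoid head on the encoder output.
        Returns an (N, D) tensor of fired token-level representations,
        where N is roughly alphas.sum() / threshold.
        """
        fired = []
        accum_weight = encoder_out.new_zeros(())                   # weight integrated so far
        accum_state = encoder_out.new_zeros(encoder_out.size(1))   # embedding integrated so far

        for h_t, a_t in zip(encoder_out, alphas):
            if accum_weight + a_t < threshold:
                # No token boundary inside this frame: keep integrating.
                accum_weight = accum_weight + a_t
                accum_state = accum_state + a_t * h_t
            else:
                # A boundary fires inside this frame: the part of a_t that completes
                # the current token is used now, the remainder starts the next token.
                fire_part = threshold - accum_weight
                fired.append(accum_state + fire_part * h_t)
                remainder = a_t - fire_part
                accum_weight = remainder
                accum_state = remainder * h_t

        if not fired:
            return encoder_out.new_zeros(0, encoder_out.size(1))
        return torch.stack(fired)

    # Toy usage: 50 frames of 256-dim features yield roughly sum(alphas) token vectors.
    frames = torch.randn(50, 256)
    alphas = torch.sigmoid(torch.randn(50))
    token_reprs = cif(frames, alphas)

In CIF-PT, such fired token-level vectors would act as the cross-modal interface: per the abstract, they are trained with a speech-to-text objective and distilled toward a text language model's representations; the exact loss formulation is not reproduced in this sketch.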
Related papers
- Uni-Sign: Toward Unified Sign Language Understanding at Scale [90.76641997060513]
We propose a unified pre-training framework that eliminates the gap between pre-training and downstream SLU tasks.
Uni-Sign achieves state-of-the-art performance across multiple downstream SLU tasks.
arXiv Detail & Related papers (2025-01-25T11:51:23Z) - Improving Textless Spoken Language Understanding with Discrete Units as
Intermediate Target [58.59044226658916]
Spoken Language Understanding (SLU) is a task that aims to extract semantic information from spoken utterances.
We propose to use discrete units as intermediate guidance to improve textless SLU performance.
arXiv Detail & Related papers (2023-05-29T14:00:24Z) - VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for
Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework VATLM (Visual-Audio-Text Language Model)
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z) - Bridging Speech and Textual Pre-trained Models with Unsupervised ASR [70.61449720963235]
This work proposes a simple yet efficient unsupervised paradigm that connects speech and textual pre-trained models.
We show that unsupervised automatic speech recognition (ASR) can improve the representations from speech self-supervised models.
Notably, on spoken question answering, we reach the state-of-the-art result over the challenging NMSQA benchmark.
arXiv Detail & Related papers (2022-11-06T04:50:37Z) - WaBERT: A Low-resource End-to-end Model for Spoken Language
Understanding and Speech-to-BERT Alignment [2.7505260301752763]
We propose a novel end-to-end model combining the speech model and the language model for SLU tasks.
WaBERT is based on the pre-trained speech and language model, hence training from scratch is not needed.
arXiv Detail & Related papers (2022-04-22T02:14:40Z) - SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text
Joint Pre-Training [33.02912456062474]
We build a single encoder with the BERT objective on unlabeled text together with the w2v-BERT objective on unlabeled speech.
We demonstrate that incorporating both speech and text data during pre-training can significantly improve downstream quality on CoVoST2 speech translation.
arXiv Detail & Related papers (2021-10-20T00:59:36Z) - Pre-training for Spoken Language Understanding with Joint Textual and
Phonetic Representation Learning [4.327558819000435]
We propose a novel joint textual-phonetic pre-training approach for learning spoken language representations.
Experimental results on spoken language understanding benchmarks, Fluent Speech Commands and SNIPS, show that the proposed approach significantly outperforms strong baseline models.
arXiv Detail & Related papers (2021-04-21T05:19:13Z) - Speak or Chat with Me: End-to-End Spoken Language Understanding System
with Flexible Inputs [21.658650440278063]
We propose a novel system that can predict intents from flexible types of inputs: speech, ASR transcripts, or both.
Our experiments show significant advantages for these pre-training and fine-tuning strategies, resulting in a system that achieves competitive intent-classification performance.
arXiv Detail & Related papers (2021-04-07T20:48:08Z) - Semi-Supervised Spoken Language Understanding via Self-Supervised Speech
and Language Model Pretraining [64.35907499990455]
We propose a framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech.
Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT.
In parallel, we identify two essential criteria for evaluating SLU models: environmental noise-robustness and E2E semantics evaluation.
arXiv Detail & Related papers (2020-10-26T18:21:27Z) - ST-BERT: Cross-modal Language Model Pre-training For End-to-end Spoken
Language Understanding [23.367329217151084]
We introduce a cross-modal pre-trained language model, called Speech-Text BERT (ST-BERT), to tackle end-to-end spoken language understanding tasks.
Taking phoneme posterior and subword-level text as an input, ST-BERT learns a contextualized cross-modal alignment.
Our method shows further SLU performance gain via domain-adaptive pre-training with domain-specific speech-text pair data.
arXiv Detail & Related papers (2020-10-23T10:28:20Z) - SPLAT: Speech-Language Joint Pre-Training for Spoken Language
Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)