SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding
- URL: http://arxiv.org/abs/2010.02295v3
- Date: Mon, 15 Mar 2021 00:55:04 GMT
- Title: SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding
- Authors: Yu-An Chung, Chenguang Zhu, Michael Zeng
- Abstract summary: Spoken language understanding requires a model to analyze an input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
- Score: 61.02342238771685
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spoken language understanding (SLU) requires a model to analyze an
input acoustic signal to understand its linguistic content and make predictions. To
boost the models' performance, various pre-training methods have been proposed
to learn rich representations from large-scale unannotated speech and text.
However, the inherent disparities between the two modalities necessitate a
mutual analysis. In this paper, we propose a novel semi-supervised learning
framework, SPLAT, to jointly pre-train the speech and language modules. Besides
conducting a self-supervised masked language modeling task on the two
individual modules using unpaired speech and text, SPLAT aligns representations
from the two modules in a shared latent space using a small amount of paired
speech and text. Thus, during fine-tuning, the speech module alone can produce
representations carrying both acoustic information and contextual semantic
knowledge of an input acoustic signal. Experimental results verify the
effectiveness of our approach on various SLU tasks. For example, SPLAT improves
the previous state-of-the-art performance on the Spoken SQuAD dataset by more
than 10%.
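As a concrete reading of the abstract, the sketch below shows the two training signals it describes: self-supervised masked language modeling on unpaired data, and an alignment loss that pulls paired speech and text representations together in a shared latent space. This is a minimal PyTorch sketch under stated assumptions; the encoder interfaces, mean pooling, and MSE alignment objective are illustrative choices, not the paper's exact architecture.

```python
import torch.nn.functional as F
from torch import nn

class JointPretrainer(nn.Module):
    """Sketch of SPLAT-style joint pre-training (hypothetical interfaces).

    Assumes both encoders map input sequences to (B, T, hidden) tensors
    and share the same hidden size."""

    def __init__(self, speech_encoder: nn.Module, text_encoder: nn.Module,
                 hidden: int = 768, vocab_size: int = 30522):
        super().__init__()
        self.speech_encoder = speech_encoder  # e.g. a Transformer over acoustic features
        self.text_encoder = text_encoder      # e.g. a BERT-style encoder
        self.mlm_head = nn.Linear(hidden, vocab_size)

    def mlm_loss(self, masked_ids, labels):
        # Masked language modeling on unpaired text; an analogous
        # masked-prediction loss applies to the speech module.
        logits = self.mlm_head(self.text_encoder(masked_ids))
        return F.cross_entropy(logits.view(-1, logits.size(-1)),
                               labels.view(-1), ignore_index=-100)

    def align_loss(self, speech_feats, text_ids):
        # Alignment on a small amount of *paired* speech/text: pull the
        # two sequence-level embeddings together in the shared space
        # (mean pooling and MSE are assumed here, one of several options).
        s = self.speech_encoder(speech_feats).mean(dim=1)  # (B, hidden)
        t = self.text_encoder(text_ids).mean(dim=1)        # (B, hidden)
        return F.mse_loss(s, t)
```

After pre-training with both losses, the speech module alone is fine-tuned on downstream SLU tasks, which is why its outputs can carry both acoustic and semantic information.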
Related papers
- LAST: Language Model Aware Speech Tokenization [24.185165710384997] (2024-09-05)
We propose a novel approach to training a speech tokenizer by leveraging objectives from pre-trained textual LMs.
Our aim is to transform features from a pre-trained speech model into a new feature space that enables better clustering for speech LMs.
- Toward Joint Language Modeling for Speech Units and Text [89.32163954508489] (2023-10-12)
We explore joint language modeling for speech units and text.
We introduce automatic metrics to evaluate how well the joint LM mixes speech and text.
Our results show that by mixing speech units and text with our proposed mixing techniques, the joint LM improves over a speech-only baseline on SLU tasks (a sketch of one such mixing scheme follows this list).
- Few-Shot Spoken Language Understanding via Joint Speech-Text Models [18.193191170754744] (2023-10-09)
Recent work on speech representation models jointly pre-trained with text has demonstrated the potential of improving speech representations.
We leverage such shared representations to address the persistent challenge of limited data availability in spoken language understanding tasks.
By employing a pre-trained speech-text model, we find that models fine-tuned on text can be effectively transferred to speech testing data.
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043] (2023-09-01)
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053] (2022-11-21)
We propose VATLM (Visual-Audio-Text Language Model), a unified cross-modal representation learning framework.
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253] (2021-09-19)
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
- Leveraging Pre-trained Language Model for Speech Sentiment Analysis [58.78839114092951] (2021-06-11)
We explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.
We propose a pseudo-label-based semi-supervised training strategy that uses a language model within an end-to-end speech sentiment approach.
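Two of the entries above lend themselves to short illustrations. First, the joint language modeling entry ("Toward Joint Language Modeling for Speech Units and Text") rests on mixing discretized speech units and text tokens into a single stream. The sketch below shows one plausible word-level mixing scheme; the alignment granularity, unit notation, and swap probability are assumptions for illustration, not the paper's exact techniques.

```python
import random

def mix_tokens(text_words, unit_spans, swap_prob=0.3, seed=0):
    """Interleave text words with their aligned speech-unit spans into one
    token stream for a joint LM. unit_spans[i] holds the discrete speech
    units (e.g. cluster ids rendered as strings) aligned to text_words[i];
    word-level alignment and swap_prob are illustrative assumptions."""
    rng = random.Random(seed)
    mixed = []
    for word, units in zip(text_words, unit_spans):
        if rng.random() < swap_prob:
            mixed.extend(units)   # emit the speech units for this word
        else:
            mixed.append(word)    # keep the written form
    return mixed

# Example: two words, each aligned to a few speech units.
print(mix_tokens(["hello", "world"], [["<u12>", "<u7>"], ["<u33>"]]))
```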
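Second, for the CTAP entry, a common way to bring two encoders' outputs into a joint multimodal space is a CLIP-style symmetric contrastive loss over paired embeddings. The function below sketches that idea assuming in-batch negatives; it is not taken from the CTAP paper's exact objective.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(speech_emb, phoneme_emb, temperature=0.07):
    """Symmetric InfoNCE between paired speech and phoneme embeddings,
    both of shape (B, D). Treating in-batch pairs as positives is an
    assumed instantiation, not CTAP's exact objective."""
    s = F.normalize(speech_emb, dim=-1)
    p = F.normalize(phoneme_emb, dim=-1)
    logits = s @ p.t() / temperature              # (B, B) cosine similarities
    targets = torch.arange(s.size(0), device=s.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example with random embeddings for a batch of 4 pairs.
loss = contrastive_alignment_loss(torch.randn(4, 256), torch.randn(4, 256))
```

In practice the pairing can be at the frame or token level rather than the utterance level; the batch dimension here stands in for whichever granularity is used.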
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.