A Study on the Integration of Pre-trained SSL, ASR, LM and SLU Models
for Spoken Language Understanding
- URL: http://arxiv.org/abs/2211.05869v1
- Date: Thu, 10 Nov 2022 20:59:13 GMT
- Title: A Study on the Integration of Pre-trained SSL, ASR, LM and SLU Models
for Spoken Language Understanding
- Authors: Yifan Peng, Siddhant Arora, Yosuke Higuchi, Yushi Ueda, Sujay Kumar,
Karthik Ganesan, Siddharth Dalmia, Xuankai Chang, Shinji Watanabe
- Abstract summary: We employ four types of pre-trained models and their combinations for spoken language understanding (SLU).
We leverage self-supervised speech and language models (LM) pre-trained on large quantities of unpaired data to extract strong speech and text representations.
We also explore using supervised models pre-trained on larger external automatic speech recognition (ASR) or SLU corpora.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Collecting sufficient labeled data for spoken language understanding (SLU) is
expensive and time-consuming. Recent studies achieved promising results by
using pre-trained models in low-resource scenarios. Inspired by this, we aim to
ask: which (if any) pre-training strategies can improve performance across SLU
benchmarks? To answer this question, we employ four types of pre-trained models
and their combinations for SLU. We leverage self-supervised speech and language
models (LM) pre-trained on large quantities of unpaired data to extract strong
speech and text representations. We also explore using supervised models
pre-trained on larger external automatic speech recognition (ASR) or SLU
corpora. We conduct extensive experiments on the SLU Evaluation (SLUE)
benchmark and observe that self-supervised pre-trained models are more powerful,
with the pre-trained LM and speech models being most beneficial for the Sentiment
Analysis and Named Entity Recognition tasks, respectively.
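
A minimal sketch of one combination the abstract describes: a self-supervised speech encoder supplies acoustic features while a pre-trained LM encodes a transcript, and the pooled representations are fused for an SLU label such as sentiment. The checkpoints, mean-pooling, and the 3-way fusion head are illustrative assumptions, not the authors' ESPnet recipes.

```python
import torch
from transformers import (AutoModel, AutoTokenizer,
                          Wav2Vec2FeatureExtractor, Wav2Vec2Model)

# Self-supervised speech encoder and pre-trained LM (assumed checkpoints).
speech_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
lm = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Hypothetical fusion head: concatenate pooled speech and text vectors,
# then classify into e.g. 3 sentiment labels.
fusion_head = torch.nn.Linear(
    speech_encoder.config.hidden_size + lm.config.hidden_size, 3)

def slu_logits(waveform, transcript):
    audio = feature_extractor(waveform, sampling_rate=16_000, return_tensors="pt")
    text = tokenizer(transcript, return_tensors="pt")
    with torch.no_grad():
        # Mean-pool frame-level and token-level hidden states.
        speech_repr = speech_encoder(**audio).last_hidden_state.mean(dim=1)
        text_repr = lm(**text).last_hidden_state.mean(dim=1)
    return fusion_head(torch.cat([speech_repr, text_repr], dim=-1))

# Silence as a stand-in for 1 s of 16 kHz speech.
logits = slu_logits([0.0] * 16_000, "i really enjoyed the movie")
print(logits.shape)  # torch.Size([1, 3])
```

In the paper's low-resource setting the fusion head (and optionally the encoders) would be fine-tuned on SLUE training data; the frozen forward pass above only illustrates how the two representation streams meet.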
Related papers
- VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z)
- Pushing the Limits of Unsupervised Unit Discovery for SSL Speech Representation [12.506633315768832]
HuBERT is a successful example that uses offline clustering to convert speech features into discrete units for a masked language modeling pretext task; a minimal sketch of this clustering step appears after this entry.
We present an unsupervised method to improve SSL targets.
Two models are proposed, MonoBERT and PolyBERT, which leverage context-independent and context-dependent phoneme-based units for pre-training.
arXiv Detail & Related papers (2023-06-15T07:45:12Z)
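
The sketch below illustrates the offline-clustering step the summary above describes: continuous SSL features are quantized with k-means so each frame gets a discrete unit id that can serve as a masked-prediction target. The checkpoint, layer choice, and cluster count are illustrative assumptions, not the MonoBERT/PolyBERT recipe.

```python
import torch
from sklearn.cluster import KMeans
from transformers import HubertModel, Wav2Vec2FeatureExtractor

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
model = HubertModel.from_pretrained("facebook/hubert-base-ls960")

waveform = [0.0] * 32_000  # stand-in for 2 s of 16 kHz speech
inputs = extractor(waveform, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    # hidden_states[6]: (1, frames, 768) features from an intermediate layer
    hidden = model(**inputs, output_hidden_states=True).hidden_states[6]

frames = hidden.squeeze(0).numpy()
# In practice the k-means codebook is fit offline over a large corpus of
# such frames; a tiny fit on one utterance illustrates the mapping.
units = KMeans(n_clusters=50, n_init=10).fit_predict(frames)
print(units[:10])  # one discrete unit id per ~20 ms frame
```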
- Improving Textless Spoken Language Understanding with Discrete Units as Intermediate Target [58.59044226658916]
Spoken Language Understanding (SLU) is a task that aims to extract semantic information from spoken utterances.
We propose to use discrete units as intermediate guidance to improve textless SLU performance.
arXiv Detail & Related papers (2023-05-29T14:00:24Z)
- The Interpreter Understands Your Meaning: End-to-end Spoken Language Understanding Aided by Speech Translation [13.352795145385645]
Speech translation (ST) is a good means of pretraining speech models for end-to-end spoken language understanding.
We show that our models outperform baselines on monolingual and multilingual intent classification.
We also create new benchmark datasets for speech summarization and low-resource/zero-shot transfer from English to French or Spanish.
arXiv Detail & Related papers (2023-05-16T17:53:03Z)
- Bridging Speech and Textual Pre-trained Models with Unsupervised ASR [70.61449720963235]
This work proposes a simple yet efficient unsupervised paradigm that connects speech and textual pre-trained models.
We show that unsupervised automatic speech recognition (ASR) can improve the representations from speech self-supervised models.
Notably, on spoken question answering, we reach state-of-the-art results on the challenging NMSQA benchmark.
arXiv Detail & Related papers (2022-11-06T04:50:37Z)
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- Analyzing the factors affecting usefulness of Self-Supervised Pre-trained Representations for Speech Recognition [1.0705399532413615]
Self-supervised learning (SSL) of high-level speech representations has been a popular approach to building Automatic Speech Recognition systems.
We study the effect of domain, language, dataset size, and other aspects of our upstream SSL pre-training data on the final performance of a low-resource downstream ASR task.
arXiv Detail & Related papers (2022-03-31T11:48:24Z)
- How much pretraining data do language models need to learn syntax? [12.668478784932878]
Transformer-based pretrained language models achieve outstanding results on many well-known NLU benchmarks.
We study the impact of pretraining data size on the knowledge of the models using RoBERTa.
arXiv Detail & Related papers (2021-09-07T15:51:39Z)
- Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining [64.35907499990455]
We propose a framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech.
Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT; a minimal cascade sketch appears after this list.
In parallel, we identify two essential criteria for evaluating SLU models: environmental noise-robustness and E2E semantics evaluation.
arXiv Detail & Related papers (2020-10-26T18:21:27Z)
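
The sketch below illustrates the two-stage pattern shared by the last entry (and the "Bridging" paper above): a pre-trained ASR model produces a transcript, and a pre-trained LM then extracts the semantics. It is a plain cascade under assumed checkpoints, not the paper's framework, which trains the stages with semi-supervision; the sentiment classifier stands in for a task-specific fine-tuned BERT.

```python
from transformers import pipeline

# Assumed off-the-shelf checkpoints for the two stages.
asr = pipeline("automatic-speech-recognition",
               model="facebook/wav2vec2-base-960h")
nlu = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english")

def speech_to_semantics(audio_path):
    transcript = asr(audio_path)["text"]   # stage 1: E2E ASR
    return transcript, nlu(transcript)[0]  # stage 2: LM-based semantics

# Example (requires a real 16 kHz recording):
# transcript, label = speech_to_semantics("utterance.wav")
```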