Speak or Chat with Me: End-to-End Spoken Language Understanding System
with Flexible Inputs
- URL: http://arxiv.org/abs/2104.05752v1
- Date: Wed, 7 Apr 2021 20:48:08 GMT
- Title: Speak or Chat with Me: End-to-End Spoken Language Understanding System
with Flexible Inputs
- Authors: Sujeong Cha, Wangrui Hou, Hyun Jung, My Phung, Michael Picheny,
Hong-Kwang Kuo, Samuel Thomas, Edmilson Morais
- Abstract summary: We propose a novel system that can predict intents from flexible types of inputs: speech, ASR transcripts, or both.
Our experiments show significant advantages for these pre-training and fine-tuning strategies, resulting in a system that achieves competitive intent-classification performance.
- Score: 21.658650440278063
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A major focus of recent research in spoken language understanding (SLU) has
been on the end-to-end approach where a single model can predict intents
directly from speech inputs without intermediate transcripts. However, this
approach presents some challenges. First, since speech can be considered as
personally identifiable information, in some cases only automatic speech
recognition (ASR) transcripts are accessible. Second, intent-labeled speech
data is scarce. To address the first challenge, we propose a novel system that
can predict intents from flexible types of inputs: speech, ASR transcripts, or
both. We demonstrate strong performance for either modality separately, and
when both speech and ASR transcripts are available, through system combination,
we achieve better results than using a single input modality. To address the
second challenge, we leverage a semantically robust pre-trained BERT model and
adopt a cross-modal system that co-trains text embeddings and acoustic
embeddings in a shared latent space. We further enhance this system by
utilizing an acoustic module pre-trained on LibriSpeech and domain-adapting the
text module on our target datasets. Our experiments show significant advantages
for these pre-training and fine-tuning strategies, resulting in a system that
achieves competitive intent-classification performance on Snips SLU and Fluent
Speech Commands datasets.
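The abstract describes the overall recipe (a pre-trained BERT text branch, a LibriSpeech-pretrained acoustic branch, co-training of both embeddings in a shared latent space, and score combination when both inputs are available) but not the exact architecture or loss. The sketch below is a minimal PyTorch illustration of that idea, not the authors' implementation: the small GRU encoders stand in for the pre-trained BERT and acoustic modules, and the MSE alignment term, layer sizes, class counts, and the averaging fusion rule are all illustrative assumptions.

```python
# Minimal sketch of a cross-modal SLU model with flexible inputs:
# a text branch and an acoustic branch are projected into a shared latent
# space, an alignment loss ties the acoustic embedding to the text embedding,
# and a single intent classifier reads from the shared space, so either
# modality (or both, via score combination) can be used at inference time.
# Encoders, dimensions, and losses are placeholders, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextEncoder(nn.Module):
    """Stand-in for a pre-trained BERT encoder mapping token ids to one utterance embedding."""
    def __init__(self, vocab_size=30522, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, token_ids):                  # (B, T_text)
        h, _ = self.rnn(self.embed(token_ids))
        return h.mean(dim=1)                       # (B, dim) pooled utterance embedding


class AcousticEncoder(nn.Module):
    """Stand-in for an acoustic module pre-trained on LibriSpeech (input: e.g. log-mel frames)."""
    def __init__(self, feat_dim=80, dim=256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, dim, batch_first=True)

    def forward(self, feats):                      # (B, T_frames, feat_dim)
        h, _ = self.rnn(feats)
        return h.mean(dim=1)                       # (B, dim)


class CrossModalSLU(nn.Module):
    def __init__(self, dim=256, latent_dim=128, num_intents=31):
        super().__init__()
        self.text_enc = TextEncoder(dim=dim)
        self.audio_enc = AcousticEncoder(dim=dim)
        # Separate projections into one shared latent space.
        self.text_proj = nn.Linear(dim, latent_dim)
        self.audio_proj = nn.Linear(dim, latent_dim)
        # One intent classifier consumes embeddings from either modality.
        self.classifier = nn.Linear(latent_dim, num_intents)

    def forward(self, token_ids=None, feats=None):
        z_text = self.text_proj(self.text_enc(token_ids)) if token_ids is not None else None
        z_audio = self.audio_proj(self.audio_enc(feats)) if feats is not None else None
        logits_text = self.classifier(z_text) if z_text is not None else None
        logits_audio = self.classifier(z_audio) if z_audio is not None else None
        return z_text, z_audio, logits_text, logits_audio


def co_training_loss(z_text, z_audio, logits_text, logits_audio, intents, align_weight=1.0):
    """Intent losses for both branches plus an (assumed) MSE term that pulls the
    acoustic embedding toward the text embedding in the shared space."""
    loss = F.cross_entropy(logits_text, intents) + F.cross_entropy(logits_audio, intents)
    return loss + align_weight * F.mse_loss(z_audio, z_text.detach())


def combine_predictions(logits_text, logits_audio):
    """Simple score-level system combination when both speech and ASR text are available."""
    return (F.softmax(logits_text, dim=-1) + F.softmax(logits_audio, dim=-1)) / 2


if __name__ == "__main__":
    model = CrossModalSLU()
    tokens = torch.randint(0, 30522, (4, 12))      # dummy ASR-transcript token ids
    feats = torch.randn(4, 200, 80)                # dummy acoustic feature frames
    intents = torch.randint(0, 31, (4,))
    z_t, z_a, lt, la = model(tokens, feats)
    print("co-training loss:", co_training_loss(z_t, z_a, lt, la, intents).item())
    print("fused intent prediction:", combine_predictions(lt, la).argmax(dim=-1))
```

Because both projections feed the same classifier, the model can run on speech alone, ASR transcripts alone, or on both with the fused scores, which mirrors the "flexible inputs" claim in the abstract.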
Related papers
- Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs).
We present a simple yet effective automatic process for creating speech-text pair data.
Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z)
- Instruction-Following Speech Recognition [21.591086644665197]
We introduce instruction-following speech recognition, training a Listen-Attend-Spell model to understand and execute a diverse set of free-form text instructions.
Remarkably, our model, trained from scratch on Librispeech, interprets and executes simple instructions without requiring Large Language Models or pre-trained speech modules.
arXiv Detail & Related papers (2023-09-18T14:59:10Z)
- BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing [35.31866559807704]
Modality alignment between speech and text remains an open problem.
We propose the BLSP approach that bootstraps Language-Speech Pre-training via behavior alignment of continuation writing.
We demonstrate that this straightforward process can extend the capabilities of LLMs to speech, enabling speech recognition, speech translation, spoken language understanding, and speech conversation, even in zero-shot cross-lingual scenarios.
arXiv Detail & Related papers (2023-09-02T11:46:05Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Improving Textless Spoken Language Understanding with Discrete Units as Intermediate Target [58.59044226658916]
Spoken Language Understanding (SLU) is a task that aims to extract semantic information from spoken utterances.
We propose to use discrete units as intermediate guidance to improve textless SLU performance.
arXiv Detail & Related papers (2023-05-29T14:00:24Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework VATLM (Visual-Audio-Text Language Model)
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- Towards Reducing the Need for Speech Training Data To Build Spoken Language Understanding Systems [29.256853083988634]
Large amounts of text data with suitable labels are usually available.
We propose a novel text representation and training methodology that allows E2E SLU systems to be effectively constructed using these text resources.
arXiv Detail & Related papers (2022-02-26T15:21:13Z)
- SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training [33.02912456062474]
We build a single encoder with the BERT objective on unlabeled text together with the w2v-BERT objective on unlabeled speech.
We demonstrate that incorporating both speech and text data during pre-training can significantly improve downstream quality on CoVoST2 speech translation.
arXiv Detail & Related papers (2021-10-20T00:59:36Z)
- Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining [64.35907499990455]
We propose a framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech.
Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT.
In parallel, we identify two essential criteria for evaluating SLU models: environmental noise-robustness and E2E semantics evaluation.
arXiv Detail & Related papers (2020-10-26T18:21:27Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.