Leveraging Unpaired Text Data for Training End-to-End Speech-to-Intent
Systems
- URL: http://arxiv.org/abs/2010.04284v1
- Date: Thu, 8 Oct 2020 22:16:26 GMT
- Authors: Yinghui Huang, Hong-Kwang Kuo, Samuel Thomas, Zvi Kons, Kartik
Audhkhasi, Brian Kingsbury, Ron Hoory, Michael Picheny
- Abstract summary: Training an end-to-end (E2E) neural network speech-to-intent system that directly extracts intents from speech requires large amounts of intent-labeled speech data.
We implement a CTC-based S2I system that matches the performance of a state-of-the-art, traditional cascaded SLU system.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training an end-to-end (E2E) neural network speech-to-intent (S2I) system
that directly extracts intents from speech requires large amounts of
intent-labeled speech data, which is time consuming and expensive to collect.
Initializing the S2I model with an ASR model trained on copious speech data can
alleviate data sparsity. In this paper, we attempt to leverage NLU text
resources. We implemented a CTC-based S2I system that matches the performance
of a state-of-the-art, traditional cascaded SLU system. We performed controlled
experiments with varying amounts of speech and text training data. When only a
tenth of the original data is available, intent classification accuracy
degrades by 7.6% absolute. Assuming we have additional text-to-intent data
(without speech) available, we investigated two techniques to improve the S2I
system: (1) transfer learning, in which acoustic embeddings for intent
classification are tied to fine-tuned BERT text embeddings; and (2) data
augmentation, in which the text-to-intent data is converted into
speech-to-intent data using a multi-speaker text-to-speech system. The proposed
approaches recover 80% of performance lost due to using limited intent-labeled
speech.
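The first technique described in the abstract, tying acoustic embeddings for intent classification to fine-tuned BERT text embeddings, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the joint loss below (intent cross-entropy plus a mean-squared-error tying term that pulls the acoustic embedding toward a frozen text embedding) is an assumed formulation, and all function and parameter names are illustrative.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def joint_loss(acoustic_emb, text_emb, intent_logits, intent_label, tie_weight=1.0):
    """Hypothetical training objective for embedding tying.

    Combines cross-entropy on the predicted intent with an L2 "tying"
    term that pulls the acoustic embedding toward the (frozen) BERT
    text embedding of the same utterance.
    """
    probs = softmax(intent_logits)
    ce = -np.log(probs[intent_label] + 1e-12)          # intent classification loss
    tie = np.mean((acoustic_emb - text_emb) ** 2)      # embedding-matching loss
    return ce + tie_weight * tie
```

When the acoustic embedding already matches the text embedding, the tying term vanishes and only the classification loss remains; during training, the gradient of the tying term moves the acoustic encoder's output toward the text representation.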
Related papers
- Leveraging Large Text Corpora for End-to-End Speech Summarization [58.673480990374635]
End-to-end speech summarization (E2E SSum) is a technique to directly generate summary sentences from speech.
We present two novel methods that leverage a large amount of external text summarization data for E2E SSum training.
arXiv Detail & Related papers (2023-03-02T05:19:49Z)
- Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation [94.80029087828888]
Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST.
Direct S2ST suffers from the data scarcity problem because the corpora from speech of the source language to speech of the target language are very rare.
We propose in this paper a Speech2S model, which is jointly pre-trained with unpaired speech and bilingual text data for direct speech-to-speech translation tasks.
arXiv Detail & Related papers (2022-10-31T02:55:51Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Towards Reducing the Need for Speech Training Data To Build Spoken Language Understanding Systems [29.256853083988634]
Large amounts of text data with suitable labels are usually available.
We propose a novel text representation and training methodology that allows E2E SLU systems to be effectively constructed using these text resources.
arXiv Detail & Related papers (2022-02-26T15:21:13Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another.
We tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- A study on the efficacy of model pre-training in developing neural text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z)
- Speak or Chat with Me: End-to-End Spoken Language Understanding System with Flexible Inputs [21.658650440278063]
We propose a novel system that can predict intents from flexible types of inputs: speech, ASR transcripts, or both.
Our experiments show significant advantages for these pre-training and fine-tuning strategies, resulting in a system that achieves competitive intent-classification performance.
arXiv Detail & Related papers (2021-04-07T20:48:08Z)
- Exploring Transfer Learning For End-to-End Spoken Language Understanding [8.317084844841323]
An end-to-end (E2E) system that goes directly from speech to a hypothesis is a more attractive option.
We propose an E2E system that is designed to jointly train on multiple speech-to-text tasks.
We show that it beats the performance of E2E models trained on individual tasks.
arXiv Detail & Related papers (2020-12-15T19:02:15Z)