Exploring Transfer Learning For End-to-End Spoken Language Understanding
- URL: http://arxiv.org/abs/2012.08549v1
- Date: Tue, 15 Dec 2020 19:02:15 GMT
- Title: Exploring Transfer Learning For End-to-End Spoken Language Understanding
- Authors: Subendhu Rongali, Beiye Liu, Liwei Cai, Konstantine Arkoudas, Chengwei
Su, and Wael Hamza
- Abstract summary: An end-to-end (E2E) system that goes directly from speech to a hypothesis is a more attractive option.
We propose an E2E system that is designed to jointly train on multiple speech-to-text tasks.
We show that it beats the performance of E2E models trained on individual tasks.
- Score: 8.317084844841323
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Voice Assistants such as Alexa, Siri, and Google Assistant typically use a
two-stage Spoken Language Understanding pipeline; first, an Automatic Speech
Recognition (ASR) component to process customer speech and generate text
transcriptions, followed by a Natural Language Understanding (NLU) component to
map transcriptions to an actionable hypothesis. An end-to-end (E2E) system that
goes directly from speech to a hypothesis is a more attractive option. These
systems were shown to be smaller, faster, and better optimized. However, they
require massive amounts of end-to-end training data and, in addition, don't take
advantage of the already available ASR and NLU training data.
In this work, we propose an E2E system that is designed to jointly train on
multiple speech-to-text tasks, such as ASR (speech-transcription) and SLU
(speech-hypothesis), and text-to-text tasks, such as NLU (text-hypothesis). We
call this the Audio-Text All-Task (AT-AT) Model and we show that it beats the
performance of E2E models trained on individual tasks, especially ones trained
on limited data. We show this result on an internal music dataset and two
public datasets, FluentSpeech and SNIPS Audio, where we achieve
state-of-the-art results. Since our model can process both speech and text
input sequences and learn to predict a target sequence, it also allows us to do
zero-shot E2E SLU by training on only text-hypothesis data (without any speech)
from a new domain. We evaluate this ability of our model on the Facebook TOP
dataset and set a new benchmark for zero-shot E2E performance. We will soon
release the audio data collected for the TOP dataset for future research.
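The joint-training idea behind the AT-AT model can be illustrated with a minimal sketch. Everything below is hypothetical (the task names, toy data, and the `sample_mixed_batch` helper are illustrative, not from the paper): a single shared sequence-to-sequence model is updated on batches mixed across ASR (speech-to-transcription), NLU (text-to-hypothesis), and SLU (speech-to-hypothesis) examples, which is how limited SLU data can borrow signal from the more plentiful ASR and NLU data.

```python
import random

# Toy multi-task pools: each task maps an input sequence (audio features
# or text tokens) to a target sequence. Real data would replace these.
TASKS = {
    "ASR": [("speech_1", "transcript_1"), ("speech_2", "transcript_2")],
    "NLU": [("transcript_1", "hypothesis_1")],
    "SLU": [("speech_1", "hypothesis_1")],
}

def sample_mixed_batch(tasks, batch_size, rng):
    """Draw (task, input, target) triples uniformly from the pooled tasks,
    so every gradient step sees a mixture of speech-to-text and
    text-to-text examples feeding one shared model."""
    pool = [(name, x, y) for name, pairs in tasks.items() for x, y in pairs]
    return [rng.choice(pool) for _ in range(batch_size)]

rng = random.Random(0)
batch = sample_mixed_batch(TASKS, 4, rng)
# Each element carries its task label, so the shared model can route
# the input through the matching (audio or text) encoder.
```

A real implementation would replace the toy pools with dataset loaders and weight the sampling per task; the sketch only shows the batch-mixing step that makes joint training possible.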
Related papers
- Distilling an End-to-End Voice Assistant Without Instruction Training Data [53.524071162124464]
Distilled Voice Assistant (DiVA) generalizes to Question Answering, Classification, and Translation.
We show that DiVA better meets user preferences, achieving a 72% win rate compared with state-of-the-art models like Qwen 2 Audio.
arXiv Detail & Related papers (2024-10-03T17:04:48Z)
- Improving End-to-End Speech Processing by Efficient Text Data Utilization with Latent Synthesis [17.604583337593677]
Training a high performance end-to-end speech (E2E) processing model requires an enormous amount of labeled speech data.
We propose Latent Synthesis (LaSyn), an efficient textual data utilization framework for E2E speech processing models.
arXiv Detail & Related papers (2023-10-09T03:10:49Z)
- Leveraging Large Text Corpora for End-to-End Speech Summarization [58.673480990374635]
End-to-end speech summarization (E2E SSum) is a technique to directly generate summary sentences from speech.
We present two novel methods that leverage a large amount of external text summarization data for E2E SSum training.
arXiv Detail & Related papers (2023-03-02T05:19:49Z)
- ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition [100.30565531246165]
Speech recognition systems require dataset-specific tuning.
This tuning requirement can lead to systems failing to generalise to other datasets and domains.
We introduce the End-to-end Speech Benchmark (ESB) for evaluating the performance of a single automatic speech recognition system.
arXiv Detail & Related papers (2022-10-24T15:58:48Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- End-to-End Spoken Language Understanding: Performance analyses of a voice command task in a low resource setting [0.3867363075280543]
We present a study identifying the signal features and other linguistic properties used by an E2E model to perform the Spoken Language Understanding task.
The study is carried out in the application domain of a smart home that has to handle non-English (here French) voice commands.
arXiv Detail & Related papers (2022-07-17T13:51:56Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech [7.476901945542385]
We present an end-to-end text-to-speech (E2E-TTS) model with a simplified training pipeline that outperforms a cascade of separately learned models.
Our proposed model jointly trains FastSpeech2 and HiFi-GAN with an alignment module.
Experiments on the LJSpeech corpus show that the proposed model outperforms publicly available, state-of-the-art implementations of ESPNet2-TTS.
arXiv Detail & Related papers (2022-03-31T07:25:11Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another.
We tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.