Towards Reducing the Need for Speech Training Data To Build Spoken
Language Understanding Systems
- URL: http://arxiv.org/abs/2203.00006v1
- Date: Sat, 26 Feb 2022 15:21:13 GMT
- Title: Towards Reducing the Need for Speech Training Data To Build Spoken
Language Understanding Systems
- Authors: Samuel Thomas, Hong-Kwang J. Kuo, Brian Kingsbury, George Saon
- Abstract summary: Large amounts of text data with suitable labels are usually available.
We propose a novel text representation and training methodology that allows E2E SLU systems to be effectively constructed using these text resources.
- Score: 29.256853083988634
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The lack of speech data annotated with labels required for spoken language
understanding (SLU) is often a major hurdle in building end-to-end (E2E)
systems that can directly process speech inputs. In contrast, large amounts of
text data with suitable labels are usually available. In this paper, we propose
a novel text representation and training methodology that allows E2E SLU
systems to be effectively constructed using these text resources. With very
limited amounts of additional speech, we show that these models can be further
improved to perform at levels close to similar systems built on the full speech
datasets. The efficacy of our proposed approach is demonstrated on both intent
and entity tasks using three different SLU datasets. With text-only training,
the proposed system achieves up to 90% of the performance possible with full
speech training. With just an additional 10% of speech data, these models
significantly improve further to 97% of full performance.
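To make the two-stage recipe the abstract describes more concrete, here is a minimal PyTorch sketch: first train an intent classifier on labeled text mapped into an utterance-level embedding space, then fine-tune with a small amount of paired speech. The module names, mean-pooling scheme, and all dimensions are illustrative assumptions for this sketch, not the paper's actual text representation or architecture.

```python
# Illustrative two-stage recipe: text-only SLU training followed by
# fine-tuning on a small amount of speech. Module names, pooling, and
# dimensions are assumptions for the sketch, not the paper's architecture.
import torch
import torch.nn as nn

torch.manual_seed(0)
EMBED_DIM, NUM_INTENTS, VOCAB, FEAT_DIM = 256, 7, 1000, 80

class TextFrontEnd(nn.Module):
    """Maps token ids to one utterance-level embedding (assumed design)."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMBED_DIM)

    def forward(self, tokens):                 # (batch, time)
        return self.emb(tokens).mean(dim=1)    # (batch, EMBED_DIM)

class SpeechFrontEnd(nn.Module):
    """Maps acoustic features into the same embedding space (assumed)."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(FEAT_DIM, EMBED_DIM)

    def forward(self, feats):                  # (batch, time, FEAT_DIM)
        return self.proj(feats).mean(dim=1)    # (batch, EMBED_DIM)

text_fe, speech_fe = TextFrontEnd(), SpeechFrontEnd()
classifier = nn.Linear(EMBED_DIM, NUM_INTENTS)  # shared across both stages
loss_fn = nn.CrossEntropyLoss()

# Stage 1: train on abundant labeled text (dummy batch stands in for data).
opt = torch.optim.Adam(
    list(text_fe.parameters()) + list(classifier.parameters()), lr=1e-3)
tokens = torch.randint(0, VOCAB, (32, 12))
intents = torch.randint(0, NUM_INTENTS, (32,))
for _ in range(10):
    opt.zero_grad()
    loss_fn(classifier(text_fe(tokens)), intents).backward()
    opt.step()

# Stage 2: fine-tune with a small amount of paired speech; the abstract
# reports reaching ~97% of full performance with ~10% of the speech data.
opt = torch.optim.Adam(
    list(speech_fe.parameters()) + list(classifier.parameters()), lr=1e-4)
feats = torch.randn(8, 200, FEAT_DIM)
sp_intents = torch.randint(0, NUM_INTENTS, (8,))
for _ in range(5):
    opt.zero_grad()
    loss_fn(classifier(speech_fe(feats)), sp_intents).backward()
    opt.step()
```

The sketch shares the classifier between the two stages, which mirrors the abstract's central claim: text-only training already places the system at up to 90% of full-speech performance before any speech is seen.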
Related papers
- Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs).
We present a simple yet effective automatic process for creating speech-text pair data.
Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z)
- Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection [49.27067541740956]
Speech Emotion Recognition (SER) is a crucial component in developing general-purpose AI agents capable of natural human-computer interaction.
Building robust multilingual SER systems remains challenging due to the scarcity of labeled data in languages other than English and Chinese.
We propose an approach to enhance SER performance in low-resource languages by leveraging data from high-resource languages.
arXiv Detail & Related papers (2024-09-17T08:36:45Z)
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another.
We tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM builds on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- Speak or Chat with Me: End-to-End Spoken Language Understanding System with Flexible Inputs [21.658650440278063]
We propose a novel system that can predict intents from flexible types of inputs: speech, ASR transcripts, or both.
Our experiments show significant advantages for these pre-training and fine-tuning strategies, resulting in a system that achieves competitive intent-classification performance.
arXiv Detail & Related papers (2021-04-07T20:48:08Z)
- Exploring Transfer Learning For End-to-End Spoken Language Understanding [8.317084844841323]
An end-to-end (E2E) system that goes directly from speech to a hypothesis is a more attractive option than a cascaded pipeline.
We propose an E2E system that is designed to jointly train on multiple speech-to-text tasks.
We show that it beats the performance of E2E models trained on individual tasks.
arXiv Detail & Related papers (2020-12-15T19:02:15Z)
- Leveraging Unpaired Text Data for Training End-to-End Speech-to-Intent Systems [39.79749518035203]
Training an end-to-end (E2E) neural network speech-to-intent system that directly extracts intents from speech requires large amounts of intent-labeled speech data.
We implement a CTC-based S2I system that matches the performance of a state-of-the-art, traditional cascaded SLU system; a minimal sketch of the CTC objective appears after this list.
arXiv Detail & Related papers (2020-10-08T22:16:26Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze an input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
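The CTC-based speech-to-intent entry above (Leveraging Unpaired Text Data for Training End-to-End Speech-to-Intent Systems) lends itself to a short illustration. Below is a minimal sketch of a CTC objective for direct speech-to-intent training using torch.nn.CTCLoss; the encoder, vocabulary size, and target layout (an intent token included in the label sequence) are assumptions made for illustration, not details confirmed by the abstract.

```python
# Minimal CTC objective for direct speech-to-intent training. The encoder,
# vocabulary size, and target layout (intent token included in the label
# sequence) are illustrative assumptions, not details from the abstract.
import torch
import torch.nn as nn

torch.manual_seed(0)
BLANK, VOCAB, FEAT_DIM = 0, 50, 80

# Frame-level encoder emitting per-frame token logits (stand-in model).
encoder = nn.Sequential(
    nn.Linear(FEAT_DIM, 128), nn.ReLU(), nn.Linear(128, VOCAB))
ctc = nn.CTCLoss(blank=BLANK)

feats = torch.randn(4, 120, FEAT_DIM)                       # (batch, time, feat)
log_probs = encoder(feats).log_softmax(-1).transpose(0, 1)  # (time, batch, vocab)
targets = torch.randint(1, VOCAB, (4, 15))   # word tokens plus an intent token
input_lengths = torch.full((4,), 120, dtype=torch.long)
target_lengths = torch.full((4,), 15, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
print(float(loss))
```

At inference, the decoded token sequence would carry the intent token alongside the words, so a single CTC model serves both transcription and intent prediction.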
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information and is not responsible for any consequences of its use.