Towards Semi-Supervised Semantics Understanding from Speech
- URL: http://arxiv.org/abs/2011.06195v1
- Date: Wed, 11 Nov 2020 01:48:09 GMT
- Title: Towards Semi-Supervised Semantics Understanding from Speech
- Authors: Cheng-I Lai, Jin Cao, Sravan Bodapati, Shang-Wen Li
- Abstract summary: We propose a framework to learn semantics directly from speech with semi-supervision from transcribed speech to address these shortcomings.
Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT, and fine-tuned on a limited amount of target SLU data.
- Score: 15.672850567147854
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Much recent work on Spoken Language Understanding (SLU) falls short in at
least one of three ways: models were trained on oracle text input and neglected
the Automatic Speech Recognition (ASR) outputs, models were trained to predict
only intents without the slot values, or models were trained on a large amount
of in-house data. We propose a clean and general framework that learns semantics
directly from speech, with semi-supervision from transcribed speech, to address
these shortcomings. Our framework is built upon pretrained end-to-end (E2E) ASR
and self-supervised language models, such as BERT, and is fine-tuned on a limited
amount of target SLU data. In parallel, we identified two settings under which
SLU models have been inadequately tested: noise robustness and E2E semantics
evaluation. We tested the proposed framework under realistic environmental noise
and with a new metric, the slots edit F1 score, on two public SLU corpora.
Experiments show that, even with environmental noise present and only a limited
amount of labeled semantics data available, our SLU framework with speech as
input performs on par with its oracle-text counterparts in semantics
understanding.
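The framework above composes a pretrained E2E ASR front end with a BERT back end for semantics prediction. The following is a minimal sketch of that two-stage idea, not the authors' implementation: the `asr_model.transcribe` interface and the slot tag set are hypothetical placeholders, while the BERT classes are from the Hugging Face `transformers` library.

```python
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

# Illustrative slot tag set; the real inventory comes from the target SLU corpus.
SLOT_LABELS = ["O", "B-artist", "I-artist", "B-song", "I-song"]

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
slot_tagger = BertForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(SLOT_LABELS)
)

def predict_slots(audio, asr_model):
    """Transcribe with the pretrained ASR model, then tag slots with BERT."""
    hypothesis = asr_model.transcribe(audio)  # hypothetical ASR interface
    inputs = tokenizer(hypothesis, return_tensors="pt")
    with torch.no_grad():
        logits = slot_tagger(**inputs).logits  # shape: (1, seq_len, num_labels)
    tags = [SLOT_LABELS[i] for i in logits.argmax(dim=-1)[0].tolist()]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return list(zip(tokens, tags))
```

The slots edit F1 score is named but not defined in the abstract. The sketch below is one plausible reading, assuming each slot name occurs at most once per utterance: a predicted slot with the right name but a wrong value (e.g., an ASR error inside the value) counts as a substitution rather than as an unrelated insertion/deletion pair.

```python
def slots_edit_f1(reference: dict, hypothesis: dict) -> float:
    """Slot-level F1 over {slot_name: value} dicts, with substitutions."""
    correct = substitutions = 0
    for name, ref_value in reference.items():
        if name in hypothesis:
            if hypothesis[name] == ref_value:
                correct += 1
            else:
                substitutions += 1  # right slot name, wrong value
    deletions = len(reference) - correct - substitutions    # missed slots
    insertions = len(hypothesis) - correct - substitutions  # spurious slots
    precision = correct / max(correct + substitutions + insertions, 1)
    recall = correct / max(correct + substitutions + deletions, 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)

# One correct slot plus one substituted value ("you" misrecognized as "u")
# gives precision = recall = 0.5, so F1 = 0.5.
print(slots_edit_f1(
    {"artist": "adele", "song": "someone like you"},
    {"artist": "adele", "song": "someone like u"},
))
```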
Related papers
- Improving Textless Spoken Language Understanding with Discrete Units as Intermediate Target [58.59044226658916]
Spoken Language Understanding (SLU) is a task that aims to extract semantic information from spoken utterances.
We propose to use discrete units as intermediate guidance to improve textless SLU performance.
arXiv Detail & Related papers (2023-05-29T14:00:24Z)
- Bridging Speech and Textual Pre-trained Models with Unsupervised ASR [70.61449720963235]
This work proposes a simple yet efficient unsupervised paradigm that connects speech and textual pre-trained models.
We show that unsupervised automatic speech recognition (ASR) can improve the representations from speech self-supervised models.
Notably, on spoken question answering, we reach the state-of-the-art result over the challenging NMSQA benchmark.
arXiv Detail & Related papers (2022-11-06T04:50:37Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- Pre-training for Spoken Language Understanding with Joint Textual and Phonetic Representation Learning [4.327558819000435]
We propose a novel joint textual-phonetic pre-training approach for learning spoken language representations.
Experimental results on spoken language understanding benchmarks, Fluent Speech Commands and SNIPS, show that the proposed approach significantly outperforms strong baseline models.
arXiv Detail & Related papers (2021-04-21T05:19:13Z)
- RNN Transducer Models For Spoken Language Understanding [49.07149742835825]
We show how RNN-T SLU models can be developed starting from pre-trained automatic speech recognition systems.
In settings where real audio data is not available, artificially synthesized speech is used to successfully adapt various SLU models.
arXiv Detail & Related papers (2021-04-08T15:35:22Z)
- Speech-language Pre-training for End-to-end Spoken Language Understanding [18.548949994603213]
We propose to unify a well-optimized E2E ASR encoder (speech) and a pre-trained language model encoder (language) into a transformer decoder.
The experimental results on two public corpora show that our approach to E2E SLU is superior to the conventional cascaded method.
arXiv Detail & Related papers (2021-02-11T21:55:48Z)
- Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining [64.35907499990455]
We propose a framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech.
Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT.
In parallel, we identify two essential criteria for evaluating SLU models: environmental noise-robustness and E2E semantics evaluation.
arXiv Detail & Related papers (2020-10-26T18:21:27Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)