Data Augmentation for Spoken Language Understanding via Pretrained Language Models
- URL: http://arxiv.org/abs/2004.13952v2
- Date: Thu, 11 Mar 2021 01:36:00 GMT
- Title: Data Augmentation for Spoken Language Understanding via Pretrained Language Models
- Authors: Baolin Peng, Chenguang Zhu, Michael Zeng, Jianfeng Gao
- Abstract summary: Training of spoken language understanding (SLU) models often faces the problem of data scarcity.
We put forward a data augmentation method using pretrained language models to boost the variability and accuracy of generated utterances.
- Score: 113.56329266325902
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The training of spoken language understanding (SLU) models often faces the
problem of data scarcity. In this paper, we put forward a data augmentation
method using pretrained language models to boost the variability and accuracy
of generated utterances. Furthermore, we investigate and propose solutions to
two previously overlooked semi-supervised learning scenarios of data scarcity
in SLU: i) Rich-in-Ontology: ontology information with numerous valid dialogue
acts is given; ii) Rich-in-Utterance: a large number of unlabelled utterances
are available. Empirical results show that our method can produce synthetic
training data that boosts the performance of language understanding models in
various scenarios.
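The core idea and the Rich-in-Ontology scenario can be pictured with a minimal sketch (these helper names are hypothetical, not the paper's implementation): dialogue acts from the ontology are serialized into a conditioning prompt for a pretrained language model, and the model's sampled continuations become synthetic labelled utterances.

```python
# Minimal sketch of LM-based utterance augmentation. The prompt format,
# helper names, and "toy_lm" stand-in are assumptions for illustration;
# the paper finetunes a real pretrained language model.

def serialize_dialogue_acts(acts):
    """Flatten (intent, slot, value) dialogue acts into a conditioning prompt."""
    parts = [f"{intent}({slot}={value})" for intent, slot, value in acts]
    return "generate utterance: " + " ; ".join(parts)

def generate_utterance(prompt, language_model):
    """Stub: a real system would sample from a pretrained LM here."""
    return language_model(prompt)

# Rich-in-Ontology: many valid dialogue acts are given, but few utterances.
acts = [("inform", "food", "italian"), ("request", "address", "?")]
prompt = serialize_dialogue_acts(acts)
# prompt == "generate utterance: inform(food=italian) ; request(address=?)"

# A toy "language model" standing in for, e.g., a finetuned GPT-2:
toy_lm = lambda p: "I'd like Italian food; what is the address?"
synthetic_utterance = generate_utterance(prompt, toy_lm)
```

The generated (utterance, dialogue-act) pairs can then be mixed into the training set of the downstream SLU model.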
Related papers
- Learning Phonotactics from Linguistic Informants [54.086544221761486]
Our model iteratively selects or synthesizes a data-point according to one of a range of information-theoretic policies.
We find that the information-theoretic policies that our model uses to select items to query the informant achieve sample efficiency comparable to, or greater than, fully supervised approaches.
arXiv Detail & Related papers (2024-05-08T00:18:56Z)
- An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
- Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z)
- Learning from Multiple Noisy Augmented Data Sets for Better Cross-Lingual Spoken Language Understanding [69.40915115518523]
Lack of training data presents a grand challenge to scaling out spoken language understanding (SLU) to low-resource languages.
Various data augmentation approaches have been proposed to synthesize training data in low-resource target languages.
In this paper we focus on mitigating noise in augmented data.
arXiv Detail & Related papers (2021-09-03T15:44:15Z)
- Augmenting Slot Values and Contexts for Spoken Language Understanding with Pretrained Models [45.477765875738115]
Spoken Language Understanding (SLU) is an essential step in building a dialogue system.
Due to the high cost of obtaining labeled data, SLU suffers from the data scarcity problem.
We propose two strategies for the finetuning process: value-based and context-based augmentation.
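The value-based strategy can be illustrated with a simple slot-substitution sketch (a hedged, rule-based toy with invented names; the paper itself finetunes pretrained models rather than substituting strings directly): new labelled utterances are produced by swapping a slot value for other valid values of the same slot.

```python
def value_based_augment(utterance, slot, value, alternatives):
    """Create new labelled utterances by replacing one slot value with
    other valid values for that slot (a simplified, rule-based sketch)."""
    augmented = []
    for alt in alternatives:
        if alt != value and value in utterance:
            augmented.append((utterance.replace(value, alt), {slot: alt}))
    return augmented

pairs = value_based_augment(
    "find me italian food nearby",
    slot="food", value="italian",
    alternatives=["chinese", "thai", "italian"],
)
# Two new (utterance, label) pairs, one per distinct alternative value.
```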
arXiv Detail & Related papers (2021-08-19T02:52:40Z)
- AuGPT: Dialogue with Pre-trained Language Models and Data Augmentation [0.0]
We introduce modified training objectives for language model finetuning.
We employ massive data augmentation via back-translation to increase the diversity of the training data.
Our model achieves state-of-the-art performance on the MultiWOZ data and shows competitive performance in human evaluation.
arXiv Detail & Related papers (2021-02-09T20:53:34Z)
- Training Data Leakage Analysis in Language Models [6.843491191969066]
We introduce a methodology for identifying user content in the training data that could be leaked under a strong and realistic threat model.
We propose two metrics to quantify user-level data leakage by measuring a model's ability to produce unique sentence fragments within training data.
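One way to picture such a metric (a hedged sketch; the paper's exact metrics differ and these function names are invented): count how many n-grams emitted by the model occur in the training data of exactly one user, since reproducing those unique fragments would leak that user's content.

```python
from collections import Counter

def ngrams(text, n=3):
    """Set of word n-grams in a whitespace-tokenized text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def unique_fragment_leakage(model_output, user_docs, n=3):
    """Fraction of the model's n-grams that occur in exactly one user's
    training documents -- a toy proxy for user-level leakage."""
    owner_count = Counter()
    for doc in user_docs.values():
        for gram in ngrams(doc, n):
            owner_count[gram] += 1
    unique_to_one_user = {g for g, c in owner_count.items() if c == 1}
    output_grams = ngrams(model_output, n)
    return len(output_grams & unique_to_one_user) / max(len(output_grams), 1)

docs = {
    "alice": "my secret account number is 12345 ok",
    "bob": "the weather is nice today my friend",
}
score = unique_fragment_leakage("secret account number is 12345", docs)
# Every trigram of the output appears only in alice's data, so score == 1.0.
```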
arXiv Detail & Related papers (2021-01-14T00:57:32Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Analysis of Predictive Coding Models for Phonemic Representation Learning in Small Datasets [0.0]
The present study investigates the behaviour of two predictive coding models, Autoregressive Predictive Coding and Contrastive Predictive Coding, in a phoneme discrimination task.
Our experiments show a strong correlation between the autoregressive loss and the phoneme discrimination scores with the two datasets.
The CPC model shows rapid convergence already after one pass over the training data, and, on average, its representations outperform those of APC on both languages.
arXiv Detail & Related papers (2020-07-08T15:46:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.