AuGPT: Dialogue with Pre-trained Language Models and Data Augmentation
- URL: http://arxiv.org/abs/2102.05126v1
- Date: Tue, 9 Feb 2021 20:53:34 GMT
- Title: AuGPT: Dialogue with Pre-trained Language Models and Data Augmentation
- Authors: Jonáš Kulhánek, Vojtěch Hudeček, Tomáš Nekvinda, and Ondřej Dušek
- Abstract summary: We introduce modified training objectives for language model finetuning.
We employ massive data augmentation via back-translation to increase the diversity of the training data.
Our model achieves state-of-the-art performance on the MultiWOZ data and shows competitive performance in human evaluation.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention-based pre-trained language models such as GPT-2 brought
considerable progress to end-to-end dialogue modelling. However, they also
present considerable risks for task-oriented dialogue, such as lack of
knowledge grounding or diversity. To address these issues, we introduce
modified training objectives for language model finetuning, and we employ
massive data augmentation via back-translation to increase the diversity of the
training data. We further examine the possibilities of combining data from
multiple sources to improve performance on the target dataset. We carefully
evaluate our contributions with both human and automatic methods. Our model
achieves state-of-the-art performance on the MultiWOZ data and shows
competitive performance in human evaluation.
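To make the augmentation step concrete, below is a minimal back-translation sketch. It assumes the Helsinki-NLP MarianMT checkpoints from Hugging Face transformers and a single French pivot; these choices are illustrative, not the paper's exact setup.

```python
# Back-translation augmentation sketch (illustrative checkpoints and pivot;
# not AuGPT's exact setup). Requires: transformers, sentencepiece, torch.
from transformers import MarianMTModel, MarianTokenizer

def load_mt(model_name):
    return MarianTokenizer.from_pretrained(model_name), MarianMTModel.from_pretrained(model_name)

def translate(sentences, tokenizer, model):
    batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    # Sampling (rather than greedy decoding) yields more diverse paraphrases.
    generated = model.generate(**batch, do_sample=True, top_k=50)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

def back_translate(sentences, pivot="fr"):
    """Paraphrase utterances by round-tripping through a pivot language."""
    fwd_tok, fwd = load_mt(f"Helsinki-NLP/opus-mt-en-{pivot}")
    bwd_tok, bwd = load_mt(f"Helsinki-NLP/opus-mt-{pivot}-en")
    return translate(translate(sentences, fwd_tok, fwd), bwd_tok, bwd)

if __name__ == "__main__":
    print(back_translate(["i need a cheap restaurant in the centre of town"]))
```

In practice, round-trips that drift too far from the original utterance (e.g., by embedding similarity) would typically be filtered out before being added to the training data.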
Related papers
- Unsupervised Data Validation Methods for Efficient Model Training (arXiv, 2024-10-10)
State-of-the-art models in natural language processing (NLP), text-to-speech (TTS), speech-to-text (STT) and vision-language models (VLM) rely heavily on large datasets.
This research explores key areas such as defining "quality data," developing methods for generating appropriate data and enhancing accessibility to model training.
- Towards Better Instruction Following Language Models for Chinese: Investigating the Impact of Training Data and Evaluation (arXiv, 2023-04-16)
We examine the influence of training data factors, including quantity, quality, and linguistic distribution, on model performance.
We assess various models using an evaluation set of 1,000 samples, encompassing nine real-world scenarios.
We extend the vocabulary of LLaMA - the model with the closest open-source performance to proprietary language models like GPT-3.
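Extending a pretrained model's vocabulary, as done there for LLaMA, amounts to adding tokens and resizing the embedding matrix. A minimal sketch with the Hugging Face API, using GPT-2 and placeholder tokens rather than the paper's actual model and Chinese vocabulary:

```python
# Vocabulary-extension sketch (stand-in model and placeholder tokens; the
# paper's LLaMA checkpoint and Chinese vocabulary are not reproduced here).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

num_added = tokenizer.add_tokens(["你好", "谢谢"])  # placeholder additions

# Grow the (tied) embedding matrix to match the new vocabulary; the new rows
# are randomly initialised and must be learned during finetuning.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```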
- OCHADAI at SemEval-2022 Task 2: Adversarial Training for Multilingual Idiomaticity Detection (arXiv, 2022-06-07)
We propose a multilingual adversarial training model for determining whether a sentence contains an idiomatic expression.
Our model relies on pre-trained contextual representations from different multi-lingual state-of-the-art transformer-based language models.
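The summary does not spell out the adversarial scheme; one common recipe for text classifiers is an FGM-style perturbation of the embedding layer along the gradient direction. A generic sketch of that recipe, not necessarily the OCHADAI system's exact method:

```python
# FGM-style adversarial training sketch (a generic recipe, assumed here;
# `model(**batch)` is taken to return classification logits).
def fgm_step(model, loss_fn, batch, labels, embedding_layer, epsilon=1.0):
    """One step of clean + adversarial loss; call optimizer.step() and
    optimizer.zero_grad() afterwards."""
    # 1) Clean forward/backward pass to obtain embedding gradients.
    loss_fn(model(**batch), labels).backward()

    # 2) Perturb the embedding matrix along its normalised gradient.
    grad = embedding_layer.weight.grad
    if grad is not None and grad.norm() > 0:
        delta = epsilon * grad / grad.norm()
        embedding_layer.weight.data.add_(delta)

        # 3) The adversarial pass accumulates gradients against the
        #    perturbed embeddings, then the perturbation is removed.
        loss_fn(model(**batch), labels).backward()
        embedding_layer.weight.data.sub_(delta)
```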
- A Model-Agnostic Data Manipulation Method for Persona-based Dialogue Generation (arXiv, 2022-04-21)
It is expensive to scale up current persona-based dialogue datasets.
Each data sample in this task is more complex to learn with than conventional dialogue data.
We propose a data manipulation method that is model-agnostic and can be combined with any persona-based dialogue generation model.
- Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation (arXiv, 2022-01-31)
The Cross-lingual Outline-based Dialogue dataset (COD) enables natural language understanding, dialogue state tracking, and end-to-end dialogue modelling and evaluation in four diverse languages.
- Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density (arXiv, 2021-11-02)
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
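Feature Density itself is cheap to compute. The sketch below uses one simple formulation, assumed here rather than quoted from the paper: the number of distinct features divided by the total number of feature occurrences in the corpus.

```python
# Feature Density sketch (the exact definition used by the paper may differ;
# this ratio of distinct features to total occurrences is an assumption).
from sklearn.feature_extraction.text import CountVectorizer

def feature_density(corpus, ngram_range=(1, 1)):
    vectorizer = CountVectorizer(ngram_range=ngram_range)
    counts = vectorizer.fit_transform(corpus)
    n_unique = len(vectorizer.vocabulary_)  # distinct features
    n_total = counts.sum()                  # total feature occurrences
    return n_unique / n_total

docs = ["you are great", "you are not great at all"]
print(feature_density(docs))           # word unigrams
print(feature_density(docs, (1, 2)))   # unigrams + bigrams
```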
- Training Data Leakage Analysis in Language Models (arXiv, 2021-01-14)
We introduce a methodology for identifying user content in the training data that could be leaked under a strong and realistic threat model.
We propose two metrics that quantify user-level data leakage by measuring a model's ability to reproduce unique sentence fragments from the training data.
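The following toy illustration, not the paper's actual metric, computes the fraction of model-generated n-grams that occur in exactly one user's training data:

```python
# Toy user-level leakage proxy (illustrative only).
from collections import Counter

def ngrams(text, n=8):
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def unique_fragment_rate(generations, user_corpora, n=8):
    # Count how many distinct users each training n-gram appears under.
    owners = Counter()
    for user_text in user_corpora.values():
        for gram in ngrams(user_text, n):
            owners[gram] += 1
    produced = set().union(*(ngrams(g, n) for g in generations))
    leaked = [g for g in produced if owners.get(g, 0) == 1]
    return len(leaked) / max(len(produced), 1)
```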
- Unsupervised Paraphrasing with Pretrained Language Models (arXiv, 2020-10-24)
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
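Dynamic Blocking, as we read the idea, discourages verbatim copying: when decoding emits a token that occurs in the source, the token that immediately follows it in the source is blocked at the next step. A simplified sketch as a Hugging Face LogitsProcessor; the authors' version additionally samples which tokens to block:

```python
# Simplified Dynamic Blocking sketch (our reading of the idea, not the
# authors' implementation).
from transformers import LogitsProcessor

class DynamicBlockingProcessor(LogitsProcessor):
    def __init__(self, source_ids):
        # Map each source token id to the id(s) immediately following it.
        self.successors = {}
        for cur, nxt in zip(source_ids, source_ids[1:]):
            self.successors.setdefault(cur, set()).add(nxt)

    def __call__(self, input_ids, scores):
        # Forbid the source successor(s) of the last generated token,
        # forcing the model to find an alternative surface form.
        for row in range(input_ids.shape[0]):
            last = input_ids[row, -1].item()
            for blocked in self.successors.get(last, ()):
                scores[row, blocked] = -float("inf")
        return scores
```

The processor would be passed to model.generate via logits_processor=LogitsProcessorList([DynamicBlockingProcessor(source_ids)]).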
- Enhancing Dialogue Generation via Multi-Level Contrastive Learning (arXiv, 2020-09-19)
We propose a multi-level contrastive learning paradigm to model the fine-grained quality of the responses with respect to the query.
A Rank-aware (RC) network is designed to construct the multi-level contrastive optimization objectives.
We build a Knowledge Inference (KI) component to capture the keyword knowledge from the reference during training and exploit such information to encourage the generation of informative words.
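The objective is left abstract in the summary; a generic margin-based loss over responses ranked by quality conveys the multi-level idea. This is a sketch of the general technique, not the paper's exact RC objective:

```python
# Generic ranked-contrastive loss sketch (illustrative; not the paper's
# exact objective). Responses are ordered best-first.
import torch
import torch.nn.functional as F

def ranked_contrastive_loss(query_vec, response_vecs, margin=0.1):
    # Each response should score at least `margin` above the next-ranked one.
    scores = F.cosine_similarity(query_vec.unsqueeze(0), response_vecs)
    loss = torch.zeros(())
    for better, worse in zip(scores, scores[1:]):
        loss = loss + F.relu(margin - (better - worse))
    return loss

q = torch.randn(16)      # query embedding
rs = torch.randn(4, 16)  # four candidate responses, ranked best-first
print(ranked_contrastive_loss(q, rs))
```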
- Data Augmentation for Spoken Language Understanding via Pretrained Language Models (arXiv, 2020-04-29)
Training of spoken language understanding (SLU) models often faces the problem of data scarcity.
We put forward a data augmentation method using pretrained language models to boost the variability and accuracy of generated utterances.
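One common instantiation, assumed here for illustration, is to prompt (or finetune) a causal LM on intent-to-utterance pairs and sample new surface forms; the paper's actual conditioning and finetuning setup are not reproduced, and a vanilla GPT-2 would need SLU finetuning to produce useful utterances.

```python
# LM-based utterance augmentation sketch (hypothetical prompt format and
# stand-in model; an SLU-finetuned LM is assumed in practice).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def augment(intent, n=3):
    prompt = f"intent: {intent}\nutterance:"  # assumed prompt format
    outputs = generator(prompt, max_new_tokens=20, do_sample=True,
                        top_p=0.9, num_return_sequences=n)
    return [o["generated_text"][len(prompt):].strip() for o in outputs]

print(augment("book_flight"))
```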
This list is automatically generated from the titles and abstracts of the papers on this site.