Unnatural Language Processing: Bridging the Gap Between Synthetic and
Natural Language Data
- URL: http://arxiv.org/abs/2004.13645v1
- Date: Tue, 28 Apr 2020 16:41:00 GMT
- Title: Unnatural Language Processing: Bridging the Gap Between Synthetic and
Natural Language Data
- Authors: Alana Marzoev, Samuel Madden, M. Frans Kaashoek, Michael Cafarella,
Jacob Andreas
- Abstract summary: We introduce a technique for ``simulation-to-real'' transfer in language understanding problems.
Our approach matches or outperforms state-of-the-art models trained on natural language data in several domains.
- Score: 37.542036032277466
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large, human-annotated datasets are central to the development of natural
language processing models. Collecting these datasets can be the most
challenging part of the development process. We address this problem by
introducing a general purpose technique for ``simulation-to-real'' transfer in
language understanding problems with a delimited set of target behaviors,
making it possible to develop models that can interpret natural utterances
without natural training data. We begin with a synthetic data generation
procedure, and train a model that can accurately interpret utterances produced
by the data generator. To generalize to natural utterances, we automatically
find projections of natural language utterances onto the support of the
synthetic language, using learned sentence embeddings to define a distance
metric. With only synthetic training data, our approach matches or outperforms
state-of-the-art models trained on natural language data in several domains.
These results suggest that simulation-to-real transfer is a practical framework
for developing NLP applications, and that improved models for transfer might
provide wide-ranging improvements in downstream tasks.
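To make the projection step concrete, here is a minimal sketch of the nearest-neighbor projection onto the synthetic support, assuming the sentence-transformers library as the embedding model; the model name, the toy grammar output, and the logical forms are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Toy output of a synthetic data generator: (utterance, logical form) pairs.
# Both the utterances and the logical forms are made up for illustration.
synthetic_data = [
    ("show me flights from boston to denver", "flights(src=BOS, dst=DEN)"),
    ("list restaurants near the airport", "restaurants(near=airport)"),
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
synth_emb = model.encode([u for u, _ in synthetic_data],
                         normalize_embeddings=True)

def interpret(natural_utterance: str) -> str:
    """Project a natural utterance onto the support of the synthetic
    language via nearest neighbor in embedding space, then reuse the
    interpretation of the matched synthetic utterance."""
    q = model.encode([natural_utterance], normalize_embeddings=True)[0]
    nearest = int(np.argmax(synth_emb @ q))  # cosine similarity (unit vectors)
    return synthetic_data[nearest][1]

print(interpret("what flights go from boston to denver?"))
# -> flights(src=BOS, dst=DEN)
```

The design choice worth noting: the natural utterance is never interpreted directly; it is first replaced by its closest synthetic neighbor, so the downstream model only ever sees inputs from the distribution it was trained on.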
Related papers
- ViANLI: Adversarial Natural Language Inference for Vietnamese [1.907126872483548]
We introduce ViANLI, an adversarial NLI dataset for Vietnamese, to the NLP research community.
The dataset contains more than 10K premise-hypothesis pairs.
The accuracy of the most powerful model on the test set only reached 48.4%.
arXiv Detail & Related papers (2024-06-25T16:58:19Z)
- Controlled Randomness Improves the Performance of Transformer Models [4.678970068275123]
We introduce controlled randomness, i.e. noise, into the training process to improve the fine-tuning of language models.
We find that adding such noise can improve performance on our two downstream tasks: joint named entity recognition and relation extraction, and text summarization.
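As a rough illustration (not the paper's exact recipe), the sketch below perturbs token embeddings with Gaussian noise during fine-tuning; the injection point, the noise scale, and the HuggingFace-style model interface are all assumptions made for this example.

```python
import torch

def noisy_forward(model, input_ids, attention_mask, noise_std=1e-3):
    """Forward pass with Gaussian noise added to the token embeddings.
    Assumes a HuggingFace-style model exposing get_input_embeddings()
    and accepting inputs_embeds."""
    embeds = model.get_input_embeddings()(input_ids)
    if model.training and noise_std > 0:
        # Controlled randomness: small perturbation, training time only.
        embeds = embeds + noise_std * torch.randn_like(embeds)
    return model(inputs_embeds=embeds, attention_mask=attention_mask)
```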
arXiv Detail & Related papers (2023-10-20T14:12:55Z)
- Multi-Scales Data Augmentation Approach In Natural Language Inference For Artifacts Mitigation And Pre-Trained Model Optimization [0.0]
We provide a variety of techniques for analyzing and locating dataset artifacts inside the crowdsourced Stanford Natural Language Inference corpus.
To mitigate dataset artifacts, we employ a unique multi-scale data augmentation technique with two distinct frameworks.
Our combined method enhances our model's resistance to perturbation testing, enabling it to consistently outperform the pre-trained baseline.
arXiv Detail & Related papers (2022-12-16T23:37:44Z)
- Exploring Transitivity in Neural NLI Models through Veridicality [39.845425535943534]
We focus on the transitivity of inference relations, a fundamental property for systematically drawing inferences.
A model capturing transitivity can compose basic inference patterns and draw new inferences.
We find that current NLI models do not perform consistently well on transitivity inference tasks.
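To make the property concrete, a minimal consistency check: if a model labels A => B and B => C as entailment, a transitive model should also label A => C as entailment. The `predict` function below is a hypothetical stand-in for any NLI classifier.

```python
from typing import Optional

def check_transitivity(predict, a: str, b: str, c: str) -> Optional[bool]:
    """If the model predicts a => b and b => c, a transitive model
    should also predict a => c. `predict(premise, hypothesis)` is a
    hypothetical stand-in returning an NLI label string."""
    if predict(a, b) == "entailment" and predict(b, c) == "entailment":
        return predict(a, c) == "entailment"
    return None  # the two predictions do not form a transitive chain
```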
arXiv Detail & Related papers (2021-01-26T11:18:35Z)
- SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and can be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z)
- Pre-Training a Language Model Without Human Language [74.11825654535895]
We study how the intrinsic nature of pre-training data contributes to the fine-tuned downstream performance.
We find that models pre-trained on unstructured data outperform those trained directly from scratch on downstream tasks.
Surprisingly, we find that pre-training on certain non-human language data yields GLUE performance close to that of pre-training on another non-English language.
arXiv Detail & Related papers (2020-12-22T13:38:06Z)
- Reprogramming Language Models for Molecular Representation Learning [65.00999660425731]
We propose Representation Reprogramming via Dictionary Learning (R2DL) for adversarially reprogramming pretrained language models for molecular learning tasks.
The adversarial program learns a linear transformation between a dense source model input space (language data) and a sparse target model input space (e.g., chemical and biological molecule data) using a k-SVD solver.
R2DL matches the baseline established by state-of-the-art toxicity prediction models trained on domain-specific data, and outperforms that baseline in a limited training-data setting.
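A hedged sketch of the reprogramming step: each target-domain token embedding is expressed as a sparse linear combination of the source model's language-token embeddings. scikit-learn's SparseCoder (orthogonal matching pursuit) stands in for the paper's k-SVD solver here, and the embedding matrices are random placeholders.

```python
import numpy as np
from sklearn.decomposition import SparseCoder

rng = np.random.default_rng(0)
# Random placeholders: rows are token embeddings. The language vocabulary
# acts as the dictionary; the molecule vocabulary is the set of signals.
source_emb = rng.standard_normal((5000, 128))
source_emb /= np.linalg.norm(source_emb, axis=1, keepdims=True)
target_emb = rng.standard_normal((40, 128))

coder = SparseCoder(dictionary=source_emb,
                    transform_algorithm="omp",
                    transform_n_nonzero_coefs=8)
theta = coder.transform(target_emb)   # sparse codes, shape (40, 5000)
reprogrammed = theta @ source_emb     # target tokens mapped into source space
```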
arXiv Detail & Related papers (2020-12-07T05:50:27Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
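A simplified sketch of the blocking idea (the paper's Dynamic Blocking samples which tokens to block; the deterministic rule below is an illustrative reduction): when the decoder has just emitted a source token, suppress the source token that would continue a verbatim copy, forcing the model to paraphrase.

```python
import torch

def apply_dynamic_blocking(logits, source_ids, generated_ids):
    """Suppress the source token that would extend a verbatim copy:
    if the last generated token equals source_ids[i], block
    source_ids[i + 1] at this decoding step. `logits` is a 1-D torch
    tensor over the vocabulary; the id sequences are lists of ints."""
    if not generated_ids:
        return logits
    last = generated_ids[-1]
    for i in range(len(source_ids) - 1):
        if source_ids[i] == last:
            logits[source_ids[i + 1]] = float("-inf")
    return logits
```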
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- Data Augmentation for Spoken Language Understanding via Pretrained Language Models [113.56329266325902]
Training of spoken language understanding (SLU) models often faces the problem of data scarcity.
We put forward a data augmentation method using pretrained language models to boost the variability and accuracy of generated utterances.
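As a rough sketch of the approach, one could prompt an off-the-shelf LM for utterance variants and reuse the seed utterance's labels; the paper instead fine-tunes the pretrained LM on the SLU training data, so the prompt format and model choice below are assumptions.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # assumed model choice

def augment(utterance: str, n: int = 3) -> list:
    """Generate n paraphrase-like variants of a seed utterance; the
    variants inherit the seed's intent/slot labels downstream."""
    prompt = f"Paraphrase: {utterance} =>"
    outputs = generator(prompt, max_new_tokens=20, do_sample=True,
                        top_p=0.9, num_return_sequences=n)
    # The pipeline returns the prompt plus the continuation; strip the prompt.
    return [o["generated_text"][len(prompt):].strip() for o in outputs]
```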
arXiv Detail & Related papers (2020-04-29T04:07:12Z)
- Stochastic Natural Language Generation Using Dependency Information [0.7995360025953929]
This article presents a corpus-based model for generating natural language text.
Our model encodes dependency relations from training data through a feature set, then produces a new dependency tree for a given meaning representation.
We show that our model produces utterances that score well on informativeness, naturalness, and overall quality.
arXiv Detail & Related papers (2020-01-12T09:40:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.