NL-Augmenter: A Framework for Task-Sensitive Natural Language
Augmentation
- URL: http://arxiv.org/abs/2112.02721v1
- Date: Mon, 6 Dec 2021 00:37:59 GMT
- Title: NL-Augmenter: A Framework for Task-Sensitive Natural Language
Augmentation
- Authors: Kaustubh D. Dhole, Varun Gangal, Sebastian Gehrmann, Aadesh Gupta,
Zhenhao Li, Saad Mahamood, Abinaya Mahendiran, Simon Mille, Ashish
Srivastava, Samson Tan, Tongshuang Wu, Jascha Sohl-Dickstein, Jinho D. Choi,
Eduard Hovy, Ondrej Dusek, Sebastian Ruder, Sajant Anand, Nagender Aneja,
Rabin Banjade, Lisa Barthe, Hanna Behnke, Ian Berlot-Attwell, Connor Boyle,
Caroline Brun, Marco Antonio Sobrevilla Cabezudo, Samuel Cahyawijaya, Emile
Chapuis, Wanxiang Che, Mukund Choudhary, Christian Clauss, Pierre Colombo,
Filip Cornell, Gautier Dagan, Mayukh Das, Tanay Dixit, Thomas Dopierre,
Paul-Alexis Dray, Suchitra Dubey, Tatiana Ekeinhor, Marco Di Giovanni,
Rishabh Gupta, Rishabh Gupta, Louanes Hamla, Sang Han, Fabrice Harel-Canada,
Antoine Honore, Ishan Jindal, Przemyslaw K. Joniak, Denis Kleyko, Venelin
Kovatchev, Kalpesh Krishna, Ashutosh Kumar, Stefan Langer, Seungjae Ryan Lee,
Corey James Levinson, Hualou Liang, Kaizhao Liang, Zhexiong Liu, Andrey
Lukyanenko, Vukosi Marivate, Gerard de Melo, Simon Meoni, Maxime Meyer, Afnan
Mir, Nafise Sadat Moosavi, Niklas Muennighoff, Timothy Sum Hon Mun, Kenton
Murray, Marcin Namysl, Maria Obedkova, Priti Oli, Nivranshu Pasricha, Jan
Pfister, Richard Plant, Vinay Prabhu, Vasile Pais, Libo Qin, Shahab Raji,
Pawan Kumar Rajpoot, Vikas Raunak, Roy Rinberg, Nicolas Roberts, Juan Diego
Rodriguez, Claude Roux, Vasconcellos P. H. S., Ananya B. Sai, Robin M.
Schmidt, Thomas Scialom, Tshephisho Sefara, Saqib N. Shamsi, Xudong Shen,
Haoyue Shi, Yiwen Shi, Anna Shvets, Nick Siegel, Damien Sileo, Jamie Simon,
Chandan Singh, Roman Sitelew, Priyank Soni, Taylor Sorensen, William Soto,
Aman Srivastava, KV Aditya Srivatsa, Tony Sun, Mukund Varma T, A Tabassum,
Fiona Anting Tan, Ryan Teehan, Mo Tiwari, Marie Tolkiehn, Athena Wang, Zijian
Wang, Gloria Wang, Zijie J. Wang, Fuxuan Wei, Bryan Wilie, Genta Indra
Winata, Xinyi Wu, Witold Wydmański, Tianbao Xie, Usama Yaseen, M. Yee, Jing
Zhang, Yue Zhang
- Abstract summary: We present NL-Augmenter, a new participatory Python-based natural language augmentation framework.
We describe the framework and an initial set of 117 transformations and 23 filters for a variety of natural language tasks.
We demonstrate the efficacy of NL-Augmenter by using several of its transformations to analyze the robustness of popular natural language models.
- Score: 91.97706178867439
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data augmentation is an important component in the robustness evaluation of
models in natural language processing (NLP) and in enhancing the diversity of
the data they are trained on. In this paper, we present NL-Augmenter, a new
participatory Python-based natural language augmentation framework which
supports the creation of both transformations (modifications to the data) and
filters (data splits according to specific features). We describe the framework
and an initial set of 117 transformations and 23 filters for a variety of
natural language tasks. We demonstrate the efficacy of NL-Augmenter by using
several of its transformations to analyze the robustness of popular natural
language models. The infrastructure, datacards and robustness analysis results
are available publicly on the NL-Augmenter repository
(https://github.com/GEM-benchmark/NL-Augmenter).
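To make the transformation/filter distinction concrete, below is a minimal, self-contained Python sketch of that pattern. The base classes, method signatures, and the keyboard-typo perturbation are illustrative stand-ins, not NL-Augmenter's actual interface; see the repository above for the real base classes and the full set of 117 transformations and 23 filters.

```python
import random
from typing import List


class SentenceTransformation:
    """Illustrative base class: a transformation returns modified copies of an example."""

    def generate(self, sentence: str) -> List[str]:
        raise NotImplementedError


class SentenceFilter:
    """Illustrative base class: a filter decides whether an example belongs to a data split."""

    def filter(self, sentence: str) -> bool:
        raise NotImplementedError


class KeyboardTypoTransformation(SentenceTransformation):
    """Perturbs characters with keyboard neighbours to simulate typing noise."""

    neighbours = {"a": "qs", "e": "wr", "i": "uo", "o": "ip", "u": "yj"}

    def __init__(self, prob: float = 0.1, seed: int = 0):
        self.prob = prob
        self.rng = random.Random(seed)

    def generate(self, sentence: str) -> List[str]:
        chars = [
            self.rng.choice(self.neighbours[c])
            if c in self.neighbours and self.rng.random() < self.prob
            else c
            for c in sentence
        ]
        return ["".join(chars)]


class SentenceLengthFilter(SentenceFilter):
    """Keeps only examples whose whitespace token count lies in a given range."""

    def __init__(self, min_tokens: int = 3, max_tokens: int = 50):
        self.min_tokens, self.max_tokens = min_tokens, max_tokens

    def filter(self, sentence: str) -> bool:
        return self.min_tokens <= len(sentence.split()) <= self.max_tokens


if __name__ == "__main__":
    transform = KeyboardTypoTransformation(prob=0.2, seed=42)
    keep = SentenceLengthFilter(min_tokens=3, max_tokens=12)
    example = "Data augmentation helps evaluate model robustness."
    if keep.filter(example):
        print(transform.generate(example))
```

In NL-Augmenter itself, each transformation and filter also declares the tasks and languages it applies to, which is what makes the collection task-sensitive.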
Related papers
- Improving Generalization in Semantic Parsing by Increasing Natural
Language Variation [67.13483734810852]
In this work, we use data augmentation to enhance the robustness of text-to-SQL semantic parsing.
We leverage the capabilities of large language models to generate more realistic and diverse questions.
Using only a few prompts, we achieve a two-fold increase in the number of questions in Spider.
arXiv Detail & Related papers (2024-02-13T18:48:23Z)
- Improving Domain-Specific Retrieval by NLI Fine-Tuning [64.79760042717822]
This article investigates the fine-tuning potential of natural language inference (NLI) data to improve information retrieval and ranking.
We employ both monolingual and multilingual sentence encoders fine-tuned by a supervised method utilizing contrastive loss and NLI data.
Our results show that NLI fine-tuning improves performance on both tasks and in both languages, suggesting it can benefit both monolingual and multilingual models.
arXiv Detail & Related papers (2023-08-06T12:40:58Z)
- BRENT: Bidirectional Retrieval Enhanced Norwegian Transformer [1.911678487931003]
Retrieval-based language models are increasingly employed in question-answering tasks.
We develop the first Norwegian retrieval-based model by adapting the REALM framework.
We show that this type of training improves the reader's performance on extractive question-answering.
arXiv Detail & Related papers (2023-04-19T13:40:47Z)
- Multi-Scales Data Augmentation Approach In Natural Language Inference For Artifacts Mitigation And Pre-Trained Model Optimization [0.0]
We provide a variety of techniques for analyzing and locating dataset artifacts inside the crowdsourced Stanford Natural Language Inference corpus.
To mitigate dataset artifacts, we employ a unique multi-scale data augmentation technique with two distinct frameworks.
Our combined method enhances the model's resistance to perturbation testing, enabling it to consistently outperform the pre-trained baseline.
arXiv Detail & Related papers (2022-12-16T23:37:44Z)
- Exploring the State-of-the-Art Language Modeling Methods and Data Augmentation Techniques for Multilingual Clause-Level Morphology [3.8498574327875947]
We present our work on all three parts of the shared task: inflection, reinflection, and analysis.
We mainly explore two approaches: Transformer models in combination with data augmentation, and exploiting the state-of-the-art language modeling techniques for morphological analysis.
Our methods achieved first place in each of the three tasks and outperform the mT5 baseline by 89% for inflection, 80% for reinflection, and 12% for analysis.
arXiv Detail & Related papers (2022-11-03T11:53:39Z)
- WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation [101.00109827301235]
We introduce a novel paradigm for dataset creation based on human and machine collaboration.
We use dataset cartography to automatically identify examples that demonstrate challenging reasoning patterns, and instruct GPT-3 to compose new examples with similar patterns.
The resulting dataset, WANLI, consists of 108,357 natural language inference (NLI) examples that present unique empirical strengths.
arXiv Detail & Related papers (2022-01-16T03:13:49Z)
- Enhancing deep neural networks with morphological information [0.0]
We analyse the effect of adding morphological features to LSTM and BERT models.
Our results suggest that adding morphological features has mixed effects depending on the quality of features and the task.
arXiv Detail & Related papers (2020-11-24T22:35:44Z)
- Exploring Software Naturalness through Neural Language Models [56.1315223210742]
The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing.
We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks.
arXiv Detail & Related papers (2020-06-22T21:56:14Z)
- Dynamic Data Selection and Weighting for Iterative Back-Translation [116.14378571769045]
We propose a curriculum learning strategy for iterative back-translation models.
We evaluate our models on domain adaptation, low-resource, and high-resource MT settings.
Experimental results demonstrate that our methods achieve improvements of up to 1.8 BLEU points over competitive baselines.
arXiv Detail & Related papers (2020-04-07T19:49:58Z)
- Stochastic Natural Language Generation Using Dependency Information [0.7995360025953929]
This article presents a corpus-based model for generating natural language text.
Our model encodes dependency relations from training data through a feature set, then produces a new dependency tree for a given meaning representation.
We show that our model produces utterances that score well on informativeness, naturalness, and overall quality.
arXiv Detail & Related papers (2020-01-12T09:40:11Z)