LINDA: Unsupervised Learning to Interpolate in Natural Language
Processing
- URL: http://arxiv.org/abs/2112.13969v1
- Date: Tue, 28 Dec 2021 02:56:41 GMT
- Title: LINDA: Unsupervised Learning to Interpolate in Natural Language
Processing
- Authors: Yekyung Kim, Seohyeong Jeong, Kyunghyun Cho
- Abstract summary: "Learning to INterpolate for Data Augmentation" (LINDA) is an unsupervised learning approach to text interpolation for the purpose of data augmentation.
LINDA learns to interpolate between any pair of natural language sentences over a natural language manifold.
We show that LINDA indeed allows us to seamlessly apply mixup in NLP and leads to better generalization in text classification, both in-domain and out-of-domain.
- Score: 46.080523939647385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the success of mixup in data augmentation, its applicability to
natural language processing (NLP) tasks has been limited due to the discrete
and variable-length nature of natural languages. Recent studies have thus
relied on domain-specific heuristics and manually crafted resources, such as
dictionaries, in order to apply mixup in NLP. In this paper, we instead propose
an unsupervised learning approach to text interpolation for the purpose of data
augmentation, which we refer to as "Learning to INterpolate for Data
Augmentation" (LINDA). LINDA does not require any heuristics or manually
crafted resources but learns to interpolate between any pair of natural
language sentences over a natural language manifold. After empirically
demonstrating LINDA's interpolation capability, we show that LINDA indeed
allows us to seamlessly apply mixup in NLP and leads to better generalization
in text classification both in-domain and out-of-domain.
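To make the mixup recipe above concrete, here is a minimal sketch in Python of how interpolation-based augmentation could be wired into a text-classification data pipeline. The `interpolate_texts` function is a hypothetical stand-in for a learned interpolator such as LINDA (it is not the authors' released API); the label mixing follows the standard mixup formulation, with the mixing coefficient drawn from a Beta distribution.

```python
# Minimal sketch of mixup-style augmentation for text classification with a
# learned interpolator. `interpolate_texts` is a hypothetical placeholder for
# a model such as LINDA, NOT the authors' released API.
import numpy as np


def interpolate_texts(sent_a: str, sent_b: str, lam: float) -> str:
    """Stand-in for a learned text interpolator (e.g. a seq2seq model that
    decodes a sentence 'between' the two inputs). To keep the sketch
    runnable, it simply returns the sentence closest to the mixing weight."""
    return sent_a if lam >= 0.5 else sent_b


def mixup_example(sent_a, label_a, sent_b, label_b, num_classes,
                  beta=0.4, rng=None):
    """Produce one augmented (text, soft-label) pair in the mixup style."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(beta, beta)                    # mixing coefficient
    x_tilde = interpolate_texts(sent_a, sent_b, lam)
    y_a = np.eye(num_classes)[label_a]            # one-hot labels
    y_b = np.eye(num_classes)[label_b]
    y_tilde = lam * y_a + (1.0 - lam) * y_b       # soft, interpolated label
    return x_tilde, y_tilde


# Example: augment a pair of 3-class sentiment examples.
x, y = mixup_example("the film was great", 2, "a dull, tedious movie", 0,
                     num_classes=3)
print(x, y)
```

Training on the resulting (text, soft-label) pairs with a cross-entropy loss against the interpolated labels is what gives mixup its regularizing effect; LINDA's contribution is to replace the trivial interpolator above with one learned over a natural language manifold.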
Related papers
- UniPSDA: Unsupervised Pseudo Semantic Data Augmentation for Zero-Shot Cross-Lingual Natural Language Understanding [31.272603877215733]
Cross-lingual representation learning transfers knowledge from resource-rich data to resource-scarce data to improve the semantic understanding abilities of different languages.
We propose an Unsupervised Pseudo Semantic Data Augmentation (UniPSDA) mechanism for cross-lingual natural language understanding to enrich the training data without human interventions.
arXiv Detail & Related papers (2024-06-24T07:27:01Z)
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- Improving Natural Language Inference in Arabic using Transformer Models and Linguistically Informed Pre-Training [0.34998703934432673]
This paper addresses the classification of Arabic text data in the field of Natural Language Processing (NLP).
To overcome this limitation, we create a dedicated data set from publicly available resources.
We find that a language-specific model (AraBERT) performs competitively with state-of-the-art multilingual approaches.
arXiv Detail & Related papers (2023-07-27T07:40:11Z)
- A Multi-level Supervised Contrastive Learning Framework for Low-Resource Natural Language Inference [54.678516076366506]
Natural Language Inference (NLI) is a growingly essential task in natural language understanding.
Here we propose a multi-level supervised contrastive learning framework named MultiSCL for low-resource natural language inference.
arXiv Detail & Related papers (2022-05-31T05:54:18Z)
- Natural Language Generation Using Link Grammar for General Conversational Intelligence [0.0]
We propose a new technique to automatically generate grammatically valid sentences using the Link Grammar database.
This natural language generation method far outperforms current state-of-the-art baselines and may serve as the final component in a proto-AGI question answering pipeline.
arXiv Detail & Related papers (2021-04-19T06:16:07Z)
- Structured Prediction as Translation between Augmented Natural Languages [109.50236248762877]
We propose a new framework, Translation between Augmented Natural Languages (TANL), to solve many structured prediction language tasks.
Instead of tackling the problem by training task-specific discriminative models, we frame it as a translation task between augmented natural languages.
Our approach can match or outperform task-specific models on all tasks, and in particular, achieves new state-of-the-art results on joint entity and relation extraction.
arXiv Detail & Related papers (2021-01-14T18:32:21Z)
- Cross-lingual Dependency Parsing as Domain Adaptation [48.69930912510414]
Cross-lingual transfer learning is as essential as in-domain learning.
We exploit a pre-training task that extracts universal features without supervision.
We combine the traditional self-training and the two pre-training tasks.
arXiv Detail & Related papers (2020-12-24T08:14:36Z)
- OCNLI: Original Chinese Natural Language Inference [21.540733910984006]
We present the first large-scale NLI dataset for Chinese (consisting of 56,000 annotated sentence pairs), called the Original Chinese Natural Language Inference dataset (OCNLI).
Unlike recent attempts at extending NLI to other languages, our dataset does not rely on any automatic translation or non-expert annotation.
We establish several baseline results on our dataset using state-of-the-art pre-trained models for Chinese, and find even the best performing models to be far outpaced by human performance.
arXiv Detail & Related papers (2020-10-12T04:25:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.