ALP: Data Augmentation using Lexicalized PCFGs for Few-Shot Text
Classification
- URL: http://arxiv.org/abs/2112.11916v1
- Date: Thu, 16 Dec 2021 09:56:35 GMT
- Title: ALP: Data Augmentation using Lexicalized PCFGs for Few-Shot Text
Classification
- Authors: Hazel Kim, Daecheol Woo, Seong Joon Oh, Jeong-Won Cha, Yo-Sub Han
- Abstract summary: We present the data augmentation using Lexicalized Probabilistic context-free grammars (ALP)
Experiments on few-shot text classification tasks demonstrate that ALP enhances many state-of-the-art classification methods.
We argue empirically that the traditional splitting of training and validation sets is sub-optimal compared to our novel augmentation-based splitting strategies.
- Score: 11.742065170002162
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Data augmentation has been an important ingredient for boosting performances
of learned models. Prior data augmentation methods for few-shot text
classification have led to great performance boosts. However, they have not
been designed to capture the intricate compositional structure of natural
language. As a result, they fail to generate samples with plausible and diverse
sentence structures. Motivated by this, we present the data Augmentation using
Lexicalized Probabilistic context-free grammars (ALP) that generates augmented
samples with diverse syntactic structures with plausible grammar. The
lexicalized PCFG parse trees consider both the constituents and dependencies to
produce a syntactic frame that maximizes a variety of word choices in a
syntactically preservable manner without specific domain experts. Experiments
on few-shot text classification tasks demonstrate that ALP enhances many
state-of-the-art classification methods. As a second contribution, we delve
into the train-val splitting methodologies when a data augmentation method
comes into play. We argue empirically that the traditional splitting of
training and validation sets is sub-optimal compared to our novel
augmentation-based splitting strategies that further expand the training split
with the same number of labeled data. Taken together, our contributions on the
data augmentation strategies yield a strong training recipe for few-shot text
classification tasks.
Related papers
- Evaluating LLM Prompts for Data Augmentation in Multi-label Classification of Ecological Texts [1.565361244756411]
Large language models (LLMs) play a crucial role in natural language processing (NLP) tasks.
This study applied prompt-based data augmentation to detect mentions of green practices in Russian social media.
arXiv Detail & Related papers (2024-11-22T12:37:41Z) - Pointer-Guided Pre-Training: Infusing Large Language Models with Paragraph-Level Contextual Awareness [3.2925222641796554]
"pointer-guided segment ordering" (SO) is a novel pre-training technique aimed at enhancing the contextual understanding of paragraph-level text representations.
Our experiments show that pointer-guided pre-training significantly enhances the model's ability to understand complex document structures.
arXiv Detail & Related papers (2024-06-06T15:17:51Z) - Boosting Event Extraction with Denoised Structure-to-Text Augmentation [52.21703002404442]
Event extraction aims to recognize pre-defined event triggers and arguments from texts.
Recent data augmentation methods often neglect the problem of grammatical incorrectness.
We propose a denoised structure-to-text augmentation framework for event extraction DAEE.
arXiv Detail & Related papers (2023-05-16T16:52:07Z) - Selective Text Augmentation with Word Roles for Low-Resource Text
Classification [3.4806267677524896]
Different words may play different roles in text classification, which inspires us to strategically select the proper roles for text augmentation.
In this work, we first identify the relationships between the words in a text and the text category from the perspectives of statistical correlation and semantic similarity.
We present a new augmentation technique called STA (Selective Text Augmentation) where different text-editing operations are selectively applied to words with specific roles.
arXiv Detail & Related papers (2022-09-04T08:13:11Z) - TreeMix: Compositional Constituency-based Data Augmentation for Natural
Language Understanding [56.794981024301094]
We propose a compositional data augmentation approach for natural language understanding called TreeMix.
Specifically, TreeMix leverages constituency parsing tree to decompose sentences into constituent sub-structures and the Mixup data augmentation technique to recombine them to generate new sentences.
Compared with previous approaches, TreeMix introduces greater diversity to the samples generated and encourages models to learn compositionality of NLP data.
arXiv Detail & Related papers (2022-05-12T15:25:12Z) - To Augment or Not to Augment? A Comparative Study on Text Augmentation
Techniques for Low-Resource NLP [0.0]
We investigate three categories of text augmentation methodologies which perform changes on the syntax.
We compare them on part-of-speech tagging, dependency parsing and semantic role labeling for a diverse set of language families.
Our results suggest that the augmentation techniques can further improve over strong baselines based on mBERT.
arXiv Detail & Related papers (2021-11-18T10:52:48Z) - Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions on their positions.
Experiments on Semantic Textual Similarity show NDD to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
arXiv Detail & Related papers (2021-10-04T03:59:15Z) - Learning to Synthesize Data for Semantic Parsing [57.190817162674875]
We propose a generative model which models the composition of programs and maps a program to an utterance.
Due to the simplicity of PCFG and pre-trained BART, our generative model can be efficiently learned from existing data at hand.
We evaluate our method in both in-domain and out-of-domain settings of text-to-Query parsing on the standard benchmarks of GeoQuery and Spider.
arXiv Detail & Related papers (2021-04-12T21:24:02Z) - SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z) - A Comparative Study on Structural and Semantic Properties of Sentence
Embeddings [77.34726150561087]
We propose a set of experiments using a widely-used large-scale data set for relation extraction.
We show that different embedding spaces have different degrees of strength for the structural and semantic properties.
These results provide useful information for developing embedding-based relation extraction methods.
arXiv Detail & Related papers (2020-09-23T15:45:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.