Related papers: CoDa: Constrained Generation based Data Augmentation for Low-Resource NLP

CoDa: Constrained Generation based Data Augmentation for Low-Resource NLP

URL: http://arxiv.org/abs/2404.00415v1
Date: Sat, 30 Mar 2024 16:47:06 GMT
Title: CoDa: Constrained Generation based Data Augmentation for Low-Resource NLP
Authors: Chandra Kiran Reddy Evuru, Sreyan Ghosh, Sonal Kumar, Ramaneswaran S, Utkarsh Tyagi, Dinesh Manocha,
Abstract summary: CoDa is a controllable, effective, and training-free data augmentation technique for low-resource (data-scarce) NLP. Our approach is based on prompting off-the-shelf instruction-following Large Language Models. CoDa is the first framework that provides users explicit control over the augmentation generation process.
Score: 46.95923453967386
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present CoDa (Constrained Generation based Data Augmentation), a controllable, effective, and training-free data augmentation technique for low-resource (data-scarce) NLP. Our approach is based on prompting off-the-shelf instruction-following Large Language Models (LLMs) for generating text that satisfies a set of constraints. Precisely, we extract a set of simple constraints from every instance in the low-resource dataset and verbalize them to prompt an LLM to generate novel and diverse training instances. Our findings reveal that synthetic data that follows simple constraints in the downstream dataset act as highly effective augmentations, and CoDa can achieve this without intricate decoding-time constrained generation techniques or fine-tuning with complex algorithms that eventually make the model biased toward the small number of training instances. Additionally, CoDa is the first framework that provides users explicit control over the augmentation generation process, thereby also allowing easy adaptation to several domains. We demonstrate the effectiveness of CoDa across 11 datasets spanning 3 tasks and 3 low-resource settings. CoDa outperforms all our baselines, qualitatively and quantitatively, with improvements of 0.12%-7.19%. Code is available here: https://github.com/Sreyan88/CoDa

Related papers

Step-by-Step Mastery: Enhancing Soft Constraint Following Ability of Large Language Models [39.114513139453756]
It is crucial for large language models (LLMs) to follow instructions that involve multiple constraints. We design a pipeline to construct datasets with high-quality outputs automatically. To fully utilize the positive and negative samples generated during the data construction process, we choose Direct Preference Optimization (DPO) as the training method. We experimentally evaluate the effectiveness of our methods in improving LLMs' soft constraint following ability.
arXiv Detail & Related papers (2025-01-09T03:34:07Z)
Adaptive Dataset Quantization [2.0105434963031463]
We introduce a versatile framework for dataset compression, namely Adaptive dataset Quantization (ADQ) We propose a novel adaptive sampling strategy through the evaluation of generated bins' representativeness score, diversity score and importance score. Our method not only exhibits superior generalization capability across different architectures, but also attains state-of-the-art results, surpassing DQ by average 3% on various datasets.
arXiv Detail & Related papers (2024-12-22T07:08:29Z)
Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLM) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable. We propose a LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data. Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
Generating Synthetic Datasets for Few-shot Prompt Tuning [48.10054761841462]
In few-shot learning settings, prompt tuning lags far behind full-model fine-tuning, limiting its scope of application. In this paper, we leverage the powerful LLMs to synthesize task-specific labeled data for training the soft prompts. We train soft prompts on both synthetic and real datasets using a gradient surgery approach.
arXiv Detail & Related papers (2024-10-08T01:00:02Z)
Code Less, Align More: Efficient LLM Fine-tuning for Code Generation with Data Pruning [4.975728472540823]
We present techniques that integrate various clustering and pruning metrics to selectively reduce training data without compromising the accuracy and functionality of the generated code. Our experiments show that these pruning strategies not only reduce the computational resources needed but also enhance the overall quality code generation.
arXiv Detail & Related papers (2024-07-06T10:30:43Z)
Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
In low-resource settings, the amount of seed data samples to use for data augmentation is very small. We propose a novel method that augments training data by incorporating a wealth of examples from other datasets. This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
arXiv Detail & Related papers (2024-02-21T02:45:46Z)
STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models [56.27786433792638]
STAR is a data generation method that leverages Large Language Models (LLMs) to synthesize data instances. We design fine-grained step-by-step instructions to obtain the initial data instances. Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks.
arXiv Detail & Related papers (2023-05-24T12:15:19Z)
Mixture of Soft Prompts for Controllable Data Generation [21.84489422361048]
Mixture of Soft Prompts (MSP) is proposed as a tool for data augmentation rather than direct prediction. Our method achieves state-of-the-art results on three benchmarks when compared against strong baselines.
arXiv Detail & Related papers (2023-03-02T21:13:56Z)
SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation. Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z)
Learning to Count in the Crowd from Limited Labeled Data [109.2954525909007]
We focus on reducing the annotation efforts by learning to count in the crowd from limited number of labeled samples. Specifically, we propose a Gaussian Process-based iterative learning mechanism that involves estimation of pseudo-ground truth for the unlabeled data.
arXiv Detail & Related papers (2020-07-07T04:17:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.