PromDA: Prompt-based Data Augmentation for Low-Resource NLU Tasks
- URL: http://arxiv.org/abs/2202.12499v1
- Date: Fri, 25 Feb 2022 05:09:27 GMT
- Title: PromDA: Prompt-based Data Augmentation for Low-Resource NLU Tasks
- Authors: Yufei Wang, Can Xu, Qingfeng Sun, Huang Hu, Chongyang Tao, Xiubo Geng,
Daxin Jiang
- Abstract summary: This paper focuses on data augmentation for low-resource Natural Language Understanding (NLU) tasks.
We propose the Prompt-based Data Augmentation model (PromDA), which trains only a small-scale Soft Prompt.
PromDA generates synthetic data via two different views and filters out the low-quality data using NLU models.
- Score: 61.51515750218049
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper focuses on data augmentation for low-resource Natural Language
Understanding (NLU) tasks. We propose the Prompt-based Data Augmentation model
(PromDA), which trains only a small-scale Soft Prompt (i.e., a set of trainable
vectors) within frozen Pre-trained Language Models (PLMs). This avoids human
effort in collecting unlabeled in-domain data and maintains the quality of
generated synthetic data. In addition, PromDA generates synthetic data via two
different views and filters out the low-quality data using NLU models.
Experiments on four benchmarks show that the synthetic data produced by PromDA
successfully boost the performance of NLU models, which consistently
outperform several competitive baseline models, including a state-of-the-art
semi-supervised model that uses unlabeled in-domain data. The synthetic data from
PromDA are also complementary to unlabeled in-domain data, and the NLU models
can be further improved when the two are combined for training.
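For illustration, here is a minimal sketch of the soft-prompt setup the abstract describes: a small set of trainable prompt vectors is prepended to the input embeddings of a frozen PLM, so only the prompt is updated during training. The class name, prompt length, and the assumption of a Hugging Face-style seq2seq PLM (e.g., T5) are illustrative choices, not details taken from the paper.

```python
# A minimal sketch (not the paper's exact code) of soft-prompt tuning:
# only a small matrix of prompt vectors is trained, while the PLM stays frozen.
# Assumes a Hugging Face-style seq2seq model that accepts `inputs_embeds`
# and `labels`; names and sizes are illustrative.
import torch
import torch.nn as nn

class SoftPromptGenerator(nn.Module):
    def __init__(self, plm, n_prompt_tokens=100):
        super().__init__()
        self.plm = plm
        for p in self.plm.parameters():          # freeze every PLM weight
            p.requires_grad = False
        dim = self.plm.get_input_embeddings().embedding_dim
        # The only trainable parameters: the Soft Prompt vectors.
        self.soft_prompt = nn.Parameter(torch.randn(n_prompt_tokens, dim) * 0.02)

    def forward(self, input_ids, labels=None):
        tok = self.plm.get_input_embeddings()(input_ids)
        prompt = self.soft_prompt.unsqueeze(0).expand(tok.size(0), -1, -1)
        # Prepend the prompt to the token embeddings and run the frozen PLM.
        return self.plm(inputs_embeds=torch.cat([prompt, tok], dim=1),
                        labels=labels)

# Usage (illustrative):
#   from transformers import T5ForConditionalGeneration
#   model = SoftPromptGenerator(T5ForConditionalGeneration.from_pretrained("t5-base"))
#   opt = torch.optim.AdamW([model.soft_prompt], lr=1e-3)
```

The paper's dual-view generation and the NLU-model filtering of low-quality samples would sit on top of a generator like this; both are omitted from the sketch.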
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but they do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
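The summary above leaves the three improvements unspecified, so the following is only a generic sketch of the core idea: condition generation on the class label so that sampled features respect the feature-class correlation. `complete` is a hypothetical stand-in for any LLM completion call.

```python
# Generic sketch of label-conditioned tabular generation with an LLM.
# NOT the paper's method: it only illustrates conditioning on the class label
# so that feature-class correlations carry over into the synthetic rows.
import random

def row_to_text(row, columns):
    return ", ".join(f"{c} is {v}" for c, v in zip(columns, row))

def build_prompt(real_rows, columns, target_class, k=5):
    # Few-shot prompt: k real rows of the target class, then ask for a new row.
    pool = [r for r in real_rows if r[-1] == target_class]
    shots = "\n".join(row_to_text(r, columns) for r in random.sample(pool, k))
    return (f"Records with label '{target_class}':\n{shots}\n"
            f"One more record with label '{target_class}':")

# synthetic_row = complete(build_prompt(rows, cols, "positive"))  # hypothetical LLM call
```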
- Can Medical Vision-Language Pre-training Succeed with Purely Synthetic Data? [8.775988650381397]
Training medical vision-language pre-training models requires datasets with paired, high-quality image-text data.
Recent advancements in Large Language Models have made it possible to generate large-scale synthetic image-text pairs.
We propose an automated pipeline to build a diverse, high-quality synthetic dataset.
arXiv Detail & Related papers (2024-10-17T13:11:07Z)
- Generating Synthetic Datasets for Few-shot Prompt Tuning [48.10054761841462]
In few-shot learning settings, prompt tuning lags far behind full-model fine-tuning, limiting its scope of application.
In this paper, we leverage powerful LLMs to synthesize task-specific labeled data for training the soft prompts.
We train soft prompts on both synthetic and real datasets using a gradient surgery approach.
arXiv Detail & Related papers (2024-10-08T01:00:02Z)
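As a rough illustration of gradient surgery, the sketch below follows the common PCGrad-style recipe: when the synthetic-data gradient conflicts with the real-data gradient (negative dot product), its conflicting component is projected out before the update. The paper's exact variant may differ.

```python
# Minimal PCGrad-style gradient surgery sketch: combine gradients from the
# real and synthetic losses, removing the part of the synthetic gradient
# that directly opposes the real one. Illustrative, not the paper's code.
import torch

def gradient_surgery(g_real: torch.Tensor, g_syn: torch.Tensor) -> torch.Tensor:
    """Combine flattened gradients from the real and synthetic losses."""
    dot = torch.dot(g_syn, g_real)
    if dot < 0:  # conflict: project out the component of g_syn opposing g_real
        g_syn = g_syn - dot / (g_real.norm() ** 2 + 1e-12) * g_real
    return g_real + g_syn
```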
- Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
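The summary does not spell out the unlearning objective, so the following is a hedged sketch of one common recipe: ascend the loss on flagged low-quality Q-A pairs while descending on trusted data, so the model forgets the flawed patterns without collapsing.

```python
# A generic unlearning step (illustrative, not necessarily the paper's method):
# gradient *ascent* on flawed synthetic Q-A pairs, gradient descent on trusted
# data. `model(**batch).loss` assumes a Hugging Face-style model interface.
def unlearning_step(model, optimizer, flawed_batch, trusted_batch, alpha=0.1):
    optimizer.zero_grad()
    loss_forget = model(**flawed_batch).loss   # loss on flawed synthetic pairs
    loss_retain = model(**trusted_batch).loss  # loss on data we want to keep
    (loss_retain - alpha * loss_forget).backward()
    optimizer.step()
```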
- Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
In low-resource settings, the number of seed examples available for data augmentation is very small.
We propose a novel method that augments training data by incorporating a wealth of examples from other datasets.
This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
arXiv Detail & Related papers (2024-02-21T02:45:46Z)
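A minimal sketch of the retrieval step, assuming any off-the-shelf sentence encoder has already embedded the seed examples and the external pool; cosine similarity then selects the pool examples closest to a seed. Function and variable names are illustrative.

```python
# Retrieval-augmented augmentation sketch: pull the external examples most
# similar to the few seeds, to enrich the augmentation context.
import numpy as np

def retrieve(seed_vecs: np.ndarray, pool_vecs: np.ndarray, k: int = 10):
    """Return indices of the k pool examples most similar to any seed."""
    seed = seed_vecs / np.linalg.norm(seed_vecs, axis=1, keepdims=True)
    pool = pool_vecs / np.linalg.norm(pool_vecs, axis=1, keepdims=True)
    sims = pool @ seed.T        # (n_pool, n_seed) cosine similarities
    best = sims.max(axis=1)     # each pool example's best match to any seed
    return np.argsort(-best)[:k]
```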
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
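A sketch of the ensemble idea, under the assumption that DGE approximates the posterior by training several generators with independent seeds and pooling their samples; `train_generator` and `.sample` are hypothetical stand-ins for any deep generative model API.

```python
# Deep-generative-ensemble sketch: downstream training then averages over
# an approximation of the posterior on generator parameters. Illustrative only.
def deep_generative_ensemble(real_data, train_generator, n_models=5, n_per_model=1000):
    synthetic = []
    for seed in range(n_models):
        gen = train_generator(real_data, seed=seed)  # independently trained member
        synthetic.extend(gen.sample(n_per_model))    # pool this member's samples
    return synthetic
```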
- Practical Knowledge Distillation: Using DNNs to Beat DNNs [8.121769391666547]
We explore data and model distillation, as well as data denoising.
These techniques improve both gradient-boosting models and a specialized DNN architecture.
For an industry end-to-end real-time ML platform with 4M production inferences per second, we develop a model-training workflow based on data sampling.
Empirical evaluation shows that the proposed combination of methods consistently improves model accuracy over prior best models across several production applications deployed worldwide.
arXiv Detail & Related papers (2023-02-23T22:53:02Z)
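As a concrete reference point for the model-distillation ingredient, here is a standard temperature-scaled distillation loss (a textbook recipe, not necessarily the paper's exact setup): the student matches the teacher's softened distribution while also fitting the hard labels.

```python
# Standard knowledge-distillation loss: KL between temperature-softened
# teacher and student distributions, blended with the hard-label loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                 # standard temperature scaling of the KD term
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```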
- DQI: Measuring Data Quality in NLP [22.54066527822898]
We introduce a generic formula for Data Quality Index (DQI) to help dataset creators create datasets free of unwanted biases.
We show that models trained on the renovated SNLI dataset generalize better to out-of-distribution tasks.
arXiv Detail & Related papers (2020-05-02T12:34:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences of its use.