ZeroShotDataAug: Generating and Augmenting Training Data with ChatGPT
- URL: http://arxiv.org/abs/2304.14334v1
- Date: Thu, 27 Apr 2023 17:07:29 GMT
- Authors: Solomon Ubani, Suleyman Olcay Polat, Rodney Nielsen
- Abstract summary: We investigate the use of data obtained by prompting a large generative language model, ChatGPT, to generate synthetic training data with the aim of augmenting data in low-resource scenarios.
We show that with appropriate task-specific ChatGPT prompts, we outperform the most popular existing approaches for such data augmentation.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we investigate the use of data obtained by prompting a large
generative language model, ChatGPT, to generate synthetic training data with
the aim of augmenting data in low-resource scenarios. We show that with
appropriate task-specific ChatGPT prompts, we outperform the most popular
existing approaches for such data augmentation. Furthermore, we investigate
methodologies for evaluating the similarity of the augmented data generated
by ChatGPT to the original data, with the aim of validating and assessing the
quality of the generated data.
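The two ingredients described above, composing a task-specific prompt for the generative model and checking how similar the synthetic data is to the seed data, can be sketched as follows. This is a minimal illustration, not the paper's exact setup: the prompt template and the bag-of-words cosine check are assumptions, and a real pipeline would send the prompt to an LLM API and use stronger similarity measures.

```python
from collections import Counter
from math import sqrt

def build_augmentation_prompt(task, label, seed_examples, n=5):
    """Compose a task-specific prompt asking an LLM to generate
    synthetic training examples for one class (illustrative template)."""
    shots = "\n".join(f"- {s}" for s in seed_examples)
    return (
        f"Task: {task}\n"
        f"Here are training examples labelled '{label}':\n{shots}\n"
        f"Generate {n} new, diverse examples with the same label, one per line."
    )

def cosine_similarity(text_a, text_b):
    """Bag-of-words cosine similarity between two texts, a cheap proxy
    for how close a synthetic example is to the seed data."""
    ca, cb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(ca[tok] * cb[tok] for tok in ca)
    norm_a = sqrt(sum(v * v for v in ca.values()))
    norm_b = sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

In a low-resource setting, the prompt would be filled with the handful of labelled seed examples per class, and the similarity score could be used to flag synthetic examples that are near-duplicates of the seeds (too similar) or off-topic (too dissimilar).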
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but often fail to capture the correct correlation between the features and the target variable.
We propose a LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- Targeted synthetic data generation for tabular data via hardness characterization [0.0]
We introduce a novel augmentation pipeline that generates only high-value training points based on hardness characterization.
We show that synthetic data generators trained on the hardest points outperform non-targeted data augmentation on simulated data and on a large scale credit default prediction task.
arXiv Detail & Related papers (2024-10-01T14:54:26Z)
- A Comprehensive Survey on Data Augmentation [55.355273602421384]
Data augmentation is a technique that generates high-quality artificial data by manipulating existing data samples.
Existing literature surveys focus only on a specific data modality.
We propose a more enlightening taxonomy that encompasses data augmentation techniques for different common data modalities.
arXiv Detail & Related papers (2024-05-15T11:58:08Z)
- Benchmarking and Analyzing Generative Data for Visual Recognition [66.55174903469722]
This work delves into the impact of generative images, primarily comparing paradigms that harness external data.
We devise GenBench, a benchmark comprising 22 datasets with 2548 categories, to appraise generative data across various visual recognition tasks.
Our exhaustive benchmark and analysis spotlight generative data's promise in visual recognition, while identifying key challenges for future investigation.
arXiv Detail & Related papers (2023-07-25T17:59:59Z)
- Does Synthetic Data Generation of LLMs Help Clinical Text Mining? [51.205078179427645]
We investigate the potential of OpenAI's ChatGPT to aid in clinical text mining.
We propose a new training paradigm that involves generating a vast quantity of high-quality synthetic data.
Our method has resulted in significant improvements in the performance of downstream tasks.
arXiv Detail & Related papers (2023-03-08T03:56:31Z)
- AugGPT: Leveraging ChatGPT for Text Data Augmentation [59.76140039943385]
We propose a text data augmentation approach based on ChatGPT (named AugGPT).
AugGPT rephrases each sentence in the training samples into multiple conceptually similar but semantically different samples.
Experiment results on few-shot learning text classification tasks show the superior performance of the proposed AugGPT approach.
arXiv Detail & Related papers (2023-02-25T06:58:16Z)
- Data Augmentation for Intent Classification with Off-the-shelf Large Language Models [13.895236210726202]
We propose a prompting-based approach to generate labelled training data for intent classification with off-the-shelf language models.
We evaluate the proposed method in a few-shot setting on four diverse intent classification tasks.
arXiv Detail & Related papers (2022-04-05T03:29:26Z)
- Generative Conversational Networks [67.13144697969501]
We propose a framework called Generative Conversational Networks, in which conversational agents learn to generate their own labelled training data.
We show an average improvement of 35% in intent detection and 21% in slot tagging over a baseline model trained from the seed data.
arXiv Detail & Related papers (2021-06-15T23:19:37Z)
- Data Weighted Training Strategies for Grammatical Error Correction [8.370770440898454]
We show how to incorporate delta-log-perplexity, a type of example scoring, into a training schedule for Grammatical Error Correction (GEC).
Models trained on scored data achieve state-of-the-art results on common GEC test sets.
arXiv Detail & Related papers (2020-08-07T03:30:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.