GenAug: Data Augmentation for Finetuning Text Generators
- URL: http://arxiv.org/abs/2010.01794v2
- Date: Sat, 10 Oct 2020 06:00:03 GMT
- Title: GenAug: Data Augmentation for Finetuning Text Generators
- Authors: Steven Y. Feng, Varun Gangal, Dongyeop Kang, Teruko Mitamura, Eduard Hovy
- Abstract summary: We propose and evaluate various augmentation methods, including some that incorporate external knowledge, for finetuning GPT-2 on a subset of Yelp Reviews.
Our experiments demonstrate that insertion of character-level synthetic noise and keyword replacement with hypernyms are effective augmentation methods.
- Score: 21.96895115572357
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we investigate data augmentation for text generation, which we
call GenAug. Text generation and language modeling are important tasks within
natural language processing, and are especially challenging for low-data
regimes. We propose and evaluate various augmentation methods, including some
that incorporate external knowledge, for finetuning GPT-2 on a subset of Yelp
Reviews. We also examine the relationship between the amount of augmentation
and the quality of the generated text. We utilize several metrics that evaluate
important aspects of the generated text including its diversity and fluency.
Our experiments demonstrate that insertion of character-level synthetic noise
and keyword replacement with hypernyms are effective augmentation methods, and
that the quality of generations improves to a peak at approximately three times
the amount of original data.
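The two augmentations the abstract highlights are simple enough to sketch in code. Below is a minimal, illustrative Python version, not the authors' released implementation: the function names, the noise and replacement probabilities, and the WordNet lookup strategy are assumptions made here, and GenAug's actual keyword-selection procedure may differ.

```python
import random

import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)  # fetch WordNet data on first run; no-op afterwards

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def add_char_noise(text: str, p: float = 0.05) -> str:
    """Character-level synthetic noise: random substitutions, deletions,
    and insertions, each governed by probability p (an assumed value)."""
    out = []
    for ch in text:
        r = random.random()
        if r < p:                 # substitute this character
            out.append(random.choice(ALPHABET))
        elif r < 2 * p:           # delete this character
            continue
        else:                     # keep it unchanged
            out.append(ch)
        if random.random() < p:   # occasionally insert a stray character
            out.append(random.choice(ALPHABET))
    return "".join(out)

def hypernym_replace(text: str, p: float = 0.2) -> str:
    """Replace some nouns with a WordNet hypernym, e.g. 'pizza' -> 'dish'."""
    out = []
    for tok in text.split():
        synsets = wordnet.synsets(tok, pos=wordnet.NOUN)
        if synsets and random.random() < p:
            hypernyms = synsets[0].hypernyms()
            if hypernyms:
                out.append(hypernyms[0].lemmas()[0].name().replace("_", " "))
                continue
        out.append(tok)
    return " ".join(out)

review = "the pizza was delicious and the service was quick"
print(add_char_noise(review))    # noisy surface form of the review
print(hypernym_replace(review))  # e.g. "the dish was delicious ..."
```

Consistent with the finding that generation quality peaks at roughly three times the original data, a practical recipe would be to produce about two augmented variants of each original review and finetune on the combined set.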
Related papers
- Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation.
We introduce novel methodologies and datasets to overcome these challenges.
We propose MhBART, an encoder-decoder model designed to emulate human writing style.
We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z)
- Topic-to-essay generation with knowledge-based content selection [1.0625748132006634]
We propose a novel copy mechanism model with a content selection module that integrates rich semantic knowledge from the language model into the decoder.
Experimental results demonstrate that the proposed model can improve the generated text diversity by 35% to 59% compared to the state-of-the-art method.
arXiv Detail & Related papers (2024-02-26T02:14:42Z)
- RankAug: Augmented data ranking for text classification [0.0]
RankAug is a text-ranking approach that detects and filters out the top augmented texts.
We demonstrate that the judicious selection of filtering techniques can yield a substantial improvement of up to 35% in classification accuracy for under-represented classes; a generic sketch of the filtering idea follows this entry.
arXiv Detail & Related papers (2023-11-08T08:47:49Z)
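The rank-then-filter idea can be illustrated generically. The sketch below is an assumption-laden stand-in, not RankAug's method: it scores augmented candidates with a simple lexical-similarity measure from Python's difflib, whereas RankAug uses a learned text-ranking approach, and keeping the most similar candidates is only one plausible filtering criterion.

```python
import difflib

def filter_augmented(original: str, candidates: list[str], keep: int = 3) -> list[str]:
    """Rank augmented candidates by lexical similarity to the original text
    and keep the top `keep`. difflib is an illustrative stand-in scorer;
    a real system would use a learned ranker."""
    ranked = sorted(
        candidates,
        key=lambda cand: difflib.SequenceMatcher(None, original, cand).ratio(),
        reverse=True,  # most similar first; invert to discard near-duplicates instead
    )
    return ranked[:keep]

original = "the pizza was delicious and the service was friendly"
candidates = [
    "the meal was delicious and the staff was friendly",
    "the pizza was tasty and the service was quick",
    "an unrelated sentence about the weather",
]
print(filter_augmented(original, candidates, keep=2))
```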
- On the Possibilities of AI-Generated Text Detection [76.55825911221434]
We argue that as machine-generated text approaches human-like quality, the sample size needed for reliable detection increases.
We test various state-of-the-art text generators, including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against detectors including RoBERTa-Large/Base-Detector and GPTZero.
arXiv Detail & Related papers (2023-04-10T17:47:39Z)
- Sequentially Controlled Text Generation [97.22539956688443]
While GPT-2 generates sentences that are remarkably human-like, longer documents can ramble and fail to follow human-like writing structure.
We study the problem of imposing structure on long-range text.
We develop a sequential controlled text generation pipeline with generation and editing.
arXiv Detail & Related papers (2023-01-05T21:23:51Z)
- A Benchmark Corpus for the Detection of Automatically Generated Text in Academic Publications [0.02578242050187029]
This paper presents two datasets of artificially generated research content.
In the first case, the content is completely generated by the GPT-2 model after a short prompt extracted from original papers.
The partial or hybrid dataset is created by replacing several sentences of abstracts with sentences generated by the Arxiv-NLP model.
We evaluate the quality of the datasets by comparing the generated texts to aligned original texts using fluency metrics such as BLEU and ROUGE.
arXiv Detail & Related papers (2022-02-04T08:16:56Z)
- A Survey on Retrieval-Augmented Text Generation [53.04991859796971]
Retrieval-augmented text generation has remarkable advantages and has achieved state-of-the-art performance in many NLP tasks.
The survey first presents the generic paradigm of retrieval-augmented generation, then reviews notable approaches for different tasks.
arXiv Detail & Related papers (2022-02-02T16:18:41Z)
- Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation [85.32991360774447]
Natural language generation (NLG) spans a broad range of tasks, each serving specific objectives.
We propose a unifying perspective based on the nature of information change in NLG tasks.
We develop a family of interpretable metrics that are suitable for evaluating key aspects of different NLG tasks.
arXiv Detail & Related papers (2021-09-14T01:00:42Z)
- From Machine Translation to Code-Switching: Generating High-Quality Code-Switched Text [14.251949110756078]
We adapt a state-of-the-art neural machine translation model to generate Hindi-English code-switched sentences.
We show significant reductions in perplexity on a language modeling task.
We also show improvements using our text for a downstream code-switched natural language inference task.
arXiv Detail & Related papers (2021-07-14T04:46:39Z)
- Controllable Text Generation with Focused Variation [71.07811310799664]
Focused-Variation Network (FVN) is a novel model for controlling language generation.
FVN learns disjoint discrete latent spaces for each attribute inside codebooks, which allows for both controllability and diversity.
We evaluate FVN on two text generation datasets with annotated content and style, and show state-of-the-art performance as assessed by automatic and human evaluations.
arXiv Detail & Related papers (2020-09-25T06:31:06Z)