GenAug: Data Augmentation for Finetuning Text Generators
- URL: http://arxiv.org/abs/2010.01794v2
- Date: Sat, 10 Oct 2020 06:00:03 GMT
- Title: GenAug: Data Augmentation for Finetuning Text Generators
- Authors: Steven Y. Feng, Varun Gangal, Dongyeop Kang, Teruko Mitamura, Eduard Hovy
- Abstract summary: We propose and evaluate various augmentation methods, including some that incorporate external knowledge, for finetuning GPT-2 on a subset of Yelp Reviews.
Our experiments demonstrate that insertion of character-level synthetic noise and keyword replacement with hypernyms are effective augmentation methods.
- Score: 21.96895115572357
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we investigate data augmentation for text generation, which we
call GenAug. Text generation and language modeling are important tasks within
natural language processing, and are especially challenging for low-data
regimes. We propose and evaluate various augmentation methods, including some
that incorporate external knowledge, for finetuning GPT-2 on a subset of Yelp
Reviews. We also examine the relationship between the amount of augmentation
and the quality of the generated text. We utilize several metrics that evaluate
important aspects of the generated text including its diversity and fluency.
Our experiments demonstrate that insertion of character-level synthetic noise
and keyword replacement with hypernyms are effective augmentation methods, and
that the quality of generations improves to a peak at approximately three times
the amount of original data.
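
As a rough illustration of the character-level synthetic noise idea, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the noise operations or rates, so the function names (`add_char_noise`, `augment`), the three edit operations, and the default `rate` are all assumptions made for this example. Reading "three times the amount of original data" as a total dataset size of 3x is also an interpretation.

```python
import random


def add_char_noise(text, rate=0.05, seed=None):
    """Return a copy of `text` with random character-level edits.

    Each alphabetic character is, with probability `rate`, deleted,
    duplicated, or swapped with its right neighbour. (Assumed operations;
    the abstract does not say which ones the paper uses.)
    """
    rng = random.Random(seed)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < rate:
            op = rng.choice(["delete", "duplicate", "swap"])
            if op == "delete":
                i += 1
                continue
            if op == "duplicate":
                out.extend([c, c])
                i += 1
                continue
            if op == "swap" and i + 1 < len(chars):
                out.extend([chars[i + 1], c])
                i += 2
                continue
        # No edit applied (or a swap was requested on the last character).
        out.append(c)
        i += 1
    return "".join(out)


def augment(reviews, multiplier=3, rate=0.05, seed=0):
    """Return the originals followed by noisy copies.

    The result holds `multiplier` times as many texts as `reviews`
    (assumed reading of "three times the amount of original data").
    """
    rng = random.Random(seed)
    augmented = list(reviews)
    for _ in range(multiplier - 1):
        for review in reviews:
            augmented.append(
                add_char_noise(review, rate, seed=rng.randrange(2**32))
            )
    return augmented


if __name__ == "__main__":
    sample = ["The food was great and the service was friendly."]
    for line in augment(sample, multiplier=3, rate=0.1):
        print(line)
```

The augmented list would then be used as the finetuning corpus in place of the original subset. The originals are kept unchanged at the front of the list so that the clean data is never lost.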
Related papers
- Decoding Decoded: Understanding Hyperparameter Effects in Open-Ended Text Generation [0.22499166814992438]
Decoding strategies for large language models (LLMs) are a critical but often underexplored aspect of text generation tasks.
We present a large-scale, comprehensive analysis of how hyperparameter selection affects text quality in open-ended text generation.
(2024-10-08T14:51:03Z)
- Topic-to-essay generation with knowledge-based content selection [1.0625748132006634]
We propose a novel copy mechanism model with a content selection module that integrates rich semantic knowledge from the language model into the decoder.
Experimental results demonstrate that the proposed model can improve the generated text diversity by 35% to 59% compared to the state-of-the-art method.
(2024-02-26T02:14:42Z)
- RankAug: Augmented data ranking for text classification [0.0]
RankAug is a text-ranking approach that detects and filters out the top augmented texts.
We demonstrate that the judicious selection of filtering techniques can yield a substantial improvement of up to 35% in classification accuracy for under-represented classes.
(2023-11-08T08:47:49Z)
- A Benchmark for Text Expansion: Datasets, Metrics, and Baselines [87.47745669317894]
This work presents a new task, Text Expansion (TE), which aims to insert fine-grained modifiers into the proper locations of a plain text.
We leverage four complementary approaches to construct a dataset with 12 million automatically generated instances and 2K human-annotated references.
On top of a pre-trained text-infilling model, we build both pipelined and joint Locate&Infill models, which outperform the Text2Text baselines.
(2023-09-17T07:54:38Z)
- On the Possibilities of AI-Generated Text Detection [76.55825911221434]
We argue that as machine-generated text approaches human-like quality, the sample size needed for detection increases.
We test various state-of-the-art text generators, including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against detectors including RoBERTa-Large/Base-Detector and GPTZero.
(2023-04-10T17:47:39Z)
- Sequentially Controlled Text Generation [97.22539956688443]
While GPT-2 generates sentences that are remarkably human-like, longer documents can ramble and do not follow human-like writing structure.
We study the problem of imposing structure on long-range text.
We develop a sequential controlled text generation pipeline with generation and editing.
(2023-01-05T21:23:51Z)
- A Benchmark Corpus for the Detection of Automatically Generated Text in Academic Publications [0.02578242050187029]
This paper presents two datasets comprised of artificially generated research content.
In the first case, the content is completely generated by the GPT-2 model after a short prompt extracted from original papers.
The partial or hybrid dataset is created by replacing several sentences of abstracts with sentences that are generated by the Arxiv-NLP model.
We evaluate the quality of the datasets by comparing the generated texts to aligned original texts using fluency metrics such as BLEU and ROUGE.
(2022-02-04T08:16:56Z)
- A Survey on Retrieval-Augmented Text Generation [53.04991859796971]
Retrieval-augmented text generation has remarkable advantages and has achieved state-of-the-art performance in many NLP tasks.
This survey first highlights the generic paradigm of retrieval-augmented generation, and then reviews notable approaches according to different tasks.
(2022-02-02T16:18:41Z)
- Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation [85.32991360774447]
Natural language generation (NLG) spans a broad range of tasks, each of which serves specific objectives.
We propose a unifying perspective based on the nature of information change in NLG tasks.
We develop a family of interpretable metrics that are suitable for evaluating key aspects of different NLG tasks.
(2021-09-14T01:00:42Z)
- Controllable Text Generation with Focused Variation [71.07811310799664]
Focused-Variation Network (FVN) is a novel model to control language generation.
FVN learns disjoint discrete latent spaces for each attribute inside codebooks, which allows for both controllability and diversity.
We evaluate FVN on two text generation datasets with annotated content and style, and show state-of-the-art performance as assessed by automatic and human evaluations.
(2020-09-25T06:31:06Z)