GenAug: Data Augmentation for Finetuning Text Generators
- URL: http://arxiv.org/abs/2010.01794v2
- Date: Sat, 10 Oct 2020 06:00:03 GMT
- Title: GenAug: Data Augmentation for Finetuning Text Generators
- Authors: Steven Y. Feng, Varun Gangal, Dongyeop Kang, Teruko Mitamura, Eduard Hovy
- Abstract summary: We propose and evaluate various augmentation methods, including some that incorporate external knowledge, for finetuning GPT-2 on a subset of Yelp Reviews.
Our experiments demonstrate that insertion of character-level synthetic noise and keyword replacement with hypernyms are effective augmentation methods.
- Score: 21.96895115572357
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we investigate data augmentation for text generation, which we
call GenAug. Text generation and language modeling are important tasks within
natural language processing, and are especially challenging for low-data
regimes. We propose and evaluate various augmentation methods, including some
that incorporate external knowledge, for finetuning GPT-2 on a subset of Yelp
Reviews. We also examine the relationship between the amount of augmentation
and the quality of the generated text. We utilize several metrics that evaluate
important aspects of the generated text including its diversity and fluency.
Our experiments demonstrate that insertion of character-level synthetic noise
and keyword replacement with hypernyms are effective augmentation methods, and
that the quality of generations improves to a peak at approximately three times
the amount of original data.
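The two augmentations the abstract highlights are simple enough to sketch in code. Below is a minimal, illustrative Python version, not the authors' released implementation: the function names, the noise and replacement probabilities, and the WordNet lookup strategy are assumptions made here, and GenAug's actual keyword-selection procedure may differ.

```python
import random

import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)  # fetch WordNet data on first run; no-op afterwards

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def add_char_noise(text: str, p: float = 0.05) -> str:
    """Character-level synthetic noise: random substitutions, deletions,
    and insertions, each governed by probability p (an assumed value)."""
    out = []
    for ch in text:
        r = random.random()
        if r < p:                 # substitute this character
            out.append(random.choice(ALPHABET))
        elif r < 2 * p:           # delete this character
            continue
        else:                     # keep it unchanged
            out.append(ch)
        if random.random() < p:   # occasionally insert a stray character
            out.append(random.choice(ALPHABET))
    return "".join(out)

def hypernym_replace(text: str, p: float = 0.2) -> str:
    """Replace some nouns with a WordNet hypernym, e.g. 'pizza' -> 'dish'."""
    out = []
    for tok in text.split():
        synsets = wordnet.synsets(tok, pos=wordnet.NOUN)
        if synsets and random.random() < p:
            hypernyms = synsets[0].hypernyms()
            if hypernyms:
                out.append(hypernyms[0].lemmas()[0].name().replace("_", " "))
                continue
        out.append(tok)
    return " ".join(out)

review = "the pizza was delicious and the service was quick"
print(add_char_noise(review))    # noisy surface form of the review
print(hypernym_replace(review))  # e.g. "the dish was delicious ..."
```

Consistent with the finding that generation quality peaks at roughly three times the original data, a practical recipe would be to produce about two augmented variants of each original review and finetune on the combined set.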
Related papers
- Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation.
We introduce novel methodologies and datasets to overcome these challenges.
We propose MhBART, an encoder-decoder model designed to emulate human writing style.
We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z)
- Topic-to-essay generation with knowledge-based content selection [1.0625748132006634]
We propose a novel copy mechanism model with a content selection module that integrates rich semantic knowledge from the language model into the decoder.
Experimental results demonstrate that the proposed model can improve the generated text diversity by 35% to 59% compared to the state-of-the-art method.
arXiv Detail & Related papers (2024-02-26T02:14:42Z)
- RankAug: Augmented data ranking for text classification [0.0]
RankAug is a text-ranking approach that detects and filters out the top augmented texts.
We demonstrate that the judicious selection of filtering techniques can yield a substantial improvement of up to 35% in classification accuracy for under-represented classes; a generic sketch of the filtering idea follows this entry.
arXiv Detail & Related papers (2023-11-08T08:47:49Z)
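The rank-then-filter idea can be illustrated generically. The sketch below is an assumption-laden stand-in, not RankAug's method: it scores augmented candidates with a simple lexical-similarity measure from Python's difflib, whereas RankAug uses a learned text-ranking approach, and keeping the most similar candidates is only one plausible filtering criterion.

```python
import difflib

def filter_augmented(original: str, candidates: list[str], keep: int = 3) -> list[str]:
    """Rank augmented candidates by lexical similarity to the original text
    and keep the top `keep`. difflib is an illustrative stand-in scorer;
    a real system would use a learned ranker."""
    ranked = sorted(
        candidates,
        key=lambda cand: difflib.SequenceMatcher(None, original, cand).ratio(),
        reverse=True,  # most similar first; invert to discard near-duplicates instead
    )
    return ranked[:keep]

original = "the pizza was delicious and the service was friendly"
candidates = [
    "the meal was delicious and the staff was friendly",
    "the pizza was tasty and the service was quick",
    "an unrelated sentence about the weather",
]
print(filter_augmented(original, candidates, keep=2))
```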
- On the Possibilities of AI-Generated Text Detection [76.55825911221434]
We argue that as machine-generated text approaches human-like quality, the sample size needed for reliable detection increases.
We test various state-of-the-art text generators, including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against detectors including RoBERTa-Large/Base-Detector and GPTZero.
arXiv Detail & Related papers (2023-04-10T17:47:39Z)
- Sequentially Controlled Text Generation [97.22539956688443]
While GPT-2 generates sentences that are remarkably human-like, longer documents can ramble and fail to follow human-like writing structure.
We study the problem of imposing structure on long-range text.
We develop a sequential controlled text generation pipeline with generation and editing.
arXiv Detail & Related papers (2023-01-05T21:23:51Z)
- A Benchmark Corpus for the Detection of Automatically Generated Text in Academic Publications [0.02578242050187029]
This paper presents two datasets of artificially generated research content.
In the first case, the content is completely generated by the GPT-2 model after a short prompt extracted from original papers.
The partial or hybrid dataset is created by replacing several sentences of abstracts with sentences generated by the Arxiv-NLP model.
We evaluate the quality of the datasets by comparing the generated texts to aligned original texts using fluency metrics such as BLEU and ROUGE.
arXiv Detail & Related papers (2022-02-04T08:16:56Z)
- A Survey on Retrieval-Augmented Text Generation [53.04991859796971]
Retrieval-augmented text generation has remarkable advantages and has achieved state-of-the-art performance in many NLP tasks.
The survey first presents the generic paradigm of retrieval-augmented generation, then reviews notable approaches for different tasks.
arXiv Detail & Related papers (2022-02-02T16:18:41Z)
- Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation [85.32991360774447]
Natural language generation (NLG) spans a broad range of tasks, each serving specific objectives.
We propose a unifying perspective based on the nature of information change in NLG tasks.
We develop a family of interpretable metrics that are suitable for evaluating key aspects of different NLG tasks.
arXiv Detail & Related papers (2021-09-14T01:00:42Z)
- From Machine Translation to Code-Switching: Generating High-Quality Code-Switched Text [14.251949110756078]
We adapt a state-of-the-art neural machine translation model to generate Hindi-English code-switched sentences.
We show significant reductions in perplexity on a language modeling task.
We also show improvements using our text for a downstream code-switched natural language inference task.
arXiv Detail & Related papers (2021-07-14T04:46:39Z)
- Controllable Text Generation with Focused Variation [71.07811310799664]
Focused-Variation Network (FVN) is a novel model for controlling language generation.
FVN learns disjoint discrete latent spaces for each attribute inside codebooks, which allows for both controllability and diversity.
We evaluate FVN on two text generation datasets with annotated content and style, and show state-of-the-art performance as assessed by automatic and human evaluations.
arXiv Detail & Related papers (2020-09-25T06:31:06Z)