How much do language models copy from their training data? Evaluating
linguistic novelty in text generation using RAVEN
- URL: http://arxiv.org/abs/2111.09509v1
- Date: Thu, 18 Nov 2021 04:07:09 GMT
- Title: How much do language models copy from their training data? Evaluating
linguistic novelty in text generation using RAVEN
- Authors: R. Thomas McCoy, Paul Smolensky, Tal Linzen, Jianfeng Gao, Asli
Celikyilmaz
- Abstract summary: Current language models can generate high-quality text.
Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions?
We introduce RAVEN, a suite of analyses for assessing the novelty of generated text.
- Score: 63.79300884115027
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current language models can generate high-quality text. Are they simply
copying text they have seen before, or have they learned generalizable
linguistic abstractions? To tease apart these possibilities, we introduce
RAVEN, a suite of analyses for assessing the novelty of generated text,
focusing on sequential structure (n-grams) and syntactic structure. We apply
these analyses to four neural language models (an LSTM, a Transformer,
Transformer-XL, and GPT-2). For local structure - e.g., individual dependencies
- model-generated text is substantially less novel than our baseline of
human-generated text from each model's test set. For larger-scale structure -
e.g., overall sentence structure - model-generated text is as novel or even
more novel than the human-generated baseline, but models still sometimes copy
substantially, in some cases duplicating passages over 1,000 words long from
the training set. We also perform extensive manual analysis showing that
GPT-2's novel text is usually well-formed morphologically and syntactically but
has reasonably frequent semantic issues (e.g., being self-contradictory).
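The core of RAVEN's sequential analysis is checking whether each n-gram in generated text also occurs somewhere in the training data. A minimal sketch of that idea follows; the function names and toy corpus are illustrative, not RAVEN's actual implementation:

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def novelty(generated, training, n):
    """Fraction of n-grams in `generated` that never occur in `training`.

    RAVEN-style novelty: 0.0 means every generated n-gram was seen
    during training (pure copying at this scale); 1.0 means none were.
    """
    seen = set(ngrams(training, n))
    gen = ngrams(generated, n)
    if not gen:
        return 0.0
    novel = sum(1 for g in gen if g not in seen)
    return novel / len(gen)

train = "the cat sat on the mat".split()
gen = "the cat sat on the rug".split()
print(novelty(gen, train, 2))  # 1 of 5 bigrams ("the rug") is novel -> 0.2
```

In practice the training set is far too large for an in-memory set of all n-grams, so implementations typically use a suffix array or similar index over the corpus; the paper's finding that novelty is low for small n and high for large n falls out directly from statistics like this one computed at several values of n.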
Related papers
- Detection and Measurement of Syntactic Templates in Generated Text [58.111650675717414]
We offer an analysis of syntactic features to characterize general repetition in models.
We find that models tend to produce templated text in downstream tasks at a higher rate than what is found in human-reference texts.
arXiv Detail & Related papers (2024-06-28T19:34:23Z)
- Retrieval is Accurate Generation [99.24267226311157]
We introduce a novel method that selects context-aware phrases from a collection of supporting documents.
Our model achieves the best performance and the lowest latency among several retrieval-augmented baselines.
arXiv Detail & Related papers (2024-02-27T14:16:19Z)
- Threads of Subtlety: Detecting Machine-Generated Texts Through Discourse Motifs [19.073560504913356]
The line between human-crafted and machine-generated texts has become increasingly blurred.
This paper investigates whether human-written texts exhibit discernible and unique linguistic properties.
arXiv Detail & Related papers (2024-02-16T11:20:30Z)
- Deep dive into language traits of AI-generated Abstracts [5.209583971923267]
Generative language models, such as ChatGPT, have garnered attention for their ability to generate human-like writing.
In this work, we attempt to detect the Abstracts generated by ChatGPT, which are much shorter in length and bounded.
We extract the texts' semantic and lexical properties and observe that traditional machine learning models can confidently detect these Abstracts.
arXiv Detail & Related papers (2023-12-17T06:03:33Z)
- Detecting Text Formality: A Study of Text Classification Approaches [78.11745751651708]
This work proposes the first, to our knowledge, systematic study of formality detection methods based on statistical, neural, and Transformer-based machine learning approaches.
We conducted three types of experiments -- monolingual, multilingual, and cross-lingual.
The study shows that the Char BiLSTM model outperforms Transformer-based ones on the monolingual and multilingual formality classification tasks.
arXiv Detail & Related papers (2022-04-19T16:23:07Z)
- Uniform Complexity for Text Generation [4.867923281108005]
We introduce Uniform Complexity for Text Generation (UCTG), a new benchmark that challenges generative models to maintain uniform linguistic properties with respect to their prompts.
We find that models such as GPT-2 struggle to preserve the complexity of input prompts in their generations, even when fine-tuned on professionally written texts.
arXiv Detail & Related papers (2022-04-11T15:19:47Z)
- A Benchmark Corpus for the Detection of Automatically Generated Text in Academic Publications [0.02578242050187029]
This paper presents two datasets of artificially generated research content.
In the first case, the content is completely generated by the GPT-2 model after a short prompt extracted from original papers.
The partial or hybrid dataset is created by replacing several sentences of abstracts with sentences that are generated by the Arxiv-NLP model.
We evaluate the quality of the datasets comparing the generated texts to aligned original texts using fluency metrics such as BLEU and ROUGE.
arXiv Detail & Related papers (2022-02-04T08:16:56Z)
- Neural Deepfake Detection with Factual Structure of Text [78.30080218908849]
We propose a graph-based model for deepfake detection of text.
Our approach represents the factual structure of a given document as an entity graph.
Our model can distinguish the difference in the factual structure between machine-generated text and human-written text.
arXiv Detail & Related papers (2020-10-15T02:35:31Z)
- Progressive Generation of Long Text with Pretrained Language Models [83.62523163717448]
Large-scale language models (LMs) pretrained on massive corpora of text, such as GPT-2, are powerful open-domain text generators.
It is still challenging for such models to generate coherent long passages of text, especially when the models are fine-tuned to the target domain on a small corpus.
We propose a simple but effective method of generating text in a progressive manner, inspired by generating images from low to high resolution.
arXiv Detail & Related papers (2020-06-28T21:23:05Z)
- Russian Natural Language Generation: Creation of a Language Modelling Dataset and Evaluation with Modern Neural Architectures [0.0]
We provide a novel reference dataset for Russian language modeling.
We experiment with popular modern methods for text generation, namely variational autoencoders and generative adversarial networks.
We evaluate the generated text regarding metrics such as perplexity, grammatical correctness and lexical diversity.
arXiv Detail & Related papers (2020-05-05T20:20:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.