Unsupervised and Distributional Detection of Machine-Generated Text
- URL: http://arxiv.org/abs/2111.02878v1
- Date: Thu, 4 Nov 2021 14:07:46 GMT
- Title: Unsupervised and Distributional Detection of Machine-Generated Text
- Authors: Matthias Gallé, Jos Rozen, Germán Kruszewski, Hady Elsahar
- Abstract summary: The power of natural language generation models has provoked a flurry of interest in automatic methods to detect if a piece of text is human or machine-authored.
We propose a method to detect those machine-generated documents leveraging repeated higher-order n-grams.
Our experiments show that leveraging that signal allows us to rank suspicious documents accurately.
- Score: 1.552214657968262
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The power of natural language generation models has provoked a flurry of
interest in automatic methods to detect if a piece of text is human or
machine-authored. The problem so far has been framed in a standard supervised
way and consists in training a classifier on annotated data to predict the
origin of one given new document. In this paper, we frame the problem in an
unsupervised and distributional way: we assume that we have access to a large
collection of unannotated documents, a big fraction of which is
machine-generated. We propose a method to detect those machine-generated
documents leveraging repeated higher-order n-grams, which we show over-appear
in machine-generated text as compared to human ones. That weak signal is the
starting point of a self-training setting where pseudo-labelled documents are
used to train an ensemble of classifiers. Our experiments show that leveraging
that signal allows us to rank suspicious documents accurately. Precision at
5000 is over 90% for top-k sampling strategies, and over 80% for nucleus
sampling for the largest model we used (GPT2-large). The drop with increased
size of model is small, which could indicate that the results hold for other
current and future large language models.
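The repetition signal described in the abstract can be sketched in a few lines: score each document by how often its higher-order n-grams recur, and rank documents by that score. This is only an illustrative sketch under simplifying assumptions (whitespace tokenization, order n=4, a plain repetition fraction); the function names and thresholds are ours, not the paper's exact setup.

```python
from collections import Counter


def repeated_ngram_score(text, n=4):
    """Fraction of n-grams that occur more than once in the document.

    Machine-generated text tends to repeat higher-order n-grams more
    often than human text, so a higher score marks a document as more
    suspicious. The order n=4 is an illustrative choice.
    """
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


def rank_suspicious(docs, n=4):
    """Rank documents from most to least suspicious by repetition score."""
    return sorted(docs, key=lambda d: repeated_ngram_score(d, n), reverse=True)
```

In the paper this weak signal only seeds a self-training loop: the highest- and lowest-ranked documents become pseudo-labels for an ensemble of classifiers, which is not shown here.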
Related papers
- Applying Ensemble Methods to Model-Agnostic Machine-Generated Text Detection [0.0]
We study the problem of detecting machine-generated text when the large language model it is possibly derived from is unknown.
We use a zero-shot model for machine-generated text detection which is highly accurate when the generative (or base) language model is the same as the discriminative (or scoring) language model.
arXiv Detail & Related papers (2024-06-18T12:58:01Z)
- Smaller Language Models are Better Black-box Machine-Generated Text Detectors [56.36291277897995]
Small and partially-trained models are better universal text detectors.
We find that whether the detector and generator were trained on the same data is not critically important to the detection success.
For instance, the OPT-125M model has an AUC of 0.81 in detecting ChatGPT generations, whereas a larger model from the GPT family, GPTJ-6B, has an AUC of 0.45.
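The AUC figures above measure how well a detector's scores rank machine-generated documents above human ones. A minimal sketch of that metric via the rank-based (Mann-Whitney) formulation; the function name and toy inputs are ours:

```python
def auc(scores, labels):
    """Area under the ROC curve from detector scores.

    `labels` are 1 for machine-generated, 0 for human. AUC is the
    probability that a random machine-generated document scores higher
    than a random human one: 0.5 is chance, and a value below 0.5
    (like the 0.45 reported for GPTJ-6B) is worse than chance.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In a zero-shot setup the scores would come from the scoring model itself, e.g. each document's likelihood under the small detector LM.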
arXiv Detail & Related papers (2023-05-17T00:09:08Z)
- CoCo: Coherence-Enhanced Machine-Generated Text Detection Under Data Limitation With Contrastive Learning [14.637303913878435]
We present a coherence-based contrastive learning model named CoCo to detect the possible MGT under low-resource scenario.
To exploit the linguistic feature, we encode coherence information in form of graph into text representation.
Experimental results on two public datasets and two self-constructed datasets show that our approach significantly outperforms state-of-the-art methods.
arXiv Detail & Related papers (2022-12-20T15:26:19Z)
- Classifiers are Better Experts for Controllable Text Generation [63.17266060165098]
We show that the proposed method significantly outperforms recent PPLM, GeDi, and DExperts on PPL and sentiment accuracy based on the external classifier of generated texts.
At the same time, it is also easier to implement and tune, and has significantly fewer restrictions and requirements.
arXiv Detail & Related papers (2022-05-15T12:58:35Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that doesn't force any structure in the search space: using all n-grams in a passage as its possible identifiers.
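The "all n-grams as identifiers" idea can be illustrated with a toy inverted index mapping every n-gram of every passage back to the passages containing it. The real system performs constrained autoregressive decoding over a compressed substring index, which this sketch does not attempt; the function name is ours.

```python
def ngram_index(passages, max_n=3):
    """Map every n-gram (up to max_n tokens) of each passage to the
    set of passage ids that contain it.

    Any generated substring can then point back to a passage, with no
    hierarchy imposed on the search space. A toy sketch only: a real
    implementation would use a compressed substring index rather than
    materializing all n-grams.
    """
    index = {}
    for pid, passage in enumerate(passages):
        tokens = passage.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                index.setdefault(tuple(tokens[i:i + n]), set()).add(pid)
    return index
```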
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
- Robust Document Representations using Latent Topics and Metadata [17.306088038339336]
We propose a novel approach to fine-tuning a pre-trained neural language model for document classification problems.
We generate document representations that capture both text and metadata artifacts in a task-specific manner.
Our solution also incorporates metadata explicitly rather than merely appending it to the text.
arXiv Detail & Related papers (2020-10-23T21:52:38Z)
- Neural Deepfake Detection with Factual Structure of Text [78.30080218908849]
We propose a graph-based model for deepfake detection of text.
Our approach represents the factual structure of a given document as an entity graph.
Our model can distinguish the difference in the factual structure between machine-generated text and human-written text.
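The entity-graph idea can be made concrete with a toy construction: treat entities as nodes and connect two entities that co-occur in the same sentence. Here a naive capitalized-word heuristic stands in for a real NER step, so this only illustrates the intuition, not the paper's model.

```python
from itertools import combinations


def entity_graph(document):
    """Build a toy entity graph for a document.

    Nodes are 'entities' (naively, capitalized words; a real system
    would use an NER model), and an edge links two entities that
    co-occur in a sentence. Comparing such graphs between texts is the
    intuition behind factual-structure-based deepfake detection.
    """
    edges = set()
    for sentence in document.split("."):
        ents = {w.strip(",;") for w in sentence.split() if w[:1].isupper()}
        for a, b in combinations(sorted(ents), 2):
            edges.add((a, b))
    return edges
```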
arXiv Detail & Related papers (2020-10-15T02:35:31Z)
- Evidence-Aware Inferential Text Generation with Vector Quantised Variational AutoEncoder [104.25716317141321]
We propose an approach that automatically finds evidence for an event from a large text corpus, and leverages the evidence to guide the generation of inferential texts.
Our approach provides state-of-the-art performance on both Event2Mind and ATOMIC datasets.
arXiv Detail & Related papers (2020-06-15T02:59:52Z)
- ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators [108.3381301768299]
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens.
We propose a more sample-efficient pre-training task called replaced token detection.
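The contrast between the two pre-training tasks can be shown with a toy data-construction sketch: instead of hiding tokens behind [MASK], some tokens are swapped for alternatives and every position gets an original/replaced label. Here a uniform vocabulary sample stands in for ELECTRA's small trained generator, so this only shows the shape of the training signal; the function name is ours.

```python
import random


def replaced_token_detection_pair(tokens, vocab, replace_prob=0.15, seed=0):
    """Corrupt a token sequence ELECTRA-style and label each position.

    MLM (e.g. BERT) hides tokens behind [MASK] and reconstructs them;
    replaced token detection instead swaps some tokens for plausible
    alternatives and trains a discriminator to predict, for every
    position, original (0) vs replaced (1). A random vocabulary sample
    stands in here for the small generator model.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            replacement = rng.choice([v for v in vocab if v != tok])
            corrupted.append(replacement)
            labels.append(1)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels
```

Because every position is supervised (not just the ~15% that MLM masks), the task yields more training signal per example, which is the sample-efficiency argument above.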
arXiv Detail & Related papers (2020-03-23T21:17:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.