Unsupervised and Distributional Detection of Machine-Generated Text
- URL: http://arxiv.org/abs/2111.02878v1
- Date: Thu, 4 Nov 2021 14:07:46 GMT
- Title: Unsupervised and Distributional Detection of Machine-Generated Text
- Authors: Matthias Gallé, Jos Rozen, Germán Kruszewski, Hady Elsahar
- Abstract summary: The power of natural language generation models has provoked a flurry of interest in automatic methods to detect if a piece of text is human or machine-authored.
We propose a method to detect those machine-generated documents leveraging repeated higher-order n-grams.
Our experiments show that leveraging that signal allows us to rank suspicious documents accurately.
- Score: 1.552214657968262
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The power of natural language generation models has provoked a flurry of
interest in automatic methods to detect if a piece of text is human or
machine-authored. The problem so far has been framed in a standard supervised
way and consists in training a classifier on annotated data to predict the
origin of one given new document. In this paper, we frame the problem in an
unsupervised and distributional way: we assume that we have access to a large
collection of unannotated documents, a big fraction of which is
machine-generated. We propose a method to detect those machine-generated
documents leveraging repeated higher-order n-grams, which we show over-appear
in machine-generated text as compared to human ones. That weak signal is the
starting point of a self-training setting where pseudo-labelled documents are
used to train an ensemble of classifiers. Our experiments show that leveraging
that signal allows us to rank suspicious documents accurately. Precision at
5000 is over 90% for top-k sampling strategies, and over 80% for nucleus
sampling for the largest model we used (GPT2-large). The drop with increased
size of model is small, which could indicate that the results hold for other
current and future large language models.
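The repetition signal described in the abstract can be sketched in a few lines: score each document by how often its higher-order n-grams recur, and rank documents by that score. This is only an illustrative sketch under simplifying assumptions (whitespace tokenization, order n=4, a plain repetition fraction); the function names and thresholds are ours, not the paper's exact setup.

```python
from collections import Counter


def repeated_ngram_score(text, n=4):
    """Fraction of n-grams that occur more than once in the document.

    Machine-generated text tends to repeat higher-order n-grams more
    often than human text, so a higher score marks a document as more
    suspicious. The order n=4 is an illustrative choice.
    """
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


def rank_suspicious(docs, n=4):
    """Rank documents from most to least suspicious by repetition score."""
    return sorted(docs, key=lambda d: repeated_ngram_score(d, n), reverse=True)
```

In the paper this weak signal only seeds a self-training loop: the highest- and lowest-ranked documents become pseudo-labels for an ensemble of classifiers, which is not shown here.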
Related papers
- Applying Ensemble Methods to Model-Agnostic Machine-Generated Text Detection [0.0]
We study the problem of detecting machine-generated text when the large language model it is possibly derived from is unknown.
We use a zero-shot model for machine-generated text detection which is highly accurate when the generative (or base) language model is the same as the discriminative (or scoring) language model.
arXiv Detail & Related papers (2024-06-18T12:58:01Z)
- Smaller Language Models are Better Black-box Machine-Generated Text Detectors [56.36291277897995]
Small and partially-trained models are better universal text detectors.
We find that whether the detector and generator were trained on the same data is not critically important to the detection success.
For instance, the OPT-125M model has an AUC of 0.81 in detecting ChatGPT generations, whereas a larger model from the GPT family, GPTJ-6B, has an AUC of 0.45.
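The AUC figures above measure how well a detector's scores rank machine-generated documents above human ones. A minimal sketch of that metric via the rank-based (Mann-Whitney) formulation; the function name and toy inputs are ours:

```python
def auc(scores, labels):
    """Area under the ROC curve from detector scores.

    `labels` are 1 for machine-generated, 0 for human. AUC is the
    probability that a random machine-generated document scores higher
    than a random human one: 0.5 is chance, and a value below 0.5
    (like the 0.45 reported for GPTJ-6B) is worse than chance.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In a zero-shot setup the scores would come from the scoring model itself, e.g. each document's likelihood under the small detector LM.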
arXiv Detail & Related papers (2023-05-17T00:09:08Z)
- CoCo: Coherence-Enhanced Machine-Generated Text Detection Under Data Limitation With Contrastive Learning [14.637303913878435]
We present a coherence-based contrastive learning model named CoCo to detect the possible MGT under low-resource scenario.
To exploit the linguistic feature, we encode coherence information in form of graph into text representation.
Experimental results on two public datasets and two self-constructed datasets show that our approach significantly outperforms state-of-the-art methods.
arXiv Detail & Related papers (2022-12-20T15:26:19Z)
- Classifiers are Better Experts for Controllable Text Generation [63.17266060165098]
We show that the proposed method significantly outperforms recent PPLM, GeDi, and DExperts on PPL and sentiment accuracy based on the external classifier of generated texts.
At the same time, it is also easier to implement and tune, and has significantly fewer restrictions and requirements.
arXiv Detail & Related papers (2022-05-15T12:58:35Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that doesn't force any structure in the search space: using all n-grams in a passage as its possible identifiers.
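The "all n-grams as identifiers" idea can be illustrated with a toy inverted index mapping every n-gram of every passage back to the passages containing it. The real system performs constrained autoregressive decoding over a compressed substring index, which this sketch does not attempt; the function name is ours.

```python
def ngram_index(passages, max_n=3):
    """Map every n-gram (up to max_n tokens) of each passage to the
    set of passage ids that contain it.

    Any generated substring can then point back to a passage, with no
    hierarchy imposed on the search space. A toy sketch only: a real
    implementation would use a compressed substring index rather than
    materializing all n-grams.
    """
    index = {}
    for pid, passage in enumerate(passages):
        tokens = passage.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                index.setdefault(tuple(tokens[i:i + n]), set()).add(pid)
    return index
```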
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
- Robust Document Representations using Latent Topics and Metadata [17.306088038339336]
We propose a novel approach to fine-tuning a pre-trained neural language model for document classification problems.
We generate document representations that capture both text and metadata artifacts in a task-specific manner.
Our solution also incorporates metadata explicitly rather than merely appending it to the text.
arXiv Detail & Related papers (2020-10-23T21:52:38Z)
- Neural Deepfake Detection with Factual Structure of Text [78.30080218908849]
We propose a graph-based model for deepfake detection of text.
Our approach represents the factual structure of a given document as an entity graph.
Our model can distinguish the difference in the factual structure between machine-generated text and human-written text.
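The entity-graph idea can be made concrete with a toy construction: treat entities as nodes and connect two entities that co-occur in the same sentence. Here a naive capitalized-word heuristic stands in for a real NER step, so this only illustrates the intuition, not the paper's model.

```python
from itertools import combinations


def entity_graph(document):
    """Build a toy entity graph for a document.

    Nodes are 'entities' (naively, capitalized words; a real system
    would use an NER model), and an edge links two entities that
    co-occur in a sentence. Comparing such graphs between texts is the
    intuition behind factual-structure-based deepfake detection.
    """
    edges = set()
    for sentence in document.split("."):
        ents = {w.strip(",;") for w in sentence.split() if w[:1].isupper()}
        for a, b in combinations(sorted(ents), 2):
            edges.add((a, b))
    return edges
```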
arXiv Detail & Related papers (2020-10-15T02:35:31Z)
- Evidence-Aware Inferential Text Generation with Vector Quantised Variational AutoEncoder [104.25716317141321]
We propose an approach that automatically finds evidence for an event from a large text corpus, and leverages the evidence to guide the generation of inferential texts.
Our approach provides state-of-the-art performance on both Event2Mind and ATOMIC datasets.
arXiv Detail & Related papers (2020-06-15T02:59:52Z)
- ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators [108.3381301768299]
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens.
We propose a more sample-efficient pre-training task called replaced token detection.
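The contrast between the two pre-training tasks can be shown with a toy data-construction sketch: instead of hiding tokens behind [MASK], some tokens are swapped for alternatives and every position gets an original/replaced label. Here a uniform vocabulary sample stands in for ELECTRA's small trained generator, so this only shows the shape of the training signal; the function name is ours.

```python
import random


def replaced_token_detection_pair(tokens, vocab, replace_prob=0.15, seed=0):
    """Corrupt a token sequence ELECTRA-style and label each position.

    MLM (e.g. BERT) hides tokens behind [MASK] and reconstructs them;
    replaced token detection instead swaps some tokens for plausible
    alternatives and trains a discriminator to predict, for every
    position, original (0) vs replaced (1). A random vocabulary sample
    stands in here for the small generator model.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            replacement = rng.choice([v for v in vocab if v != tok])
            corrupted.append(replacement)
            labels.append(1)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels
```

Because every position is supervised (not just the ~15% that MLM masks), the task yields more training signal per example, which is the sample-efficiency argument above.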
arXiv Detail & Related papers (2020-03-23T21:17:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.