Are You Robert or RoBERTa? Deceiving Online Authorship Attribution
Models Using Neural Text Generators
- URL: http://arxiv.org/abs/2203.09813v1
- Date: Fri, 18 Mar 2022 09:19:14 GMT
- Authors: Keenan Jones, Jason R. C. Nurse, Shujun Li
- Abstract summary: GPT-2 and XLM language models are used to generate texts using existing posts of online users.
We then examine whether these AI-based text generators are capable of mimicking authorial style to such a degree that they can deceive typical AA models.
Our findings highlight the current capacity of powerful natural language models to generate original online posts capable of mimicking authorial style.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, there has been a rise in the development of powerful pre-trained
natural language models, including GPT-2, Grover, and XLM. These models have
shown state-of-the-art performance on a variety of NLP tasks, including
question answering, content summarisation, and text generation.
Alongside this, there have been many studies focused on online authorship
attribution (AA). That is, the use of models to identify the authors of online
texts. Given the power of natural language models in generating convincing
texts, this paper examines the degree to which these language models can
generate texts capable of deceiving online AA models. Experimenting with both
blog and Twitter data, we utilise GPT-2 language models to generate texts using
the existing posts of online users. We then examine whether these AI-based text
generators are capable of mimicking authorial style to such a degree that they
can deceive typical AA models. From this, we find that current AI-based text
generators are able to successfully mimic authorship on both datasets. Our
findings, in turn, highlight the current
capacity of powerful natural language models to generate original online posts
capable of mimicking authorial style sufficiently to deceive popular AA
methods; a key finding given the proposed role of AA in real-world applications
such as spam detection and forensic investigation.
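The experimental loop the abstract describes (train an AA model on genuine user posts, then check whether it attributes a generated post to the mimicked author) can be sketched in miniature. This is a hedged toy sketch, not the paper's method: the character-trigram cosine-similarity attributor below stands in for the paper's actual AA models, and all author names and posts are invented examples.

```python
# Toy sketch of an authorship-attribution (AA) deception check:
# build per-author style profiles from genuine posts, then see which
# author a generated text is attributed to.
from collections import Counter
from math import sqrt

def trigrams(text):
    """Character-trigram frequency profile of a text (a common AA feature)."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a, b):
    """Cosine similarity between two trigram Counters."""
    dot = sum(a[g] * b[g] for g in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def attribute(profiles, text):
    """Attribute a text to the author whose profile it most resembles."""
    grams = trigrams(text)
    return max(profiles, key=lambda author: cosine(profiles[author], grams))

# Invented per-author posts standing in for a user's existing online posts.
posts = {
    "alice": ["honestly, the coffee here is just lovely!!",
              "honestly cannot wait for the weekend, so lovely"],
    "bob":   ["deployment failed again. rolling back the release.",
              "the release pipeline is broken again. investigating."],
}
profiles = {a: sum((trigrams(p) for p in ps), Counter()) for a, ps in posts.items()}

# A (hypothetical) generated post mimicking alice's style: if the AA model
# assigns it to alice, the generator has, in this toy sense, deceived it.
generated = "honestly, such a lovely morning, cannot wait!!"
print(attribute(profiles, generated))  # → alice
```

In the paper the generator is a GPT-2 model conditioned on a user's posts and the AA model is a trained classifier; this sketch only illustrates the shape of the evaluation.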
Related papers
- Language Models can Self-Lengthen to Generate Long Texts [74.96074422345806]
This paper introduces an innovative iterative training framework called Self-Lengthen.
It leverages only the intrinsic knowledge and skills of Large Language Models without the need for auxiliary data or proprietary models.
Experiments on benchmarks and human evaluations show that Self-Lengthen outperforms existing methods in long-text generation.
arXiv Detail & Related papers (2024-10-31T13:47:10Z)
- FOCUS: Forging Originality through Contrastive Use in Self-Plagiarism for Language Models [38.76912842622624]
Pre-trained Language Models (PLMs) have shown impressive results in various Natural Language Generation (NLG) tasks.
This study introduces a unique "self-plagiarism" contrastive decoding strategy, aimed at boosting the originality of text produced by PLMs.
arXiv Detail & Related papers (2024-06-02T19:17:00Z)
- AI Content Self-Detection for Transformer-based Large Language Models [0.0]
This paper introduces the idea of direct origin detection and evaluates whether generative AI systems can recognize their output and distinguish it from human-written texts.
Google's Bard model exhibits the strongest self-detection capability, with an accuracy of 94%, followed by OpenAI's ChatGPT with 83%.
arXiv Detail & Related papers (2023-12-28T10:08:57Z)
- TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering [118.30923824681642]
TextDiffuser-2 aims to unleash the power of language models for text rendering.
We utilize the language model within the diffusion model to encode the position and texts at the line level.
We conduct extensive experiments and incorporate user studies involving human participants as well as GPT-4V.
arXiv Detail & Related papers (2023-11-28T04:02:40Z)
- Generative AI Text Classification using Ensemble LLM Approaches [0.12483023446237698]
Large Language Models (LLMs) have shown impressive performance across a variety of AI and natural language processing tasks.
We propose an ensemble neural model that generates probabilities from different pre-trained LLMs.
For the first task of distinguishing between AI and human generated text, our model ranked in fifth and thirteenth place.
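The core of the ensemble idea above is combining per-model probabilities into one score. A minimal sketch under stated assumptions: each pre-trained LLM is assumed to act as a detector emitting a probability that a text is AI-generated, and the ensemble averages those probabilities. The scores below are hard-coded stand-ins; in the paper each would come from an actual LLM classifier.

```python
# Minimal sketch of an ensemble over per-model P(AI-generated) scores.
def ensemble_probability(model_scores, weights=None):
    """Combine per-model probabilities by (optionally weighted) averaging."""
    if weights is None:
        weights = [1.0] * len(model_scores)
    total = sum(weights)
    return sum(p * w for p, w in zip(model_scores, weights)) / total

# Hypothetical probabilities from three detector LLMs for one input text.
scores = [0.92, 0.88, 0.61]
combined = ensemble_probability(scores)
label = "ai-generated" if combined >= 0.5 else "human"
print(round(combined, 3), label)  # → 0.803 ai-generated
```

Weighted averaging lets a stronger detector dominate; the paper's exact combination rule may differ, so treat this as one plausible instantiation.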
arXiv Detail & Related papers (2023-09-14T14:41:46Z)
- BatGPT: A Bidirectional Autoregessive Talker from Generative Pre-trained Transformer [77.28871523946418]
BatGPT is a large-scale language model designed and trained jointly by Wuhan University and Shanghai Jiao Tong University.
It is capable of generating highly natural and fluent text in response to various types of input, including text prompts, images, and audio.
arXiv Detail & Related papers (2023-07-01T15:10:01Z)
- Extrapolating Multilingual Understanding Models as Multilingual Generators [82.1355802012414]
This paper explores methods to endow multilingual understanding models with generation abilities, yielding a unified model.
We propose a Semantic-Guided Alignment-then-Denoising (SGA) approach to adapt an encoder into a multilingual generator with a small number of new parameters.
arXiv Detail & Related papers (2023-05-22T15:33:21Z)
- Smaller Language Models are Better Black-box Machine-Generated Text Detectors [56.36291277897995]
Small and partially-trained models are better universal text detectors.
We find that whether the detector and generator were trained on the same data is not critically important to the detection success.
For instance, the OPT-125M model has an AUC of 0.81 in detecting ChatGPT generations, whereas a larger model from the GPT family, GPT-J-6B, has an AUC of 0.45.
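The AUC figures quoted above can be computed without any ML library once a detector has scored each text. This is a hedged sketch: the scores are invented stand-ins for a detector such as OPT-125M's likelihood-based score, and only the AUC arithmetic itself is shown.

```python
# ROC AUC via the Mann-Whitney formulation: the probability that a random
# machine-generated text scores above a random human text (ties count half).
def auc(machine_scores, human_scores):
    wins = 0.0
    for m in machine_scores:
        for h in human_scores:
            if m > h:
                wins += 1.0
            elif m == h:
                wins += 0.5
    return wins / (len(machine_scores) * len(human_scores))

machine = [0.9, 0.8, 0.75, 0.4]   # detector scores on machine-generated texts
human = [0.7, 0.5, 0.3, 0.2]      # detector scores on human-written texts
print(auc(machine, human))  # → 0.875
```

An AUC of 0.5 means the detector ranks machine and human texts no better than chance, which puts the 0.45 figure for GPT-J-6B below chance level.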
arXiv Detail & Related papers (2023-05-17T00:09:08Z)
- ChatGPT is not all you need. A State of the Art Review of large Generative AI models [0.0]
This work attempts to concisely describe the main models affected by generative AI and to provide a taxonomy of the main generative models published recently.
arXiv Detail & Related papers (2023-01-11T15:48:36Z)
- How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN [63.79300884115027]
Current language models can generate high-quality text.
Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions?
We introduce RAVEN, a suite of analyses for assessing the novelty of generated text.
arXiv Detail & Related papers (2021-11-18T04:07:09Z)
- BERT Transformer model for Detecting Arabic GPT2 Auto-Generated Tweets [6.18447297698017]
We propose a transfer-learning-based model that can detect whether an Arabic sentence was written by a human or automatically generated by a bot.
Our new transfer-learning model achieves an accuracy of up to 98%.
To the best of our knowledge, this work is the first study where ARABERT and GPT2 were combined to detect and classify the Arabic auto-generated texts.
arXiv Detail & Related papers (2021-01-22T21:50:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.