Using Machine Learning to Distinguish Human-written from Machine-generated Creative Fiction
- URL: http://arxiv.org/abs/2412.15253v1
- Date: Sun, 15 Dec 2024 12:46:57 GMT
- Title: Using Machine Learning to Distinguish Human-written from Machine-generated Creative Fiction
- Authors: Andrea Cristina McGlinchey, Peter J Barclay
- Abstract summary: Training a Large Language Model on writers' output to generate "sham books" in a particular style seems to constitute a new form of plagiarism. In this study, we trained Machine Learning classifier models to distinguish short samples of human-written from machine-generated creative fiction.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Following the universal availability of generative AI systems with the release of ChatGPT, automatic detection of deceptive text created by Large Language Models has focused on domains such as academic plagiarism and "fake news". However, generative AI also poses a threat to the livelihood of creative writers, and perhaps to literary culture in general, through reduction in quality of published material. Training a Large Language Model on writers' output to generate "sham books" in a particular style seems to constitute a new form of plagiarism. This problem has been little researched. In this study, we trained Machine Learning classifier models to distinguish short samples of human-written from machine-generated creative fiction, focusing on classic detective novels. Our results show that a Naive Bayes and a Multi-Layer Perceptron classifier achieved a high degree of success (accuracy > 95%), significantly outperforming human judges (accuracy < 55%). This approach worked well with short text samples (around 100 words), which previous research has shown to be difficult to classify. We have deployed an online proof-of-concept classifier tool, AI Detective, as a first step towards developing lightweight and reliable applications for use by editors and publishers, with the aim of protecting the economic and cultural contribution of human authors.
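The abstract names the models but not the pipeline; as a minimal sketch of that kind of setup, assuming scikit-learn and placeholder texts and labels in place of the authors' corpus:

```python
# Minimal sketch: bag-of-words Naive Bayes over short (~100-word) samples.
# The texts and labels below are placeholders, not the study's dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

samples = [
    "The inspector paused at the door, listening.",    # human-written (0)
    "Rain traced the window as Holmes lit his pipe.",  # human-written (0)
    "The detective entered the room and observed.",    # machine-generated (1)
    "The culprit was revealed through careful logic.", # machine-generated (1)
]
labels = [0, 0, 1, 1]

X_train, X_test, y_train, y_test = train_test_split(
    samples, labels, test_size=0.5, random_state=42, stratify=labels)

vectorizer = CountVectorizer()  # word-count features
clf = MultinomialNB()
clf.fit(vectorizer.fit_transform(X_train), y_train)
preds = clf.predict(vectorizer.transform(X_test))
print("accuracy:", accuracy_score(y_test, preds))
```

Swapping MultinomialNB for sklearn.neural_network.MLPClassifier would give the Multi-Layer Perceptron variant the abstract also reports.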
Related papers
- "It was 80% me, 20% AI": Seeking Authenticity in Co-Writing with Large Language Models [97.22914355737676]
We examine whether and how writers want to preserve their authentic voice when co-writing with AI tools.
Our findings illuminate conceptions of authenticity in human-AI co-creation.
Readers' responses showed less concern about human-AI co-writing.
arXiv Detail & Related papers (2024-11-20T04:42:32Z)
- AI as Humanity's Salieri: Quantifying Linguistic Creativity of Language Models via Systematic Attribution of Machine Text against Web Text [53.15652021126663]
We present CREATIVITY INDEX as the first step to quantify the linguistic creativity of a text.
To compute CREATIVITY INDEX efficiently, we introduce DJ SEARCH, a novel dynamic programming algorithm.
Experiments reveal that the CREATIVITY INDEX of professional human authors is on average 66.2% higher than that of LLMs.
arXiv Detail & Related papers (2024-10-05T18:55:01Z)
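As a loose illustration of the idea above (not the authors' DJ SEARCH, which matches against web-scale text), a text can be scored by the fraction of its words that fall outside verbatim n-gram matches with a reference corpus:

```python
# Toy n-gram coverage score in the spirit of CREATIVITY INDEX; the real
# DJ SEARCH uses dynamic programming against web text, not reproduced here.
def ngram_coverage(text, corpus, n=5):
    """Fraction of words covered by n-grams that also occur in `corpus`."""
    words = text.split()
    corpus_words = corpus.split()
    corpus_ngrams = {tuple(corpus_words[i:i + n])
                     for i in range(len(corpus_words) - n + 1)}
    covered = [False] * len(words)
    for i in range(len(words) - n + 1):
        if tuple(words[i:i + n]) in corpus_ngrams:
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered) / max(len(words), 1)

def creativity_index(text, corpus, n=5):
    # Higher = less of the text is reconstructible from the corpus.
    return 1.0 - ngram_coverage(text, corpus, n)
```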
- FOCUS: Forging Originality through Contrastive Use in Self-Plagiarism for Language Models [38.76912842622624]
Pre-trained Language Models (PLMs) have shown impressive results in various Natural Language Generation (NLG) tasks.
This study introduces a unique "self-plagiarism" contrastive decoding strategy, aimed at boosting the originality of text produced by PLMs.
arXiv Detail & Related papers (2024-06-02T19:17:00Z)
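The FOCUS prompt design and penalty schedule are in the paper; a generic contrastive-decoding step of the kind it builds on might look like this, with the amplification factor alpha an illustrative choice:

```python
import torch

def contrastive_logits(base_logits, plagiarist_logits, alpha=0.5):
    """Penalize tokens that a 'plagiarism-prone' pass rates highly.

    A generic contrastive-decoding step; FOCUS's actual prompt design
    and penalty schedule differ (see the paper).
    """
    base_logprobs = torch.log_softmax(base_logits, dim=-1)
    plag_logprobs = torch.log_softmax(plagiarist_logits, dim=-1)
    return base_logprobs - alpha * plag_logprobs
```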
- Forging the Forger: An Attempt to Improve Authorship Verification via Data Augmentation [52.72682366640554]
Authorship Verification (AV) is a text classification task concerned with inferring whether a candidate text has been written by one specific author or by someone else.
It has been shown that many AV systems are vulnerable to adversarial attacks, where a malicious author actively tries to fool the classifier by either concealing their writing style, or by imitating the style of another author.
arXiv Detail & Related papers (2024-03-17T16:36:26Z)
- Few-Shot Detection of Machine-Generated Text using Style Representations [4.326503887981912]
Language models that convincingly mimic human writing pose a significant risk of abuse.
We propose to leverage representations of writing style estimated from human-authored text.
We find that features effective at distinguishing among human authors are also effective at distinguishing human from machine authors.
arXiv Detail & Related papers (2024-01-12T17:26:51Z)
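A hedged sketch of that few-shot setup: embed a handful of known human and machine documents with a style encoder (assumed here, not reproduced) and assign a query to the nearer class centroid:

```python
import numpy as np

def nearest_centroid_detect(query_vec, human_vecs, machine_vecs):
    """Few-shot detection by cosine similarity to class centroids.

    `*_vecs` are style embeddings of a few known documents; the style
    encoder itself is assumed (the paper estimates its own style
    representations from human-authored text).
    """
    def centroid(vecs):
        c = np.mean(vecs, axis=0)
        return c / np.linalg.norm(c)

    q = query_vec / np.linalg.norm(query_vec)
    sims = {label: float(q @ centroid(v)) for label, v in
            [("human", human_vecs), ("machine", machine_vecs)]}
    return max(sims, key=sims.get), sims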
- AI Content Self-Detection for Transformer-based Large Language Models [0.0]
This paper introduces the idea of direct origin detection and evaluates whether generative AI systems can recognize their output and distinguish it from human-written texts.
Google's Bard model exhibits the largest capability of self-detection with an accuracy of 94%, followed by OpenAI's ChatGPT with 83%.
arXiv Detail & Related papers (2023-12-28T10:08:57Z)
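Self-detection here is essentially zero-shot prompting; a minimal sketch, with the prompt wording an assumption rather than the paper's exact phrasing:

```python
def self_detection_prompt(text):
    """Build a prompt asking a model whether it authored `text`.

    Hypothetical wording; the paper's exact prompt may differ.
    """
    return (
        "Here is a piece of text:\n\n"
        f"{text}\n\n"
        "Did you generate this text? Answer 'yes' or 'no'."
    )

# Usage would send self_detection_prompt(sample) to the same model that
# may have produced `sample`, then parse the yes/no answer.
```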
- Understanding writing style in social media with a supervised contrastively pre-trained transformer [57.48690310135374]
Online Social Networks serve as fertile ground for harmful behavior, ranging from hate speech to the dissemination of disinformation.
We introduce the Style Transformer for Authorship Representations (STAR), trained on a large corpus of 4.5 × 10^6 authored texts derived from public sources.
Using a support base of 8 documents of 512 tokens, we can discern authors from sets of up to 1616 authors with at least 80% accuracy.
arXiv Detail & Related papers (2023-10-17T09:01:17Z)
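A hedged sketch of the support-base protocol described above: average each candidate author's support embeddings and rank authors by cosine similarity to the query embedding (the STAR encoder itself is assumed):

```python
import numpy as np

def rank_authors(query_vec, support_vecs_by_author):
    """Rank candidate authors by cosine similarity between a query
    embedding and the mean of each author's support embeddings
    (e.g. 8 documents of 512 tokens each, as in the paper's setup).
    The embedding model (STAR) is assumed, not reproduced."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = {}
    for author, vecs in support_vecs_by_author.items():
        c = np.mean(vecs, axis=0)
        scores[author] = float(q @ (c / np.linalg.norm(c)))
    return sorted(scores, key=scores.get, reverse=True)
```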
- Smaller Language Models are Better Black-box Machine-Generated Text Detectors [56.36291277897995]
Small and partially-trained models are better universal text detectors.
We find that whether the detector and generator were trained on the same data is not critically important to detection success.
For instance, the OPT-125M model has an AUC of 0.81 in detecting ChatGPT generations, whereas GPT-J-6B, a larger model from the GPT family, has an AUC of 0.45.
arXiv Detail & Related papers (2023-05-17T00:09:08Z)
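A minimal sketch of a likelihood-based detector in this spirit, scoring text by its (negated) average token loss under a small model and evaluating with AUC; the paper's exact detection statistic may differ:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.metrics import roc_auc_score

# A small, cheap detector model, in the spirit of the paper's finding.
tok = AutoTokenizer.from_pretrained("facebook/opt-125m")
lm = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

def detection_score(text):
    """Higher score = more 'machine-like' in this sketch: machine text
    tends to have LOWER loss under an LM, so the loss is negated."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss
    return -loss.item()

# Hypothetical evaluation, with labels 1 = machine, 0 = human:
# auc = roc_auc_score(labels, [detection_score(t) for t in texts])
```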
- ChatGPT or academic scientist? Distinguishing authorship with over 99% accuracy using off-the-shelf machine learning tools [0.0]
ChatGPT has enabled access to AI-generated writing for the masses.
The need to discriminate human writing from AI-generated text is now both critical and urgent.
We developed a method for discriminating text generated by ChatGPT from (human) academic scientists.
arXiv Detail & Related papers (2023-03-28T23:16:00Z)
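The paper above reports off-the-shelf tooling over handcrafted writing features; a hedged sketch of that style of pipeline, with illustrative features rather than the paper's actual feature set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def style_features(text):
    """A few illustrative stylometric features; the paper's actual
    feature set is different and more extensive."""
    sentences = [s for s in text.split(".") if s.strip()]
    words = text.split()
    return np.array([
        len(words) / max(len(sentences), 1),           # mean sentence length
        text.count(",") / max(len(words), 1),          # comma rate
        sum(w[0].isupper() for w in words) / max(len(words), 1),
    ])

# Hypothetical training on labelled texts:
# X = np.stack([style_features(t) for t in texts])
# clf = LogisticRegression().fit(X, labels)
```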
- Twist Decoding: Diverse Generators Guide Each Other [116.20780037268801]
We introduce Twist decoding, a simple and general inference algorithm that generates text while benefiting from diverse models.
Our method does not assume the vocabulary, tokenization or even generation order is shared.
arXiv Detail & Related papers (2022-05-19T01:27:53Z)
- Are You Robert or RoBERTa? Deceiving Online Authorship Attribution Models Using Neural Text Generators [3.9533044769534444]
GPT-2 and XLM language models are used to generate texts using existing posts of online users.
We then examine whether these AI-based text generators are capable of mimicking authorial style to such a degree that they can deceive typical AA models.
Our findings highlight the current capacity of powerful natural language models to generate original online posts capable of mimicking authorial style.
arXiv Detail & Related papers (2022-03-18T09:19:14Z)
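A minimal sketch of the generation side of that setup, conditioning GPT-2 on a user's existing posts by simple prompt concatenation (the paper's conditioning may differ):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def mimic_post(user_posts, max_new_tokens=60):
    """Prompt GPT-2 with a user's previous posts and sample a new one.

    Simple concatenation as the conditioning strategy is an assumption.
    """
    prompt = "\n".join(user_posts) + "\n"
    ids = tok(prompt, return_tensors="pt").input_ids
    out = lm.generate(ids, max_new_tokens=max_new_tokens,
                      do_sample=True, top_p=0.9,
                      pad_token_id=tok.eos_token_id)
    return tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)
```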
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.