A Ship of Theseus: Curious Cases of Paraphrasing in LLM-Generated Texts
- URL: http://arxiv.org/abs/2311.08374v2
- Date: Thu, 6 Jun 2024 23:14:32 GMT
- Title: A Ship of Theseus: Curious Cases of Paraphrasing in LLM-Generated Texts
- Authors: Nafis Irtiza Tripto, Saranya Venkatraman, Dominik Macko, Robert Moro, Ivan Srba, Adaku Uchendu, Thai Le, Dongwon Lee
- Abstract summary: Our research delves into an intriguing question: Does a text retain its original authorship when it undergoes numerous paraphrasing iterations?
Using a computational approach, we discover that the diminishing performance of text classification models with each successive paraphrasing iteration is closely associated with the extent of deviation from the original author's style.
- Score: 11.430810978707173
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the realm of text manipulation and linguistic transformation, the question of authorship has been a subject of fascination and philosophical inquiry. Much like the Ship of Theseus paradox, which ponders whether a ship remains the same when each of its original planks is replaced, our research delves into an intriguing question: Does a text retain its original authorship when it undergoes numerous paraphrasing iterations? Specifically, since Large Language Models (LLMs) have demonstrated remarkable proficiency in both the generation of original content and the modification of human-authored texts, a pivotal question emerges concerning the determination of authorship in instances where LLMs or similar paraphrasing tools are employed to rephrase the text--i.e., whether authorship should be attributed to the original human author or the AI-powered tool. Therefore, we embark on a philosophical voyage through the seas of language and authorship to unravel this intricate puzzle. Using a computational approach, we discover that the diminishing performance in text classification models, with each successive paraphrasing iteration, is closely associated with the extent of deviation from the original author's style, thus provoking a reconsideration of the current notion of authorship.
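The computational setup described in the abstract can be pictured as a simple loop: paraphrase the text repeatedly and track how an authorship classifier's confidence decays. Below is a minimal sketch of that loop, not the authors' actual pipeline; the paraphraser checkpoint name and the `predict_author_proba` stylometric classifier are placeholders.

```python
from transformers import pipeline

# Any seq2seq paraphraser can be dropped in; this checkpoint name is a
# placeholder, not a model used in the paper.
paraphraser = pipeline("text2text-generation", model="your/paraphraser-checkpoint")

def predict_author_proba(text: str) -> float:
    """Placeholder: probability that the original human author wrote
    `text`, as estimated by a pretrained stylometric classifier."""
    raise NotImplementedError

text = "Original human-authored passage goes here."
for i in range(1, 11):  # ten successive paraphrasing iterations
    text = paraphraser(text, max_new_tokens=256)[0]["generated_text"]
    print(f"iteration {i}: P(original author) = {predict_author_proba(text):.3f}")
```

Plotting the per-iteration probabilities reproduces the qualitative trend the abstract describes: the authorship signal fades as the text drifts from the original style.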
Related papers
- A Bayesian Approach to Harnessing the Power of LLMs in Authorship Attribution [57.309390098903]
Authorship attribution aims to identify the origin or author of a document.
Large Language Models (LLMs) with their deep reasoning capabilities and ability to maintain long-range textual associations offer a promising alternative.
Our results on the IMDb and blog datasets show an impressive 85% accuracy in one-shot authorship classification across ten authors (the scoring idea is sketched below).
arXiv Detail & Related papers (2024-10-29T04:14:23Z)
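A hedged sketch of the Bayesian idea behind the entry above, not the paper's exact procedure: score each candidate author by the likelihood a language model assigns to the questioned document when conditioned on that author's known writing, then take the argmax, which is Bayes' rule under a uniform prior over authors. GPT-2 stands in for whatever LLM is actually used.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def cond_loglik(context: str, text: str) -> float:
    """log p(text | context), summed over the tokens of `text` only."""
    ctx = tok(context, return_tensors="pt").input_ids
    txt = tok(text, return_tensors="pt").input_ids
    ids = torch.cat([ctx, txt], dim=1)
    logits = lm(ids).logits[0, :-1]              # next-token predictions
    targets = ids[0, 1:]
    logp = logits.log_softmax(-1).gather(1, targets[:, None]).squeeze(1)
    return logp[ctx.shape[1] - 1:].sum().item()  # keep only `text` tokens

def attribute(doc: str, exemplars: dict[str, str]) -> str:
    """Pick the author whose exemplar text best explains `doc`."""
    return max(exemplars, key=lambda author: cond_loglik(exemplars[author], doc))
```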
- Are Paraphrases Generated by Large Language Models Invertible? [4.148732457277201]
We consider the problem of paraphrase inversion: given a paraphrased document, attempt to recover the original text.
We fine-tune paraphrase inversion models, both with and without additional author-specific context.
We show that, when starting from paraphrased machine-generated text, we can recover significant portions of the document using a learned inversion model (sketched below).
arXiv Detail & Related papers (2024-10-29T00:46:24Z)
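In its simplest form, the inversion setup described above reduces to fine-tuning a seq2seq model on (paraphrase, original) pairs, optionally prepending author-specific context to the source side. A minimal sketch of one training step, with t5-small, the prompt format, and the toy pair as stand-ins:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
opt = torch.optim.AdamW(model.parameters(), lr=3e-5)

# One toy (paraphrase -> original) pair; real training uses many thousands,
# and author context could be prepended to the source string.
src = "invert paraphrase: The vessel was rebuilt plank by plank."
tgt = "Each plank of the ship was replaced one by one."

batch = tok(src, return_tensors="pt")
labels = tok(tgt, return_tensors="pt").input_ids
loss = model(**batch, labels=labels).loss  # standard teacher-forced CE loss
loss.backward()
opt.step()
```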
- (Perhaps) Beyond Human Translation: Harnessing Multi-Agent Collaboration for Translating Ultra-Long Literary Texts [52.18246881218829]
We introduce a novel multi-agent framework based on large language models (LLMs) for literary translation, implemented as a company called TransAgents.
To evaluate the effectiveness of our system, we propose two innovative evaluation strategies: Monolingual Human Preference (MHP) and Bilingual LLM Preference (BLP).
arXiv Detail & Related papers (2024-05-20T05:55:08Z)
- Towards Effective Paraphrasing for Information Disguise [13.356934367660811]
Research on Information Disguise (ID) becomes important when authors' written online communication pertains to sensitive domains.
We propose a framework where, for a given sentence from an author's post, we perform iterative perturbation on the sentence in the direction of paraphrasing.
Our work introduces a novel method of ranking phrase importance using perplexity scores and performs multi-level phrase substitutions via beam search (the ranking step is sketched below).
arXiv Detail & Related papers (2023-11-08T21:12:59Z)
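The phrase-importance ranking from the entry above can be approximated as follows: score each candidate phrase by how much deleting it shifts the sentence's language-model perplexity, and target the highest-impact phrases for substitution first. A sketch under those assumptions (GPT-2 stands in for the paper's LM; the beam search over multi-level substitutions is omitted):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    loss = lm(ids, labels=ids).loss        # mean token cross-entropy
    return torch.exp(loss).item()

def rank_phrases(sentence: str, phrases: list[str]) -> list[str]:
    """Rank phrases by how much removing each one changes perplexity:
    a large shift marks a phrase a disguiser should rewrite first."""
    base = perplexity(sentence)
    return sorted(phrases,
                  key=lambda p: abs(perplexity(sentence.replace(p, "")) - base),
                  reverse=True)
```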
- MAGE: Machine-generated Text Detection in the Wild [82.70561073277801]
Large language models (LLMs) have achieved human-level text generation, emphasizing the need for effective AI-generated text detection.
We build a comprehensive testbed by gathering texts from diverse human writings and texts generated by different LLMs.
Despite challenges, the top-performing detector can identify 86.54% of out-of-domain texts generated by a new LLM, indicating its feasibility in real application scenarios (a generic detector skeleton is sketched below).
arXiv Detail & Related papers (2023-05-22T17:13:29Z)
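A generic human-vs-machine detector of the kind such testbeds evaluate is, at its core, a binary sequence classifier. The sketch below is not MAGE's released detector; the roberta-base checkpoint is a stand-in and would still need fine-tuning on labeled human/machine pairs before its scores mean anything.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
detector = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)  # label 0 = human, 1 = machine (after fine-tuning)

def machine_prob(text: str) -> float:
    batch = tok(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = detector(**batch).logits
    return logits.softmax(-1)[0, 1].item()

print(machine_prob("This passage might have been written by an LLM."))
```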
- Textual Entailment Recognition with Semantic Features from Empirical Text Representation [60.31047947815282]
A text entails a hypothesis if and only if the truth of the hypothesis follows from the text.
In this paper, we propose a novel approach to identifying the textual entailment relationship between text and hypothesis.
We employ an element-wise Manhattan distance vector-based feature that can identify the semantic entailment relationship between the text-hypothesis pair (sketched below).
arXiv Detail & Related papers (2022-10-18T10:03:51Z)
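The feature from the entailment entry above is straightforward to reproduce: embed the text and the hypothesis, take the element-wise absolute difference |u - v|, and feed that vector (whose sum is the Manhattan distance) to a downstream classifier. Mean-pooled BERT states below are a stand-in for the paper's empirical text representation.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def embed(sentence: str) -> torch.Tensor:
    out = enc(**tok(sentence, return_tensors="pt")).last_hidden_state
    return out.mean(dim=1).squeeze(0)      # mean-pool the token vectors

text = "A ship had every plank replaced over the years."
hypothesis = "The ship was modified."
feature = torch.abs(embed(text) - embed(hypothesis))  # element-wise |u - v|
# `feature` goes into any entailment classifier; its sum is the Manhattan distance.
print(feature.sum().item())
```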
- Can You Fool AI by Doing a 180? – A Case Study on Authorship Analysis of Texts by Arata Osada [2.6954666679827137]
This paper is an attempt at answering a twofold question covering the areas of ethics and authorship analysis.
Firstly, we were interested in finding out whether an author identification system could correctly attribute works to authors who, over the course of years, have undergone a major psychological transition.
Secondly, from the point of view of the evolution of an author's ethical values, we examined what it would mean if the authorship attribution system encountered difficulties in detecting single authorship.
arXiv Detail & Related papers (2022-07-19T05:43:49Z)
- Tracing Text Provenance via Context-Aware Lexical Substitution [81.49359106648735]
We propose a natural language watermarking scheme based on context-aware lexical substitution.
Under both objective and subjective metrics, our watermarking scheme can well preserve the semantic integrity of the original sentences (the bit-embedding idea is illustrated below).
arXiv Detail & Related papers (2021-12-15T04:27:33Z)
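The watermarking idea above can be illustrated with a toy scheme: each substitutable word carries one bit, encoded by which member of a synonym pair appears. The real method selects substitution candidates with a context-aware model; the hard-coded synonym table here is only a stand-in.

```python
# Each pair maps a bit-0 form to its bit-1 synonym.
SYNONYMS = {"big": "large", "quick": "fast", "begin": "start"}
BIT1 = set(SYNONYMS.values())  # the second member of each pair encodes 1

def embed_bits(words: list[str], bits: list[int]) -> list[str]:
    pairs = {**SYNONYMS, **{v: k for k, v in SYNONYMS.items()}}
    out, i = [], 0
    for w in words:
        if w in pairs and i < len(bits):
            # keep the word if it already encodes the wanted bit, else swap it
            out.append(w if (w in BIT1) == bool(bits[i]) else pairs[w])
            i += 1
        else:
            out.append(w)
    return out

def extract_bits(words: list[str]) -> list[int]:
    slots = set(SYNONYMS) | BIT1
    return [int(w in BIT1) for w in words if w in slots]

msg = [1, 0, 1]
text = "the big dog made a quick move to begin".split()
marked = embed_bits(text, msg)
assert extract_bits(marked) == msg
```

Because each substitution is a near-synonym, the watermark survives casual reading while remaining recoverable, which is the property the entry's metrics evaluate.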
- InvBERT: Text Reconstruction from Contextualized Embeddings used for Derived Text Formats of Literary Works [1.6058099298620423]
Digital Humanities and Computational Literary Studies apply text mining methods to investigate literature.
Due to copyright restrictions, the availability of relevant digitized literary works is limited.
Our attempts to invert BERT suggest that publishing parts of the encoder together with the contextualized embeddings is critical (the inversion threat is sketched below).
arXiv Detail & Related papers (2021-09-21T11:35:41Z)
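The threat model behind InvBERT can be sketched as learning an inversion head that maps published contextualized embeddings back to token ids; a single linear layer over frozen BERT states is a deliberately simplified stand-in for the paper's approach.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()
head = torch.nn.Linear(bert.config.hidden_size, tok.vocab_size)
opt = torch.optim.AdamW(head.parameters(), lr=1e-3)

ids = tok("call me ishmael", return_tensors="pt").input_ids
with torch.no_grad():                      # the embeddings an attacker would see
    states = bert(ids).last_hidden_state   # shape [1, seq, hidden]

for _ in range(100):                       # fit the inversion head
    loss = torch.nn.functional.cross_entropy(
        head(states).flatten(0, 1), ids.flatten())
    opt.zero_grad(); loss.backward(); opt.step()

recovered = tok.decode(head(states).argmax(-1)[0])
print(recovered)  # ideally reconstructs the original tokens
```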
- Tortured phrases: A dubious writing style emerging in science. Evidence of critical issues affecting established journals [69.76097138157816]
Probabilistic text generators have been used to produce fake scientific papers for more than a decade.
Complex AI-powered generation techniques produce texts indistinguishable from those written by humans.
Some websites offer to rewrite texts for free, generating gobbledegook full of tortured phrases.
arXiv Detail & Related papers (2021-07-12T20:47:08Z)