Tortured phrases: A dubious writing style emerging in science. Evidence
of critical issues affecting established journals
- URL: http://arxiv.org/abs/2107.06751v1
- Date: Mon, 12 Jul 2021 20:47:08 GMT
- Title: Tortured phrases: A dubious writing style emerging in science. Evidence
of critical issues affecting established journals
- Authors: Guillaume Cabanac and Cyril Labbé and Alexander Magazinov
- Abstract summary: Probabilistic text generators have been used to produce fake scientific papers for more than a decade.
Complex AI-powered generation techniques produce texts indistinguishable from those of humans.
Some websites offer to rewrite texts for free, generating gobbledegook full of tortured phrases.
- Score: 69.76097138157816
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Probabilistic text generators have been used to produce fake scientific
papers for more than a decade. Such nonsensical papers are easily detected by
both humans and machines. Now, more complex AI-powered generation techniques
produce texts indistinguishable from those written by humans, and the generation of
scientific texts from a few keywords has been documented. Our study introduces
the concept of tortured phrases: unexpected weird phrases in lieu of
established ones, such as 'counterfeit consciousness' instead of 'artificial
intelligence.' We combed the literature for tortured phrases and studied one
reputable journal where they concentrate en masse. Hypothesising the use of
advanced language models, we ran a detector on the abstracts of recent articles
of this journal and on several control sets. The pairwise comparisons reveal a
concentration of abstracts flagged as 'synthetic' in the journal. We also
highlight irregularities in its operation, such as abrupt changes in editorial
timelines. We substantiate our call for investigation by analysing several
individual dubious articles, stressing questionable features: tortured writing
style, citation of non-existent literature, and unacknowledged image reuse.
Surprisingly, some websites offer to rewrite texts for free, generating
gobbledegook full of tortured phrases. We believe some authors used rewritten
texts to pad their manuscripts. We wish to raise awareness of publications
containing such questionable AI-generated or rewritten texts that passed (poor)
peer review. Deception with synthetic texts threatens the integrity of the
scientific literature.
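The screening approach the abstract describes can be sketched as a simple lookup: flag any text containing a known "fingerprint" phrase that substitutes for an established term. This is a minimal illustration, not the study's actual detector; the phrase list below is a small illustrative sample (only 'counterfeit consciousness' appears in this abstract, the others are assumed examples of the same pattern).

```python
# Minimal sketch of tortured-phrase screening: flag texts containing
# "fingerprint" phrases that replace established scientific terms.
# The phrase list is illustrative; a real screen would use a far larger one.

TORTURED_PHRASES = {
    "counterfeit consciousness": "artificial intelligence",
    "profound learning": "deep learning",
    "irregular backwoods": "random forest",
    "flag to commotion": "signal to noise",
}

def flag_tortured_phrases(text: str) -> list[tuple[str, str]]:
    """Return (tortured phrase, established phrase) pairs found in text."""
    lowered = text.lower()
    return [(bad, good) for bad, good in TORTURED_PHRASES.items()
            if bad in lowered]

hits = flag_tortured_phrases(
    "We apply counterfeit consciousness and profound learning to the task."
)
# hits == [("counterfeit consciousness", "artificial intelligence"),
#          ("profound learning", "deep learning")]
```

A substring match like this only catches phrases already on the list; detecting previously unlisted tortured phrases (the goal of the related work below) requires statistical methods.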
Related papers
- A Ship of Theseus: Curious Cases of Paraphrasing in LLM-Generated Texts [11.430810978707173]
Our research delves into an intriguing question: does a text retain its original authorship when it undergoes numerous rounds of paraphrasing?
Using a computational approach, we find that the diminishing performance of text classification models with each successive round of paraphrasing is closely associated with the extent of deviation from the original author's style.
arXiv Detail & Related papers (2023-11-14T18:40:42Z)
- Towards Effective Paraphrasing for Information Disguise [13.356934367660811]
Research on Information Disguise (ID) becomes important when authors' written online communication pertains to sensitive domains.
We propose a framework where, for a given sentence from an author's post, we perform iterative perturbation on the sentence in the direction of paraphrasing.
Our work introduces a novel method of phrase-importance rankings using perplexity scores and involves multi-level phrase substitutions via beam search.
arXiv Detail & Related papers (2023-11-08T21:12:59Z)
- Cited Text Spans for Citation Text Generation [12.039469573641217]
An automatic citation generation system aims to concisely and accurately describe the relationship between two scientific articles.
Due to the length of scientific documents, existing abstractive approaches have conditioned only on cited paper abstracts.
We propose to condition instead on the cited text span (CTS) as an alternative to the abstract.
arXiv Detail & Related papers (2023-09-12T16:28:36Z)
- MAGE: Machine-generated Text Detection in the Wild [82.70561073277801]
Large language models (LLMs) have achieved human-level text generation, emphasizing the need for effective AI-generated text detection.
We build a comprehensive testbed by gathering texts from diverse human writings and texts generated by different LLMs.
Despite challenges, the top-performing detector can identify 86.54% of out-of-domain texts generated by a new LLM, indicating feasibility in application scenarios.
arXiv Detail & Related papers (2023-05-22T17:13:29Z)
- Synthetically generated text for supervised text analysis [5.71097144710995]
I provide a conceptual overview of text generation, guidance on when researchers should prefer different techniques for generating synthetic text, a discussion of ethics, and a simple technique for improving the quality of synthetic text.
I demonstrate the usefulness of synthetic text with three applications: generating synthetic tweets describing the fighting in Ukraine, synthetic news articles describing specified political events for training an event detection system, and a multilingual corpus of populist manifesto statements for training a sentence-level populism classifier.
arXiv Detail & Related papers (2023-03-28T14:55:13Z)
- Can AI-Generated Text be Reliably Detected? [54.670136179857344]
Unregulated use of LLMs can potentially lead to malicious consequences such as plagiarism, generating fake news, spamming, etc.
Recent works attempt to tackle this problem either using certain model signatures present in the generated text outputs or by applying watermarking techniques.
In this paper, we show that these detectors are not reliable in practical scenarios.
arXiv Detail & Related papers (2023-03-17T17:53:19Z)
- Verifying the Robustness of Automatic Credibility Assessment [50.55687778699995]
We show that meaning-preserving changes in input text can mislead the models.
We also introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
- CiteBench: A benchmark for Scientific Citation Text Generation [69.37571393032026]
CiteBench is a benchmark for citation text generation.
We make the code for CiteBench publicly available at https://github.com/UKPLab/citebench.
arXiv Detail & Related papers (2022-12-19T16:10:56Z)
- Investigating the detection of Tortured Phrases in Scientific Literature [0.0]
A recent study introduced the concept of the 'tortured phrase': an unexpected, odd phrase that appears in place of an established expression.
The present study investigates how tortured phrases that are not yet listed can be detected automatically.
arXiv Detail & Related papers (2022-10-24T08:15:22Z)
- Towards generating citation sentences for multiple references with intent control [86.53829532976303]
We build a novel generation model with the Fusion-in-Decoder approach to cope with multiple long inputs.
Experiments demonstrate that the proposed approaches provide much more comprehensive features for generating citation sentences.
arXiv Detail & Related papers (2021-12-02T15:32:24Z)
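The perplexity-based phrase-importance ranking mentioned in the Information Disguise entry above can be sketched in miniature. A real system would score candidate phrases with a neural language model; here a Laplace-smoothed unigram model over a toy reference corpus stands in, purely for illustration (the corpus and function names are assumptions, not from the paper).

```python
import math
from collections import Counter

# Toy sketch of perplexity-based phrase-importance ranking: phrases that are
# more surprising under a reference language model rank as more "important".
# A unigram model over a tiny corpus stands in for a neural LM here.

REFERENCE = "the quick brown fox jumps over the lazy dog the fox".split()
counts = Counter(REFERENCE)
total = sum(counts.values())
vocab = len(counts)

def unigram_logprob(word: str) -> float:
    """Laplace-smoothed unigram log-probability under the reference corpus."""
    return math.log((counts[word] + 1) / (total + vocab))

def phrase_perplexity(phrase: str) -> float:
    """Per-word perplexity of a phrase: exp of negated mean log-probability."""
    words = phrase.split()
    avg_logprob = sum(unigram_logprob(w) for w in words) / len(words)
    return math.exp(-avg_logprob)

def rank_phrases(phrases: list[str]) -> list[str]:
    """Rank phrases from most to least surprising."""
    return sorted(phrases, key=phrase_perplexity, reverse=True)
```

With this model, an out-of-corpus phrase like "purple zeppelin" outranks a common one like "the fox", mirroring the intuition that rare phrasing carries more author-identifying signal.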
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.