Investigating the detection of Tortured Phrases in Scientific Literature
- URL: http://arxiv.org/abs/2210.13024v1
- Date: Mon, 24 Oct 2022 08:15:22 GMT
- Title: Investigating the detection of Tortured Phrases in Scientific Literature
- Authors: Puthineath Lay, Martin Lentschat and Cyril Labbé
- Abstract summary: A recent study introduced the concept of 'tortured phrase', an unexpected odd phrase that appears instead of the fixed expression.
The present study aims at investigating how tortured phrases, that are not yet listed, can be detected automatically.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: With the help of online tools, unscrupulous authors can today generate a
pseudo-scientific article and attempt to publish it. Some of these tools work
by replacing or paraphrasing existing texts to produce new content, but they
have a tendency to generate nonsensical expressions. A recent study introduced
the concept of 'tortured phrase', an unexpected odd phrase that appears instead
of the fixed expression. E.g. counterfeit consciousness instead of artificial
intelligence. The present study aims at investigating how tortured phrases,
that are not yet listed, can be detected automatically. We conducted several
experiments, including non-neural binary classification, neural binary
classification and cosine similarity comparison of the phrase tokens, yielding
noticeable results.
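The abstract does not say how the cosine similarity comparison of phrase tokens is carried out. As a minimal sketch of one plausible reading, the following compares a candidate phrase with an expected fixed expression token by token. The embedding values, the `TOY_EMBEDDINGS` table, and the `tokenwise_similarity` helper are hypothetical, made up for illustration, and are not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional embeddings, invented for this example.
# A real experiment would use vectors from a trained embedding model.
TOY_EMBEDDINGS = {
    "artificial": [0.9, 0.1, 0.0],
    "counterfeit": [0.7, 0.0, 0.6],
    "intelligence": [0.1, 0.9, 0.1],
    "consciousness": [0.2, 0.7, 0.5],
}


def tokenwise_similarity(candidate, reference, embeddings):
    """Compare two phrases position by position (assumes equal token counts)."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if len(cand_tokens) != len(ref_tokens):
        raise ValueError("phrases must have the same number of tokens")
    return [
        cosine_similarity(embeddings[c], embeddings[r])
        for c, r in zip(cand_tokens, ref_tokens)
    ]


if __name__ == "__main__":
    sims = tokenwise_similarity(
        "counterfeit consciousness", "artificial intelligence", TOY_EMBEDDINGS
    )
    for score in sims:
        print(round(score, 3))
```

With these toy vectors, each token of the tortured phrase is close to, but not identical to, the corresponding token of the fixed expression. Whether and how such scores are thresholded is a design choice this sketch leaves open.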
Related papers
- An Analysis of BPE Vocabulary Trimming in Neural Machine Translation [56.383793805299234]
Vocabulary trimming is a postprocessing step that replaces rare subwords with their component subwords.
We show that vocabulary trimming fails to improve performance and is even prone to incurring heavy degradation.
arXiv Detail & Related papers (2024-03-30T15:29:49Z)
- Forging the Forger: An Attempt to Improve Authorship Verification via Data Augmentation [52.72682366640554]
Authorship Verification (AV) is a text classification task concerned with inferring whether a candidate text has been written by one specific author or by someone else.
It has been shown that many AV systems are vulnerable to adversarial attacks, where a malicious author actively tries to fool the classifier by either concealing their writing style, or by imitating the style of another author.
arXiv Detail & Related papers (2024-03-17T16:36:26Z)
- Detection of tortured phrases in scientific literature [0.0]
This paper presents various automatic detection methods to extract so-called tortured phrases from scientific papers.
With a recall value of 0.87 and a precision value of 0.61, it could retrieve new tortured phrases to be submitted to domain experts for validation.
arXiv Detail & Related papers (2024-02-02T08:15:43Z)
- Towards Effective Paraphrasing for Information Disguise [13.356934367660811]
Research on Information Disguise (ID) becomes important when authors' written online communication pertains to sensitive domains.
We propose a framework where, for a given sentence from an author's post, we perform iterative perturbation on the sentence in the direction of paraphrasing.
Our work introduces a novel method of phrase-importance rankings using perplexity scores and involves multi-level phrase substitutions via beam search.
arXiv Detail & Related papers (2023-11-08T21:12:59Z)
- Verifying the Robustness of Automatic Credibility Assessment [79.08422736721764]
Text classification methods have been widely investigated as a way to detect content of low credibility.
In some cases insignificant changes in input text can mislead the models.
We introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
- What artificial intelligence might teach us about the origin of human language [91.3755431537592]
This study explores a pattern emerging from research that combines artificial intelligence with sound symbolism.
Machine learning algorithms are efficient learners of sound symbolism, but they tend to bias one category over the other.
arXiv Detail & Related papers (2023-01-15T23:25:29Z)
- Paraphrase Identification with Deep Learning: A Review of Datasets and Methods [1.4325734372991794]
We investigate how the under-representation of certain paraphrase types in popular datasets affects the ability to detect plagiarism.
We introduce and validate a new refined typology for paraphrases.
We propose new directions for future research and dataset development to enhance AI-based paraphrase detection.
arXiv Detail & Related papers (2022-12-13T23:06:20Z)
- Studying word order through iterative shuffling [14.530986799844873]
We show that word order encodes meaning essential to performing NLP benchmark tasks.
We use IBIS, a novel, efficient procedure that finds the ordering of a bag of words having the highest likelihood under a fixed language model.
We discuss how shuffling inference procedures such as IBIS can benefit language modeling and constrained generation.
arXiv Detail & Related papers (2021-09-10T13:27:06Z)
- Tortured phrases: A dubious writing style emerging in science. Evidence of critical issues affecting established journals [69.76097138157816]
Probabilistic text generators have been used to produce fake scientific papers for more than a decade.
Complex AI-powered generation techniques produce texts indistinguishable from that of humans.
Some websites offer to rewrite texts for free, generating gobbledegook full of tortured phrases.
arXiv Detail & Related papers (2021-07-12T20:47:08Z)
- UCPhrase: Unsupervised Context-aware Quality Phrase Tagging [63.86606855524567]
UCPhrase is a novel unsupervised context-aware quality phrase tagger.
We induce high-quality phrase spans as silver labels from consistently co-occurring word sequences.
We show that our design is superior to state-of-the-art pre-trained, unsupervised, and distantly supervised methods.
arXiv Detail & Related papers (2021-05-28T19:44:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.