Spot the bot: Coarse-Grained Partition of Semantic Paths for Bots and Humans
- URL: http://arxiv.org/abs/2402.17392v1
- Date: Tue, 27 Feb 2024 10:38:37 GMT
- Title: Spot the bot: Coarse-Grained Partition of Semantic Paths for Bots and Humans
- Authors: Vasilii A. Gromov, Alexandra S. Kogan
- Abstract summary: This paper focuses on comparing structures of the coarse-grained partitions of semantic paths for human-written and bot-generated texts.
As the semantic structure may be different for different languages, we investigate Russian, English, German, and Vietnamese languages.
- Score: 55.2480439325792
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Nowadays, technology is rapidly advancing: bots are writing comments,
articles, and reviews. Due to this fact, it is crucial to know if the text was
written by a human or by a bot. This paper focuses on comparing structures of
the coarse-grained partitions of semantic paths for human-written and
bot-generated texts. We compare the clusterizations of datasets of n-grams from
literary texts and texts generated by several bots. The hypothesis is that the
structures and clusterizations are different. Our research supports the
hypothesis. As the semantic structure may be different for different languages,
we investigate Russian, English, German, and Vietnamese languages.
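The abstract describes comparing the cluster structure of n-gram datasets from human and bot texts, but does not spell out a pipeline here. As a rough, hypothetical sketch only, the synthetic points below stand in for embedded n-grams, and a hand-rolled k-means plus a simple compactness ratio stand in for the authors' actual clustering and comparison tools:

```python
# Hypothetical sketch: compare the cluster structure of two point sets that
# stand in for embedded n-grams from human-written vs. bot-generated texts.
# The data, the k-means variant, and the compactness ratio are illustrative
# stand-ins, not the paper's method.
import numpy as np

def farthest_point_init(points, k, rng):
    """Pick a random first center, then repeatedly take the point
    farthest from all chosen centers (a common k-means seeding trick)."""
    centers = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    return np.array(centers)

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm: assign to nearest center, recompute means."""
    rng = np.random.default_rng(seed)
    centers = farthest_point_init(points, k, rng)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):                  # keep old center if cluster empties
                centers[j] = members.mean(axis=0)
    return labels, centers

def compactness_ratio(points, k=3):
    """Mean distance to own center divided by mean inter-center distance;
    smaller values mean tighter, better-separated clusters."""
    labels, centers = kmeans(points, k)
    within = np.mean(np.linalg.norm(points - centers[labels], axis=1))
    between = np.mean([np.linalg.norm(a - b)
                       for i, a in enumerate(centers) for b in centers[i + 1:]])
    return within / between

rng = np.random.default_rng(1)
diffuse = rng.normal(0.0, 2.0, size=(300, 8))            # one amorphous cloud
clumped = np.vstack([rng.normal(c, 0.3, size=(100, 8))   # three tight clumps
                     for c in (-3.0, 0.0, 3.0)])
print(compactness_ratio(clumped), compactness_ratio(diffuse))
```

On data like this, the clumped set yields a much smaller ratio than the diffuse one, which is the kind of structural difference between bot and human texts the paper investigates.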
Related papers
- Detecting Machine-Generated Long-Form Content with Latent-Space Variables [54.07946647012579]
Existing zero-shot detectors primarily focus on token-level distributions, which are vulnerable to real-world domain shifts.
We propose a more robust method that incorporates abstract elements, such as event transitions, as key deciding factors to detect machine versus human texts.
arXiv Detail & Related papers (2024-10-04T18:42:09Z)
- Sentiment analysis and random forest to classify LLM versus human source applied to Scientific Texts [0.0]
A new methodology is proposed to classify texts as coming from an automatic text-production engine or from a human.
Using four different sentiment lexicons, a number of new features were produced and then fed to a random forest classifier to train the model.
Results suggest this is a promising research line for detecting fraud in environments where humans are supposed to be the source of the texts.
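The feature side of the lexicon-plus-random-forest pipeline just described can be sketched as follows; note the four tiny lexicons below are toy stand-ins invented for illustration, not the lexicons used in the paper, and the random-forest step itself is omitted:

```python
# Hypothetical sketch of lexicon-based feature extraction: for each lexicon,
# compute the fraction of tokens in the text that appear in it. The resulting
# feature vector would then be fed to a random forest classifier (omitted).
TOY_LEXICONS = {
    "positive_a": {"good", "great", "excellent", "novel"},
    "negative_a": {"bad", "poor", "flawed", "weak"},
    "positive_b": {"robust", "significant", "promising"},
    "negative_b": {"inconclusive", "limited", "unclear"},
}

def sentiment_features(text):
    """Return one normalized hit-rate feature per lexicon."""
    tokens = text.lower().split()
    n = max(len(tokens), 1)
    return {name: sum(t in lex for t in tokens) / n
            for name, lex in TOY_LEXICONS.items()}

feats = sentiment_features("A novel and promising method with good results")
```

Each text thus becomes a small fixed-length numeric vector, which is the kind of input a random forest handles well.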
arXiv Detail & Related papers (2024-04-05T16:14:36Z)
- Spot the Bot: Distinguishing Human-Written and Bot-Generated Texts Using Clustering and Information Theory Techniques [0.0]
We propose a bot identification algorithm based on unsupervised learning techniques.
We find that the generated texts tend to be more chaotic while literary works are more complex.
We also demonstrate that the clustering of human texts results in fuzzier clusters in comparison to the more compact and well-separated clusters of bot-generated texts.
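One simple information-theoretic proxy for the "chaotic vs. complex" contrast mentioned above is the Shannon entropy of a text's empirical bigram distribution; this is an illustration only, not the measure used in the paper:

```python
# Hypothetical illustration: Shannon entropy of a text's bigram distribution.
# Higher values indicate a flatter, more "chaotic" distribution of bigrams.
from collections import Counter
from math import log2

def bigram_entropy(text):
    """Shannon entropy (in bits) of the empirical bigram distribution."""
    tokens = text.lower().split()
    counts = Counter(zip(tokens, tokens[1:]))
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

print(bigram_entropy("to be or not to be"))
```

A fully repetitive text scores zero, while texts with more varied bigrams score higher; comparing such statistics across corpora is one way to quantify how "chaotic" generated text is relative to literary text.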
arXiv Detail & Related papers (2023-11-19T22:29:15Z)
- Bot or Human? Detecting ChatGPT Imposters with A Single Question [29.231261118782925]
Large language models (LLMs) have recently demonstrated impressive capabilities in natural language understanding and generation.
There is a concern that they can be misused for malicious purposes, such as fraud or denial-of-service attacks.
We propose a framework named FLAIR, Finding Large Language Model Authenticity via a Single Inquiry and Response, to detect conversational bots in an online manner.
arXiv Detail & Related papers (2023-05-10T19:09:24Z)
- A comparison of several AI techniques for authorship attribution on Romanian texts [0.0]
We compare AI techniques for classifying literary texts written by multiple authors.
We also introduce a new dataset composed of texts written in the Romanian language on which we have run the algorithms.
arXiv Detail & Related papers (2022-11-09T20:24:48Z)
- Textual Entailment Recognition with Semantic Features from Empirical Text Representation [60.31047947815282]
A text entails a hypothesis if and only if the truth of the hypothesis follows from the text.
In this paper, we propose a novel approach to identifying the textual entailment relationship between text and hypothesis.
We employ an element-wise Manhattan distance vector-based feature that can identify the semantic entailment relationship between the text-hypothesis pair.
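The element-wise Manhattan distance feature can be sketched minimally: instead of collapsing |u - v| into a single scalar distance, the whole vector is kept so a downstream classifier can weight each dimension separately. The embeddings below are made-up stand-ins for real sentence encodings:

```python
# Minimal sketch of an element-wise Manhattan distance feature between a
# text embedding and a hypothesis embedding. The vectors are toy stand-ins.
import numpy as np

def manhattan_feature(u, v):
    """Element-wise absolute difference between two embedding vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return np.abs(u - v)

text_vec = np.array([0.2, -0.5, 0.9, 0.1])    # stand-in text embedding
hyp_vec = np.array([0.1, -0.5, 0.4, -0.3])    # stand-in hypothesis embedding
feature = manhattan_feature(text_vec, hyp_vec)
```

Dimensions where the pair agrees contribute near-zero entries, so the feature vector localizes where the two representations diverge.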
arXiv Detail & Related papers (2022-10-18T10:03:51Z)
- How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN [63.79300884115027]
Current language models can generate high-quality text.
Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions?
We introduce RAVEN, a suite of analyses for assessing the novelty of generated text.
arXiv Detail & Related papers (2021-11-18T04:07:09Z)
- Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions on their positions.
Experiments on Semantic Textual Similarity show the proposed distance, NDD, to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
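The masked-language-model step of the mask-and-predict strategy above needs a pretrained model, but the preceding step, extracting the words shared between the two texts, can be sketched with the standard library's difflib (whose matching blocks approximate a longest common sequence); each extracted position would then be masked and scored by the MLM, which is omitted here:

```python
# Sketch of the shared-word extraction in a mask-and-predict pipeline.
# difflib finds matching contiguous blocks between the two token lists,
# approximating their longest common sequence; the MLM scoring of each
# shared position is omitted.
from difflib import SequenceMatcher

def shared_sequence_words(a, b):
    """Words appearing in difflib's matching blocks between two texts."""
    ta, tb = a.split(), b.split()
    matcher = SequenceMatcher(a=ta, b=tb, autojunk=False)
    words = []
    for block in matcher.get_matching_blocks():
        words.extend(ta[block.a:block.a + block.size])
    return words

shared = shared_sequence_words("the cat sat on the mat", "the dog sat on a mat")
```

For the pair above this yields the shared words "the", "sat", "on", "mat"; masking those positions and comparing the predicted distributions is what makes the distance sensitive on highly overlapped pairs.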
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
- Tortured phrases: A dubious writing style emerging in science. Evidence of critical issues affecting established journals [69.76097138157816]
Probabilistic text generators have been used to produce fake scientific papers for more than a decade.
Complex AI-powered generation techniques produce texts indistinguishable from those written by humans.
Some websites offer to rewrite texts for free, generating gobbledegook full of tortured phrases.
arXiv Detail & Related papers (2021-07-12T20:47:08Z)
- Detecting Bot-Generated Text by Characterizing Linguistic Accommodation in Human-Bot Interactions [9.578008322407928]
The democratization of language generation models makes it easier to generate human-like text at scale for nefarious activities.
It is essential to understand how people interact with bots and develop methods to detect bot-generated text.
This paper shows that bot-generated text detection methods are more robust across datasets and models if we use information about how people respond to it.
arXiv Detail & Related papers (2021-06-02T14:10:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.