Spot the Bot: Distinguishing Human-Written and Bot-Generated Texts Using
Clustering and Information Theory Techniques
- URL: http://arxiv.org/abs/2311.11441v1
- Date: Sun, 19 Nov 2023 22:29:15 GMT
- Title: Spot the Bot: Distinguishing Human-Written and Bot-Generated Texts Using
Clustering and Information Theory Techniques
- Authors: Vasilii Gromov and Quynh Nhu Dang
- Abstract summary: We propose a bot identification algorithm based on unsupervised learning techniques.
We find that the generated texts tend to be more chaotic while literary works are more complex.
We also demonstrate that the clustering of human texts results in fuzzier clusters in comparison to the more compact and well-separated clusters of bot-generated texts.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the development of generative models like GPT-3, it is increasingly
challenging to differentiate generated texts from human-written ones. There is
a large number of studies that have demonstrated good results in bot
identification. However, the majority of such works depend on supervised
learning methods that require labelled data and/or prior knowledge about the
bot-model architecture. In this work, we propose a bot identification algorithm
that is based on unsupervised learning techniques and does not depend on a
large amount of labelled data. By combining findings from semantic analysis via
clustering (crisp and fuzzy) with information-theoretic techniques, we construct a
robust model that detects generated text from different types of bots. We find that
the generated texts tend to be more chaotic while literary works are more
complex. We also demonstrate that the clustering of human texts results in
fuzzier clusters in comparison to the more compact and well-separated clusters
of bot-generated texts.
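The contrast between "fuzzier" and "compact and well-separated" clusters can be illustrated with a small sketch. Everything below is an assumption for illustration only: the abstract does not say which features, clustering implementation, or fuzziness measure the authors use. Here synthetic 2-D points stand in for text features, a plain NumPy fuzzy c-means stands in for the fuzzy clustering step, and the partition coefficient (mean of squared memberships) stands in as one common fuzziness measure. The names `fuzzy_c_means`, `partition_coefficient`, and `two_blobs` are hypothetical helpers.

```python
import numpy as np


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, seed=0):
    """Plain fuzzy c-means; returns (centers, memberships)."""
    rng = np.random.default_rng(seed)
    # Random initial memberships, each row normalized to sum to 1.
    u = rng.random((len(points), n_clusters))
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Centers are membership-weighted means of the points.
        w = u ** m
        centers = (w.T @ points) / w.sum(axis=0)[:, None]
        # Distance from every point to every center (floored to avoid 1/0).
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)
        # Standard membership update: proportional to dist^(-2/(m-1)).
        inv = dist ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)
    return centers, u


def partition_coefficient(u):
    """Mean squared membership: 1.0 for crisp clusters, 1/c when fully fuzzy."""
    return float(np.mean(np.sum(u ** 2, axis=1)))


def two_blobs(spread, n_per_blob=200, seed=0):
    """Two synthetic Gaussian blobs (a stand-in for real text features)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(loc=(0.0, 0.0), scale=spread, size=(n_per_blob, 2))
    b = rng.normal(loc=(6.0, 0.0), scale=spread, size=(n_per_blob, 2))
    return np.vstack([a, b])


# Tight blobs mimic "compact and well-separated" clusters;
# wide, overlapping blobs mimic "fuzzier" clusters.
_, u_compact = fuzzy_c_means(two_blobs(spread=0.3), n_clusters=2)
_, u_diffuse = fuzzy_c_means(two_blobs(spread=3.0), n_clusters=2)

pc_compact = partition_coefficient(u_compact)
pc_diffuse = partition_coefficient(u_diffuse)
print(f"partition coefficient, compact blobs: {pc_compact:.3f}")
print(f"partition coefficient, diffuse blobs: {pc_diffuse:.3f}")
```

A higher partition coefficient means memberships are closer to 0 or 1, so the tight blobs score higher than the overlapping ones. Whether a comparable statistic separates human-written from bot-generated texts depends on the feature representation, which this sketch does not model.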
Related papers
- Spot the bot: Coarse-Grained Partition of Semantic Paths for Bots and
Humans [55.2480439325792]
This paper focuses on comparing structures of the coarse-grained partitions of semantic paths for human-written and bot-generated texts.
As the semantic structure may be different for different languages, we investigate Russian, English, German, and Vietnamese languages.
arXiv Detail & Related papers (2024-02-27T10:38:37Z)
- Text2Data: Low-Resource Data Generation with Textual Control [104.38011760992637]
Natural language serves as a common and straightforward control signal for humans to interact seamlessly with machines.
We propose Text2Data, a novel approach that utilizes unlabeled data to understand the underlying data distribution through an unsupervised diffusion model.
It undergoes controllable finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting.
arXiv Detail & Related papers (2024-02-08T03:41:39Z)
- The Imitation Game: Detecting Human and AI-Generated Texts in the Era of
ChatGPT and BARD [3.2228025627337864]
We introduce a novel dataset of human-written and AI-generated texts in different genres.
We employ several machine learning models to classify the texts.
Results demonstrate the efficacy of these models in discerning between human and AI-generated text.
arXiv Detail & Related papers (2023-07-22T21:00:14Z)
- Distinguishing Human Generated Text From ChatGPT Generated Text Using
Machine Learning [0.251657752676152]
This paper presents a machine learning-based solution that can distinguish ChatGPT-generated text from human-written text.
We have tested the proposed model on a Kaggle dataset consisting of 10,000 texts out of which 5,204 texts were written by humans and collected from news and social media.
On the corpus generated by GPT-3.5, the proposed algorithm presents an accuracy of 77%.
arXiv Detail & Related papers (2023-05-26T09:27:43Z)
- On the Possibilities of AI-Generated Text Detection [76.55825911221434]
We argue that as machine-generated text approaches human-like quality, the sample size needed for detection increases.
We test various state-of-the-art text generators, including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against detectors including RoBERTa-Large/Base-Detector and GPTZero.
arXiv Detail & Related papers (2023-04-10T17:47:39Z)
- Paraphrasing evades detectors of AI-generated text, but retrieval is an
effective defense [56.077252790310176]
We present a paraphrase generation model (DIPPER) that can paraphrase paragraphs, condition on surrounding context, and control lexical diversity and content reordering.
Using DIPPER to paraphrase text generated by three large language models (including GPT3.5-davinci-003) successfully evades several detectors, including watermarking.
We introduce a simple defense that relies on retrieving semantically-similar generations and must be maintained by a language model API provider.
arXiv Detail & Related papers (2023-03-23T16:29:27Z)
- A Deep Learning Anomaly Detection Method in Textual Data [0.45687771576879593]
We propose using deep learning and transformer architectures combined with classical machine learning algorithms.
We used multiple machine learning methods, such as Sentence Transformers, autoencoders, logistic regression, and distance calculation methods, to predict anomalies.
arXiv Detail & Related papers (2022-11-25T05:18:13Z)
- BeCAPTCHA-Type: Biometric Keystroke Data Generation for Improved Bot
Detection [63.447493500066045]
This work proposes a data driven learning model for the synthesis of keystroke biometric data.
The proposed method is compared with two statistical approaches based on Universal and User-dependent models.
Our experimental framework considers a dataset with 136 million keystroke events from 168 thousand subjects.
arXiv Detail & Related papers (2022-07-27T09:26:15Z)
- Detecting Bot-Generated Text by Characterizing Linguistic Accommodation
in Human-Bot Interactions [9.578008322407928]
The democratization of language generation models makes it easier to generate human-like text at scale for nefarious activities.
It is essential to understand how people interact with bots and develop methods to detect bot-generated text.
This paper shows that bot-generated text detection methods are more robust across datasets and models if we use information about how people respond to it.
arXiv Detail & Related papers (2021-06-02T14:10:28Z)
- Detection of Novel Social Bots by Ensembles of Specialized Classifiers [60.63582690037839]
Malicious actors create inauthentic social media accounts controlled in part by algorithms, known as social bots, to disseminate misinformation and agitate online discussion.
We show that different types of bots are characterized by different behavioral features.
We propose a new supervised learning method that trains classifiers specialized for each class of bots and combines their decisions through the maximum rule.
arXiv Detail & Related papers (2020-06-11T22:59:59Z)