Spot the Bot: Distinguishing Human-Written and Bot-Generated Texts Using
Clustering and Information Theory Techniques
- URL: http://arxiv.org/abs/2311.11441v1
- Date: Sun, 19 Nov 2023 22:29:15 GMT
- Authors: Vasilii Gromov and Quynh Nhu Dang
- Abstract summary: We propose a bot identification algorithm based on unsupervised learning techniques.
We find that the generated texts tend to be more chaotic while literary works are more complex.
We also demonstrate that the clustering of human texts results in fuzzier clusters in comparison to the more compact and well-separated clusters of bot-generated texts.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the development of generative models like GPT-3, it is increasingly
challenging to differentiate generated texts from human-written ones. There is
a large number of studies that have demonstrated good results in bot
identification. However, the majority of such works depend on supervised
learning methods that require labelled data and/or prior knowledge about the
bot-model architecture. In this work, we propose a bot identification algorithm
that is based on unsupervised learning techniques and does not depend on a
large amount of labelled data. By combining findings in semantic analysis by
clustering (crisp and fuzzy) and information-theoretic techniques, we construct a robust
model that detects generated text for different types of bots. We find that
the generated texts tend to be more chaotic while literary works are more
complex. We also demonstrate that the clustering of human texts results in
fuzzier clusters in comparison to the more compact and well-separated clusters
of bot-generated texts.
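The contrast between fuzzy and compact clusters can be illustrated with a small sketch. This is not the authors' algorithm: it assumes toy 2-D points in place of real text embeddings, fixed cluster centers, and the standard fuzzy c-means membership formula, and the function names (`fcm_memberships`, `partition_coefficient`, `mean_membership_entropy`) are hypothetical. It computes two common fuzziness indicators, the partition coefficient and the mean entropy of the membership vectors.

```python
import math


def fcm_memberships(points, centers, m=2.0):
    """Fuzzy c-means memberships of each point for fixed centers (fuzzifier m > 1)."""
    power = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        zeros = sum(1 for d in dists if d == 0.0)
        if zeros:
            # The point coincides with a center: give it full membership there.
            memberships.append([1.0 / zeros if d == 0.0 else 0.0 for d in dists])
            continue
        memberships.append(
            [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]
        )
    return memberships


def partition_coefficient(memberships):
    """Mean of squared memberships: 1.0 for crisp clusters, 1/c for maximal fuzziness."""
    return sum(u * u for row in memberships for u in row) / len(memberships)


def mean_membership_entropy(memberships):
    """Average Shannon entropy (bits) of the membership vectors; higher means fuzzier."""
    total = 0.0
    for row in memberships:
        total -= sum(u * math.log2(u) for u in row if u > 0.0)
    return total / len(memberships)


# Toy data (assumption): two centers, one tight point set and one spread-out set.
centers = [(0.0, 0.0), (10.0, 0.0)]
compact_points = [(0.5, 0.0), (-0.5, 0.2), (9.5, 0.0), (10.4, -0.1)]
fuzzy_points = [(3.0, 0.0), (4.5, 1.0), (6.0, 0.0), (5.5, -1.0)]

compact_u = fcm_memberships(compact_points, centers)
fuzzy_u = fcm_memberships(fuzzy_points, centers)

print("compact: PC=%.3f entropy=%.3f"
      % (partition_coefficient(compact_u), mean_membership_entropy(compact_u)))
print("fuzzy:   PC=%.3f entropy=%.3f"
      % (partition_coefficient(fuzzy_u), mean_membership_entropy(fuzzy_u)))
```

Under these assumptions, a point set whose partition coefficient is closer to 1 and whose membership entropy is lower behaves like the compact, well-separated clusters the abstract attributes to bot-generated texts, while lower coefficient and higher entropy correspond to the fuzzier clusters attributed to human texts. Any decision threshold would have to be calibrated on real data.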
Related papers
- Detecting Machine-Generated Long-Form Content with Latent-Space Variables [54.07946647012579]
Existing zero-shot detectors primarily focus on token-level distributions, which are vulnerable to real-world domain shifts.
We propose a more robust method that incorporates abstract elements, such as event transitions, as key deciding factors to detect machine versus human texts.
arXiv Detail & Related papers (2024-10-04T18:42:09Z)
- Distinguishing Chatbot from Human [1.1249583407496218]
We develop a new dataset consisting of more than 750,000 human-written paragraphs.
Based on this dataset, we apply Machine Learning (ML) techniques to determine the origin of text.
Our proposed solutions offer high classification accuracy and serve as useful tools for textual analysis.
arXiv Detail & Related papers (2024-08-03T13:18:04Z)
- Spot the bot: Coarse-Grained Partition of Semantic Paths for Bots and Humans [55.2480439325792]
This paper focuses on comparing structures of the coarse-grained partitions of semantic paths for human-written and bot-generated texts.
As the semantic structure may be different for different languages, we investigate Russian, English, German, and Vietnamese languages.
arXiv Detail & Related papers (2024-02-27T10:38:37Z)
- Text2Data: Low-Resource Data Generation with Textual Control [104.38011760992637]
Natural language serves as a common and straightforward control signal for humans to interact seamlessly with machines.
We propose Text2Data, a novel approach that utilizes unlabeled data to understand the underlying data distribution through an unsupervised diffusion model.
It undergoes controllable finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting.
arXiv Detail & Related papers (2024-02-08T03:41:39Z)
- The Imitation Game: Detecting Human and AI-Generated Texts in the Era of ChatGPT and BARD [3.2228025627337864]
We introduce a novel dataset of human-written and AI-generated texts in different genres.
We employ several machine learning models to classify the texts.
Results demonstrate the efficacy of these models in discerning between human and AI-generated text.
arXiv Detail & Related papers (2023-07-22T21:00:14Z)
- On the Possibilities of AI-Generated Text Detection [76.55825911221434]
We argue that as machine-generated text approximates human-like quality, the sample size needed for detection increases.
We test various state-of-the-art text generators, including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against detectors including RoBERTa-Large/Base-Detector and GPTZero.
arXiv Detail & Related papers (2023-04-10T17:47:39Z)
- Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense [56.077252790310176]
We present a paraphrase generation model (DIPPER) that can paraphrase paragraphs, condition on surrounding context, and control lexical diversity and content reordering.
Using DIPPER to paraphrase text generated by three large language models (including GPT3.5-davinci-003) successfully evades several detectors, including watermarking.
We introduce a simple defense that relies on retrieving semantically-similar generations and must be maintained by a language model API provider.
arXiv Detail & Related papers (2023-03-23T16:29:27Z)
- A Deep Learning Anomaly Detection Method in Textual Data [0.45687771576879593]
We propose using deep learning and transformer architectures combined with classical machine learning algorithms.
We used multiple machine learning methods such as Sentence Transformers, Autoencoders, Logistic Regression and distance calculation methods to predict anomalies.
arXiv Detail & Related papers (2022-11-25T05:18:13Z)
- BeCAPTCHA-Type: Biometric Keystroke Data Generation for Improved Bot Detection [63.447493500066045]
This work proposes a data driven learning model for the synthesis of keystroke biometric data.
The proposed method is compared with two statistical approaches based on Universal and User-dependent models.
Our experimental framework considers a dataset with 136 million keystroke events from 168 thousand subjects.
arXiv Detail & Related papers (2022-07-27T09:26:15Z)
- Detecting Bot-Generated Text by Characterizing Linguistic Accommodation in Human-Bot Interactions [9.578008322407928]
The democratization of language generation models makes it easier to generate human-like text at scale for nefarious activities.
It is essential to understand how people interact with bots and develop methods to detect bot-generated text.
This paper shows that bot-generated text detection methods are more robust across datasets and models if we use information about how people respond to it.
arXiv Detail & Related papers (2021-06-02T14:10:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.