BotArtist: Generic approach for bot detection in Twitter via semi-automatic machine learning pipeline
- URL: http://arxiv.org/abs/2306.00037v4
- Date: Mon, 22 Jul 2024 11:40:15 GMT
- Title: BotArtist: Generic approach for bot detection in Twitter via semi-automatic machine learning pipeline
- Authors: Alexander Shevtsov, Despoina Antonakaki, Ioannis Lamprou, Polyvios Pratikakis, Sotiris Ioannidis,
- Abstract summary: Twitter has become a target for bots and fake accounts, resulting in the spread of false information and manipulation.
This paper introduces a semi-automatic machine learning pipeline (SAMLP) designed to address the challenges correlated with machine learning model development.
We develop a comprehensive bot detection model named BotArtist, based on user profile features.
- Score: 47.61306219245444
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Twitter, as one of the most popular social networks, provides a platform for communication and online discourse. Unfortunately, it has also become a target for bots and fake accounts, resulting in the spread of false information and manipulation. This paper introduces a semi-automatic machine learning pipeline (SAMLP) designed to address the challenges correlated with machine learning model development. Through this pipeline, we develop a comprehensive bot detection model named BotArtist, based on user profile features. SAMLP leverages nine distinct publicly available datasets to train the BotArtist model. To assess BotArtist's performance against current state-of-the-art solutions, we select 35 existing Twitter bot detection methods, each utilizing a diverse range of features. Our comparative evaluation of BotArtist and these existing methods, conducted across nine public datasets under standardized conditions, reveals that the proposed model outperforms existing solutions by almost 10%, in terms of F1-score, achieving an average score of 83.19 and 68.5 over specific and general approaches respectively. As a result of this research, we provide a dataset of the extracted features combined with BotArtist predictions over the 10.929.533 Twitter user profiles, collected via Twitter API during the 2022 Russo-Ukrainian War, over a 16-month period. This dataset was created in collaboration with [Shevtsov et al., 2022a] where the original authors share anonymized tweets on the discussion of the Russo-Ukrainian war with a total amount of 127.275.386 tweets. The combination of the existing text dataset and the provided labeled bot and human profiles will allow for the future development of a more advanced bot detection large language model in the post-Twitter API era.
Related papers
- My Brother Helps Me: Node Injection Based Adversarial Attack on Social Bot Detection [69.99192868521564]
Social platforms such as Twitter are under siege from a multitude of fraudulent users.
Due to the structure of social networks, the majority of methods are based on the graph neural network(GNN), which is susceptible to attacks.
We propose a node injection-based adversarial attack method designed to deceive bot detection models.
arXiv Detail & Related papers (2023-10-11T03:09:48Z) - Context-Based Tweet Engagement Prediction [0.0]
This thesis investigates how well context alone may be used to predict tweet engagement likelihood.
We employed the Spark engine on TU Wien's Little Big Data Cluster to create scalable data preprocessing, feature engineering, feature selection, and machine learning pipelines.
We also found that factors such as the prediction algorithm, training dataset size, training dataset sampling method, and feature selection significantly affect the results.
arXiv Detail & Related papers (2023-09-28T08:36:57Z) - BeCAPTCHA-Type: Biometric Keystroke Data Generation for Improved Bot
Detection [63.447493500066045]
This work proposes a data driven learning model for the synthesis of keystroke biometric data.
The proposed method is compared with two statistical approaches based on Universal and User-dependent models.
Our experimental framework considers a dataset with 136 million keystroke events from 168 thousand subjects.
arXiv Detail & Related papers (2022-07-27T09:26:15Z) - TwiBot-22: Towards Graph-Based Twitter Bot Detection [39.359825215347655]
TwiBot-22 is a graph-based Twitter bot detection benchmark that presents the largest dataset to date.
We re-implement 35 representative Twitter bot detection baselines and evaluate them on 9 datasets, including TwiBot-22.
To facilitate further research, we consolidate all implemented codes and datasets into the TwiBot-22 evaluation framework.
arXiv Detail & Related papers (2022-06-09T15:23:37Z) - Identification of Twitter Bots based on an Explainable ML Framework: the
US 2020 Elections Case Study [72.61531092316092]
This paper focuses on the design of a novel system for identifying Twitter bots based on labeled Twitter data.
Supervised machine learning (ML) framework is adopted using an Extreme Gradient Boosting (XGBoost) algorithm.
Our study also deploys Shapley Additive Explanations (SHAP) for explaining the ML model predictions.
arXiv Detail & Related papers (2021-12-08T14:12:24Z) - BotSpot: Deep Learning Classification of Bot Accounts within Twitter [2.099922236065961]
The openness feature of Twitter allows programs to generate and control Twitter accounts automatically via the Twitter API.
These accounts, which are known as bots, can automatically perform actions such as tweeting, re-tweeting, following, unfollowing, or direct messaging other accounts.
We introduce a novel bot detection approach using deep learning, with the Multi-layer Perceptron Neural Networks and nine features of a bot account.
arXiv Detail & Related papers (2021-09-08T15:17:10Z) - A ground-truth dataset and classification model for detecting bots in
GitHub issue and PR comments [70.1864008701113]
Bots are used in Github repositories to automate repetitive activities that are part of the distributed software development process.
This paper proposes a ground-truth dataset, based on a manual analysis with high interrater agreement, of pull request and issue comments in 5,000 distinct Github accounts.
We propose an automated classification model to detect bots, taking as main features the number of empty and non-empty comments of each account, the number of comment patterns, and the inequality between comments within comment patterns.
arXiv Detail & Related papers (2020-10-07T09:30:52Z) - Detection of Novel Social Bots by Ensembles of Specialized Classifiers [60.63582690037839]
Malicious actors create inauthentic social media accounts controlled in part by algorithms, known as social bots, to disseminate misinformation and agitate online discussion.
We show that different types of bots are characterized by different behavioral features.
We propose a new supervised learning method that trains classifiers specialized for each class of bots and combines their decisions through the maximum rule.
arXiv Detail & Related papers (2020-06-11T22:59:59Z) - Twitter Bot Detection Using Bidirectional Long Short-term Memory Neural
Networks and Word Embeddings [6.09170287691728]
This paper develops a recurrent neural model with word embeddings to distinguish Twitter bots from human accounts.
Experiments show that our approach can achieve competitive performance compared with existing state-of-the-art bot detection systems.
arXiv Detail & Related papers (2020-02-03T17:07:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.