A ground-truth dataset and classification model for detecting bots in
GitHub issue and PR comments
- URL: http://arxiv.org/abs/2010.03303v2
- Date: Tue, 19 Jan 2021 14:22:51 GMT
- Title: A ground-truth dataset and classification model for detecting bots in
GitHub issue and PR comments
- Authors: Mehdi Golzadeh, Alexandre Decan, Damien Legay and Tom Mens
- Abstract summary: Bots are used in GitHub repositories to automate repetitive activities that are part of the distributed software development process.
This paper proposes a ground-truth dataset, based on a manual analysis with high interrater agreement, of pull request and issue comments in 5,000 distinct GitHub accounts.
We propose an automated classification model to detect bots, taking as main features the number of empty and non-empty comments of each account, the number of comment patterns, and the inequality between comments within comment patterns.
- Score: 70.1864008701113
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Bots are frequently used in GitHub repositories to automate repetitive
activities that are part of the distributed software development process. They
communicate with human actors through comments. While detecting their presence
is important for many reasons, no large and representative ground-truth dataset
is available, nor are classification models to detect and validate bots on the
basis of such a dataset. This paper proposes a ground-truth dataset, based on a
manual analysis with high interrater agreement, of pull request and issue
comments in 5,000 distinct GitHub accounts of which 527 have been identified as
bots. Using this dataset we propose an automated classification model to detect
bots, taking as main features the number of empty and non-empty comments of
each account, the number of comment patterns, and the inequality between
comments within comment patterns. We obtained a very high weighted average
precision, recall and F1-score of 0.98 on a test set containing 40% of the
data. We integrated the classification model into an open source command-line
tool to allow practitioners to detect which accounts in a given GitHub
repository actually correspond to bots.
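The four features named in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how comment patterns are formed or how inequality is measured, so two assumptions are made here: a "pattern" is taken to be a group of comments that are identical after lowercasing and whitespace normalization, and inequality is taken to be the Gini coefficient of the pattern sizes. The function names `gini` and `account_features` are hypothetical and do not come from the authors' tool.

```python
from collections import Counter


def gini(values):
    """Gini coefficient of non-negative values (0.0 means perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending data: sum_i (2i - n - 1) * x_i / (n * total)
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


def account_features(comments):
    """Compute the four abstract-level features for one account's comments."""
    # Lowercase and collapse whitespace so trivially different copies match.
    normalized = [" ".join(c.lower().split()) for c in comments]
    non_empty = [c for c in normalized if c]
    # Assumption: a pattern is a group of identical normalized comments.
    pattern_sizes = Counter(non_empty).values()
    return {
        "empty_comments": len(normalized) - len(non_empty),
        "non_empty_comments": len(non_empty),
        "comment_patterns": len(pattern_sizes),
        # Assumption: inequality is the Gini coefficient of pattern sizes.
        "pattern_inequality": gini(pattern_sizes),
    }
```

Under these assumptions, an account that posts the same message eight times, a second message twice, and one empty comment has 10 non-empty comments in only 2 patterns, whereas an account whose comments are all different has as many patterns as comments. A real classifier would then be trained on such per-account feature vectors.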
Related papers
- Fact Checking Beyond Training Set [64.88575826304024]
We show that the retriever-reader suffers from performance deterioration when it is trained on labeled data from one domain and used in another domain.
We propose an adversarial algorithm to make the retriever component robust against distribution shift.
We then construct eight fact checking scenarios from these datasets, and compare our model to a set of strong baseline models.
arXiv Detail & Related papers (2024-03-27T15:15:14Z)
- BotHawk: An Approach for Bots Detection in Open Source Software Projects [4.59229477803039]
This research aims to investigate bots' behavior in open-source software projects and identify bot accounts with maximum possible accuracy.
We've identified four types of bot accounts in open-source software projects by analyzing their behavior across 17 features in 5 dimensions.
Our team created BotHawk, a highly effective model for detecting bots in open-source software projects.
arXiv Detail & Related papers (2023-07-25T10:15:38Z)
- BotArtist: Generic approach for bot detection in Twitter via semi-automatic machine learning pipeline [47.61306219245444]
Twitter has become a target for bots and fake accounts, resulting in the spread of false information and manipulation.
This paper introduces a semi-automatic machine learning pipeline (SAMLP) designed to address the challenges correlated with machine learning model development.
We develop a comprehensive bot detection model named BotArtist, based on user profile features.
arXiv Detail & Related papers (2023-05-31T09:12:35Z)
- BotShape: A Novel Social Bots Detection Approach via Behavioral Patterns [4.386183132284449]
Based on a real-world data set, we construct behavioral sequences from raw event logs.
We observe differences between bots and genuine users and similar patterns among bot accounts.
We present a novel social bot detection system BotShape, to automatically catch behavioral sequences and characteristics.
arXiv Detail & Related papers (2023-03-17T19:03:06Z)
- Simplistic Collection and Labeling Practices Limit the Utility of Benchmark Datasets for Twitter Bot Detection [3.8428576920007083]
We show that high performance is attributable to limitations in dataset collection and labeling rather than sophistication of the tools.
Our findings have important implications for both transparency in sampling and labeling procedures and potential biases in research.
arXiv Detail & Related papers (2023-01-17T17:05:55Z)
- BeCAPTCHA-Type: Biometric Keystroke Data Generation for Improved Bot Detection [63.447493500066045]
This work proposes a data driven learning model for the synthesis of keystroke biometric data.
The proposed method is compared with two statistical approaches based on Universal and User-dependent models.
Our experimental framework considers a dataset with 136 million keystroke events from 168 thousand subjects.
arXiv Detail & Related papers (2022-07-27T09:26:15Z)
- Intrinsic Certified Robustness of Bagging against Data Poisoning Attacks [75.46678178805382]
In a data poisoning attack, an attacker modifies, deletes, and/or inserts some training examples to corrupt the learnt machine learning model.
We prove the intrinsic certified robustness of bagging against data poisoning attacks.
Our method achieves a certified accuracy of 91.1% on MNIST when arbitrarily modifying, deleting, and/or inserting 100 training examples.
arXiv Detail & Related papers (2020-08-11T03:12:42Z)
- Detection of Novel Social Bots by Ensembles of Specialized Classifiers [60.63582690037839]
Malicious actors create inauthentic social media accounts controlled in part by algorithms, known as social bots, to disseminate misinformation and agitate online discussion.
We show that different types of bots are characterized by different behavioral features.
We propose a new supervised learning method that trains classifiers specialized for each class of bots and combines their decisions through the maximum rule.
arXiv Detail & Related papers (2020-06-11T22:59:59Z)
- Detecting and Characterizing Bots that Commit Code [16.10540443996897]
We propose a systematic approach to detect bots using author names, commit messages, files modified by the commit, and projects associated with the commits.
We have compiled a shareable dataset containing detailed information about 461 bots we found (all of whom have more than 1000 commits) and 13,762,430 commits they created.
arXiv Detail & Related papers (2020-03-02T21:54:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.