Simplistic Collection and Labeling Practices Limit the Utility of
Benchmark Datasets for Twitter Bot Detection
- URL: http://arxiv.org/abs/2301.07015v2
- Date: Mon, 1 May 2023 16:40:41 GMT
- Title: Simplistic Collection and Labeling Practices Limit the Utility of
Benchmark Datasets for Twitter Bot Detection
- Authors: Chris Hays, Zachary Schutzman, Manish Raghavan, Erin Walk and Philipp
Zimmer
- Abstract summary: We show that high performance is attributable to limitations in dataset collection and labeling rather than sophistication of the tools.
Our findings have important implications for both transparency in sampling and labeling procedures and potential biases in research.
- Score: 3.8428576920007083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate bot detection is necessary for the safety and integrity of online
platforms. It is also crucial for research on the influence of bots in
elections, the spread of misinformation, and financial market manipulation.
Platforms deploy infrastructure to flag or remove automated accounts, but their
tools and data are not publicly available. Thus, the public must rely on
third-party bot detection. These tools employ machine learning and often
achieve near-perfect performance for classification on existing datasets,
suggesting bot detection is accurate, reliable and fit for use in downstream
applications. We provide evidence that this is not the case and show that high
performance is attributable to limitations in dataset collection and labeling
rather than sophistication of the tools. Specifically, we show that simple
decision rules -- shallow decision trees trained on a small number of features
-- achieve near-state-of-the-art performance on most available datasets and
that bot detection datasets, even when combined together, do not generalize
well to out-of-sample datasets. Our findings reveal that predictions are highly
dependent on each dataset's collection and labeling procedures rather than
fundamental differences between bots and humans. These results have important
implications for both transparency in sampling and labeling procedures and
potential biases in research using existing bot detection tools for
pre-processing.
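The abstract's central claim is that shallow decision trees trained on a small number of features match state-of-the-art bot detectors on existing datasets. A minimal sketch of that baseline is below; the feature names and the synthetic data are illustrative assumptions, not the paper's actual features or datasets.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000
# Hypothetical per-account features, e.g. follower count, following count,
# tweets per day, account age (all synthetic here)
X = rng.normal(size=(n, 4))
# Synthetic labels driven mostly by one feature, mimicking a dataset where
# a single simple rule separates "bots" from "humans"
y = (X[:, 0] + 0.1 * rng.normal(size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# "Shallow decision rules": a depth-2 tree, i.e. at most three split questions
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.2f}")
```

When a classifier this simple scores near-perfectly, the paper's argument is that the dataset's collection or labeling procedure, not bot sophistication, is doing the separating.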
Related papers
- Approaching Metaheuristic Deep Learning Combos for Automated Data Mining [0.5419570023862531]
This work proposes a means of combining meta-heuristic methods with conventional classifiers and neural networks in order to perform automated data mining.
Experiments on the MNIST dataset for handwritten digit recognition were performed.
It was empirically observed that validation accuracy on a ground-truth-labeled dataset is inadequate for correcting the labels of previously unseen data instances.
arXiv Detail & Related papers (2024-10-16T10:28:22Z)
- Bayesian Detector Combination for Object Detection with Crowdsourced Annotations [49.43709660948812]
Acquiring fine-grained object detection annotations in unconstrained images is time-consuming, expensive, and prone to noise.
We propose a novel Bayesian Detector Combination (BDC) framework to more effectively train object detectors with noisy crowdsourced annotations.
BDC is model-agnostic, requires no prior knowledge of the annotators' skill level, and seamlessly integrates with existing object detection models.
arXiv Detail & Related papers (2024-07-10T18:00:54Z)
- Incremental Self-training for Semi-supervised Learning [56.57057576885672]
IST is simple yet effective and fits existing self-training-based semi-supervised learning methods.
We verify the proposed IST on five datasets and two types of backbone, effectively improving the recognition accuracy and learning speed.
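The entry above builds on the standard self-training loop: train on labeled data, pseudo-label confident unlabeled points, and retrain. A toy sketch of that loop is below; the model choice, confidence cutoff, and synthetic two-blob data are illustrative assumptions, not the paper's IST method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n_lab, n_unl = 40, 400
y_all = rng.integers(0, 2, size=n_lab + n_unl)
# Two Gaussian blobs, shifted apart by the class label
X_all = rng.normal(size=(n_lab + n_unl, 2)) + 2.0 * y_all[:, None]
X_lab, y_lab = X_all[:n_lab], y_all[:n_lab]
X_unl = X_all[n_lab:]

clf = LogisticRegression().fit(X_lab, y_lab)
for _ in range(3):  # a few self-training rounds
    conf = clf.predict_proba(X_unl).max(axis=1)
    confident = conf > 0.9  # pseudo-label only high-confidence points
    X_aug = np.vstack([X_lab, X_unl[confident]])
    y_aug = np.concatenate([y_lab, clf.predict(X_unl[confident])])
    clf = LogisticRegression().fit(X_aug, y_aug)
```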
arXiv Detail & Related papers (2024-04-14T05:02:00Z)
- BotSSCL: Social Bot Detection with Self-Supervised Contrastive Learning [6.317191658158437]
We propose BotSSCL, a novel framework for social bot detection with self-supervised contrastive learning.
BotSSCL uses contrastive learning to distinguish between social bots and humans in the embedding space to improve linear separability.
We demonstrate BotSSCL's robustness against adversarial attempts to manipulate bot accounts to evade detection.
arXiv Detail & Related papers (2024-02-06T06:13:13Z)
- BotShape: A Novel Social Bots Detection Approach via Behavioral Patterns [4.386183132284449]
Based on a real-world data set, we construct behavioral sequences from raw event logs.
We observe differences between bots and genuine users and similar patterns among bot accounts.
We present BotShape, a novel social bot detection system that automatically captures behavioral sequences and characteristics.
arXiv Detail & Related papers (2023-03-17T19:03:06Z)
- Promises and Pitfalls of Threshold-based Auto-labeling [17.349289155257715]
Threshold-based auto-labeling (TBAL) machine-labels the data points whose model confidence exceeds a threshold chosen on human-labeled validation data.
We derive complexity bounds on the amount of human-labeled validation data required for guaranteeing the quality of machine-labeled data.
We validate our theoretical guarantees with extensive experiments on synthetic and real datasets.
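The TBAL workflow described above can be sketched as follows: pick the smallest confidence threshold whose validation points above it meet a target error rate, then auto-label only unlabeled points above that threshold. The data, target error rate, and the particular threshold-selection rule here are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_threshold(val_conf, val_correct, max_error=0.05):
    """Smallest confidence threshold whose validation subset has error <= max_error."""
    for t in np.sort(np.unique(val_conf)):
        mask = val_conf >= t
        if mask.any() and (1.0 - val_correct[mask].mean()) <= max_error:
            return t
    return np.inf  # no threshold meets the target: auto-label nothing

# Synthetic model confidences on validation data, plus whether the model
# was right (higher confidence -> more often right)
val_conf = rng.uniform(0.5, 1.0, size=500)
val_correct = rng.uniform(size=500) < val_conf

t = fit_threshold(val_conf, val_correct)

# Auto-label only the unlabeled points above the fitted threshold
unl_conf = rng.uniform(0.5, 1.0, size=1000)
auto_labeled = unl_conf >= t
print(f"threshold={t:.2f}, auto-labeled {auto_labeled.sum()} of 1000")
```

The paper's contribution is bounding how much human-labeled validation data this selection step needs before the error guarantee actually holds.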
arXiv Detail & Related papers (2022-11-22T22:53:17Z)
- Prompt-driven efficient Open-set Semi-supervised Learning [52.30303262499391]
Open-set semi-supervised learning (OSSL) has attracted growing interest, which investigates a more practical scenario where out-of-distribution (OOD) samples are only contained in unlabeled data.
We propose a prompt-driven efficient OSSL framework, called OpenPrompt, which can propagate class information from labeled to unlabeled data with only a small number of trainable parameters.
arXiv Detail & Related papers (2022-09-28T16:25:08Z)
- BeCAPTCHA-Type: Biometric Keystroke Data Generation for Improved Bot Detection [63.447493500066045]
This work proposes a data driven learning model for the synthesis of keystroke biometric data.
The proposed method is compared with two statistical approaches based on Universal and User-dependent models.
Our experimental framework considers a dataset with 136 million keystroke events from 168 thousand subjects.
arXiv Detail & Related papers (2022-07-27T09:26:15Z)
- Hidden Biases in Unreliable News Detection Datasets [60.71991809782698]
We show that selection bias during data collection leads to undesired artifacts in the datasets.
We observed a significant drop (>10%) in accuracy for all models tested in a clean split with no train/test source overlap.
We suggest future dataset creation include a simple model as a difficulty/bias probe and future model development use a clean non-overlapping site and date split.
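The "clean split" suggested above means holding out entire sources so that no source appears in both train and test. One standard way to get such a split is a grouped splitter; the synthetic source IDs below are an illustrative assumption.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 5))
y = rng.integers(0, 2, size=n)
sources = rng.integers(0, 10, size=n)  # hypothetical article-source IDs

# Split by source: every article from a given source lands entirely in
# train or entirely in test, removing source overlap as a shortcut signal
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=sources))

overlap = set(sources[train_idx]) & set(sources[test_idx])
print(f"overlapping sources: {len(overlap)}")
```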
arXiv Detail & Related papers (2021-04-20T17:16:41Z)
- Diverse Complexity Measures for Dataset Curation in Self-driving [80.55417232642124]
We propose a new data selection method that exploits a diverse set of criteria that quantize interestingness of traffic scenes.
Our experiments show that the proposed curation pipeline is able to select datasets that lead to better generalization and higher performance.
arXiv Detail & Related papers (2021-01-16T23:45:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.