Balancing Quality and Variation: Spam Filtering Distorts Data Label Distributions
- URL: http://arxiv.org/abs/2509.08217v2
- Date: Thu, 06 Nov 2025 18:49:56 GMT
- Title: Balancing Quality and Variation: Spam Filtering Distorts Data Label Distributions
- Authors: Eve Fleisig, Matthias Orlikowski, Philipp Cimiano, Dan Klein,
- Abstract summary: We evaluate how a range of settings for annotator filtering affect the preservation of variation on subjective tasks.<n>We find that conservative settings for annotator removal (5%) are best, after which all tested methods increase the mean absolute error from the true average label.<n>These results highlight the need for spam removal methods that account for label diversity.
- Score: 18.226008559486562
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For machine learning datasets to accurately represent diverse opinions in a population, they must preserve variation in data labels while filtering out spam or low-quality responses. How can we balance annotator reliability and representation? We empirically evaluate how a range of heuristics for annotator filtering affect the preservation of variation on subjective tasks. We find that these methods, designed for contexts in which variation from a single ground-truth label is considered noise, often remove annotators who disagree instead of spam annotators, introducing suboptimal tradeoffs between accuracy and label diversity. We find that conservative settings for annotator removal (<5%) are best, after which all tested methods increase the mean absolute error from the true average label. We analyze performance on synthetic spam to observe that these methods often assume spam annotators are more random than real spammers tend to be: most spammers are distributionally indistinguishable from real annotators, and the minority that are distinguishable tend to give relatively fixed answers, not random ones. Thus, tasks requiring the preservation of variation reverse the intuition of existing spam filtering methods: spammers tend to be less random than non-spammers, so metrics that assume variation is spam fare worse. These results highlight the need for spam removal methods that account for label diversity.
Related papers
- MARS-Sep: Multimodal-Aligned Reinforced Sound Separation [72.85468563236005]
MARS-Sep is a reinforcement learning framework for sound separation.<n>It learns a factorized Beta mask policy that is optimized by a clipped trust-region surrogate.<n>Experiments on multiple benchmarks demonstrate consistent gains in Text-, Audio-, and Image-Queried separation.
arXiv Detail & Related papers (2025-10-12T09:05:28Z) - The Lipschitz-Variance-Margin Tradeoff for Enhanced Randomized Smoothing [85.85160896547698]
Real-life applications of deep neural networks are hindered by their unsteady predictions when faced with noisy inputs and adversarial attacks.
We show how to design an efficient classifier with a certified radius by relying on noise injection into the inputs.
Our novel certification procedure allows us to use pre-trained models with randomized smoothing, effectively improving the current certification radius in a zero-shot manner.
arXiv Detail & Related papers (2023-09-28T22:41:47Z) - Crowdsourcing subjective annotations using pairwise comparisons reduces
bias and error compared to the majority-vote method [0.0]
We introduce a theoretical framework for understanding how random error and measurement bias enter into crowdsourced annotations of subjective constructs.
We then propose a pipeline that combines pairwise comparison labelling with Elo scoring, and demonstrate that it outperforms the ubiquitous majority-voting method in reducing both types of measurement error.
arXiv Detail & Related papers (2023-05-31T17:14:12Z) - Bayesian Self-Supervised Contrastive Learning [16.903874675729952]
This paper proposes a new self-supervised contrastive loss called the BCL loss.
The key idea is to design the desired sampling distribution for sampling hard true negative samples under the Bayesian framework.
Experiments validate the effectiveness and superiority of the BCL loss.
arXiv Detail & Related papers (2023-01-27T12:13:06Z) - Confidence-aware Training of Smoothed Classifiers for Certified
Robustness [75.95332266383417]
We use "accuracy under Gaussian noise" as an easy-to-compute proxy of adversarial robustness for an input.
Our experiments show that the proposed method consistently exhibits improved certified robustness upon state-of-the-art training methods.
arXiv Detail & Related papers (2022-12-18T03:57:12Z) - Filter and evolve: progressive pseudo label refining for semi-supervised
automatic speech recognition [5.735000563764309]
Low quality pseudo labels can misguide decision boundaries and degrade performance.
We propose a simple yet effective strategy to filter low quality pseudo labels.
Experiments on LibriSpeech show that these filtered samples enable the refined model to yield more correct predictions.
arXiv Detail & Related papers (2022-10-28T16:15:58Z) - Bias Mimicking: A Simple Sampling Approach for Bias Mitigation [57.17709477668213]
We introduce a new class-conditioned sampling method: Bias Mimicking.
Bias Mimicking improves underrepresented groups' accuracy of sampling methods by 3% over four benchmarks.
arXiv Detail & Related papers (2022-09-30T17:33:00Z) - Disentangling Sampling and Labeling Bias for Learning in Large-Output
Spaces [64.23172847182109]
We show that different negative sampling schemes implicitly trade-off performance on dominant versus rare labels.
We provide a unified means to explicitly tackle both sampling bias, arising from working with a subset of all labels, and labeling bias, which is inherent to the data due to label imbalance.
arXiv Detail & Related papers (2021-05-12T15:40:13Z) - Robust Spammer Detection by Nash Reinforcement Learning [64.80986064630025]
We develop a minimax game where the spammers and spam detectors compete with each other on their practical goals.
We show that an optimization algorithm can reliably find an equilibrial detector that can robustly prevent spammers with any mixed spamming strategies from attaining their practical goal.
arXiv Detail & Related papers (2020-06-10T21:18:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.