Are you sure? Measuring models bias in content moderation through uncertainty
- URL: http://arxiv.org/abs/2509.22699v2
- Date: Tue, 28 Oct 2025 14:11:48 GMT
- Title: Are you sure? Measuring models bias in content moderation through uncertainty
- Authors: Alessandra Urbinati, Mirko Lai, Simona Frenda, Marco Antonio Stranisci,
- Abstract summary: We present an unsupervised approach that benchmarks models on the basis of their uncertainty in classifying messages annotated by people belonging to vulnerable groups. We use uncertainty, computed by means of the conformal prediction technique, as a proxy to analyze the bias of 11 models against women and non-white annotators. The results show that some pre-trained models predict with high accuracy the labels coming from minority groups, even if the confidence in their prediction is low.
- Score: 41.43421165541282
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic content moderation is crucial to ensuring safety on social media. Language Model-based classifiers are increasingly adopted for this task, but they have been shown to perpetuate racial and social biases. Even though several resources and benchmark corpora have been developed to address this issue, measuring the fairness of models in content moderation remains an open problem. In this work, we present an unsupervised approach that benchmarks models on the basis of their uncertainty in classifying messages annotated by people belonging to vulnerable groups. We use uncertainty, computed by means of the conformal prediction technique, as a proxy to analyze the bias of 11 models against women and non-white annotators, and observe to what extent it diverges from performance-based metrics such as the $F_1$ score. The results show that some pre-trained models predict the labels coming from minority groups with high accuracy, even though the confidence in their predictions is low. By measuring the confidence of models, we can therefore see which groups of annotators are better represented in pre-trained models and guide the debiasing of these models before their effective use.
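To make the core idea concrete, below is a minimal sketch of split conformal prediction used as an uncertainty probe: calibrate a nonconformity threshold on held-out annotations, build prediction sets for new messages, and compare average set sizes across annotator groups. The binary toxic/non-toxic setup and all data are illustrative; the paper's exact calibration protocol may differ.

```python
import numpy as np

def conformal_quantile(cal_probs, cal_labels, alpha=0.1):
    """Split conformal prediction: calibrate a nonconformity threshold.

    cal_probs: (n, k) softmax scores on a held-out calibration set.
    cal_labels: (n,) integer labels given by the annotators.
    """
    n = len(cal_labels)
    # Nonconformity score: 1 - probability assigned to the annotator's label.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile (requires numpy >= 1.22 for `method`).
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, q_level, method="higher")

def prediction_sets(test_probs, q_hat):
    """A label enters the set when its nonconformity 1 - p stays under q_hat."""
    return test_probs >= 1.0 - q_hat  # boolean (m, k) mask

# Toy usage: mean prediction-set size per annotator group as a bias proxy;
# larger sets mean the model is less certain about that group's labels.
rng = np.random.default_rng(0)
probs = rng.dirichlet([2, 1], size=400)          # binary toxic/non-toxic scores
labels = rng.integers(0, 2, size=400)            # labels from annotators
groups = rng.choice(["women", "men"], size=400)  # annotator demographics

q_hat = conformal_quantile(probs[:200], labels[:200])
sets = prediction_sets(probs[200:], q_hat)
for g in ["women", "men"]:
    mask = groups[200:] == g
    print(g, "mean prediction-set size:", sets[mask].sum(axis=1).mean())
```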
Related papers
- Downsized and Compromised?: Assessing the Faithfulness of Model Compression [0.0]
This paper presents a novel approach to evaluating faithfulness in compressed models, moving beyond standard metrics. We introduce and demonstrate a set of faithfulness metrics that capture how model behavior changes post-compression. Our contributions include introducing techniques to assess predictive consistency between the original and compressed models using model agreement, and applying chi-squared tests to detect statistically significant changes in predictive patterns across both the overall dataset and demographic subgroups.
arXiv Detail & Related papers (2025-10-07T17:05:02Z)
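The summary above does not spell out the exact tests, but the general recipe can be sketched as follows: compute an agreement rate between the original and compressed models' predictions, then run a chi-squared test on the per-class prediction counts within a demographic subgroup. Data and function names here are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

def agreement_rate(preds_a, preds_b):
    """Fraction of inputs where the two models predict the same label."""
    return float(np.mean(preds_a == preds_b))

def prediction_shift_test(preds_orig, preds_comp, n_classes=2):
    """Chi-squared test on the contingency table of per-class prediction
    counts before vs. after compression."""
    table = np.stack([
        np.bincount(preds_orig, minlength=n_classes),
        np.bincount(preds_comp, minlength=n_classes),
    ])
    chi2, p_value, _, _ = chi2_contingency(table)
    return chi2, p_value

rng = np.random.default_rng(1)
orig = rng.integers(0, 2, 1000)
comp = np.where(rng.random(1000) < 0.9, orig, 1 - orig)  # compression flips 10%
subgroup = rng.random(1000) < 0.3                        # hypothetical demographic slice

print("overall agreement:", agreement_rate(orig, comp))
print("subgroup shift p-value:", prediction_shift_test(orig[subgroup], comp[subgroup])[1])
```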
- Fairness without Sensitive Attributes via Knowledge Sharing [13.141672574114597]
We propose a confidence-based hierarchical classifier structure called "Reckoner" for reliable fair model learning under the assumption of missing sensitive attributes.
Our experimental results show that Reckoner consistently outperforms state-of-the-art baselines on the COMPAS and New Adult datasets.
arXiv Detail & Related papers (2024-09-27T06:16:14Z)
- Identifying and Mitigating Social Bias Knowledge in Language Models [52.52955281662332]
We propose a novel debiasing approach, Fairness Stamp (FAST), which enables fine-grained calibration of individual social biases. FAST surpasses state-of-the-art baselines with superior debiasing performance. This highlights the potential of fine-grained debiasing strategies to achieve fairness in large language models.
arXiv Detail & Related papers (2024-08-07T17:14:58Z)
- Beyond Confidence: Reliable Models Should Also Consider Atypicality [43.012818086415514]
We investigate the relationship between how atypical (rare) a sample or a class is and the reliability of a model's predictions.
We show that predictions for atypical inputs or atypical classes are more overconfident and have lower accuracy.
We propose that models should use not only confidence but also atypicality to improve uncertainty quantification and performance.
arXiv Detail & Related papers (2023-05-29T17:37:09Z)
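The summary leaves the atypicality estimator unspecified; one simple stand-in is the distance to the nearest class centroid in feature space. The sketch below flags predictions that are both highly confident and highly atypical, the combination the paper identifies as prone to overconfidence. All names and data are illustrative.

```python
import numpy as np

def atypicality(features, centroids):
    """Distance to the nearest class centroid as a simple rarity proxy
    (one of several possible estimators, not necessarily the paper's)."""
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.min(axis=1)

def flag_unreliable(probs, features, centroids, conf_thr=0.9, atyp_q=0.9):
    """Flag predictions that are confident *and* atypical: per the paper's
    finding, these are the ones most likely to be overconfident."""
    conf = probs.max(axis=1)
    atyp = atypicality(features, centroids)
    return (conf >= conf_thr) & (atyp >= np.quantile(atyp, atyp_q))

rng = np.random.default_rng(2)
feats = rng.normal(size=(500, 16))
cents = rng.normal(size=(2, 16))          # one centroid per class
probs = rng.dirichlet([1, 1], size=500)
print("flagged:", flag_unreliable(probs, feats, cents).sum(), "of 500")
```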
- Gender Biases in Automatic Evaluation Metrics for Image Captioning [87.15170977240643]
We conduct a systematic study of gender biases in model-based evaluation metrics for image captioning tasks.
We demonstrate the negative consequences of using these biased metrics, including the inability to differentiate between biased and unbiased generations.
We present a simple and effective way to mitigate the metric bias without hurting the correlations with human judgments.
arXiv Detail & Related papers (2023-05-24T04:27:40Z)
- Statistical Inference for Fairness Auditing [4.318555434063274]
We frame this task as "fairness auditing" and cast it as a multiple hypothesis testing problem.
We show how the bootstrap can be used to simultaneously bound performance disparities over a collection of groups.
Our methods can be used to flag subpopulations affected by model underperformance, and certify subpopulations for which the model performs adequately.
arXiv Detail & Related papers (2023-05-05T17:54:22Z)
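A hedged sketch of the bootstrap idea: resample the evaluation set, recompute the largest group-vs-overall accuracy gap on each draw, and report a percentile interval. The paper's procedure additionally controls for multiple testing; this toy version only illustrates the resampling step, with illustrative names and data.

```python
import numpy as np

def bootstrap_disparity_ci(y_true, y_pred, groups, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the max accuracy gap between any group
    and the overall population (a simultaneous-style bound)."""
    rng = np.random.default_rng(seed)
    labels = np.unique(groups)
    n = len(y_true)
    gaps = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                # resample with replacement
        yt, yp, g = y_true[idx], y_pred[idx], groups[idx]
        overall = np.mean(yt == yp)
        gap = max(abs(np.mean(yt[g == l] == yp[g == l]) - overall)
                  for l in labels if np.any(g == l))
        gaps.append(gap)
    return np.quantile(gaps, [alpha / 2, 1 - alpha / 2])

rng = np.random.default_rng(3)
y = rng.integers(0, 2, 600)
pred = np.where(rng.random(600) < 0.8, y, 1 - y)   # classifier with ~80% accuracy
grp = rng.choice(np.array(["a", "b", "c"]), size=600)
lo, hi = bootstrap_disparity_ci(y, pred, grp)
print(f"max group-vs-overall accuracy gap: [{lo:.3f}, {hi:.3f}]")
```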
- De-biasing "bias" measurement [20.049916973204102]
We show that metrics used to measure group-wise model performance disparities are themselves statistically biased estimators of the underlying quantities they purport to represent.
We propose the "double-corrected" variance estimator, which provides unbiased estimates and uncertainty quantification of the variance of model performance across groups.
arXiv Detail & Related papers (2022-05-11T20:51:57Z)
- Measuring Fairness Under Unawareness of Sensitive Attributes: A Quantification-Based Approach [131.20444904674494]
We tackle the problem of measuring group fairness under unawareness of sensitive attributes.
We show that quantification approaches are particularly suited to tackle the fairness-under-unawareness problem.
arXiv Detail & Related papers (2021-09-17T13:45:46Z)
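Quantification here means estimating class prevalence rather than classifying individuals. A standard method in that family, not necessarily the one the authors use, is Adjusted Classify & Count, which corrects the raw predicted-positive rate using the classifier's validation TPR and FPR. The data below are illustrative.

```python
import numpy as np

def adjusted_classify_and_count(val_true, val_pred, target_pred):
    """Estimate positive-class prevalence in an unlabeled target set by
    correcting the raw predicted-positive rate with validation TPR/FPR."""
    tpr = np.mean(val_pred[val_true == 1])
    fpr = np.mean(val_pred[val_true == 0])
    cc = np.mean(target_pred)                  # raw "classify and count"
    if tpr == fpr:                             # degenerate classifier: no signal
        return float(cc)
    return float(np.clip((cc - fpr) / (tpr - fpr), 0.0, 1.0))

# Toy usage: estimate the prevalence of an unobserved sensitive attribute
# from a noisy attribute classifier (true target prevalence is 0.3).
rng = np.random.default_rng(4)
val_true = rng.integers(0, 2, 1000)
val_pred = np.where(rng.random(1000) < 0.85, val_true, 1 - val_true)
target_true = (rng.random(5000) < 0.3).astype(int)
target_pred = np.where(rng.random(5000) < 0.85, target_true, 1 - target_true)
print("raw rate:", target_pred.mean(),
      "ACC estimate:", adjusted_classify_and_count(val_true, val_pred, target_pred))
```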
- Characterizing Fairness Over the Set of Good Models Under Selective Labels [69.64662540443162]
We develop a framework for characterizing predictive fairness properties over the set of models that deliver similar overall performance.
We provide tractable algorithms to compute the range of attainable group-level predictive disparities.
We extend our framework to address the empirically relevant challenge of selectively labelled data.
arXiv Detail & Related papers (2021-01-02T02:11:37Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
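The transductive step described above can be sketched as a confidence-weighted prototype refinement. In the paper the confidence function itself is meta-learned; the toy version below uses a fixed softmax temperature as a stand-in, and all shapes and names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def refine_prototypes(protos, queries, temp=1.0, steps=3):
    """Transductive refinement: soft-assign each unlabeled query to classes by
    negative distance, then recompute each prototype as a confidence-weighted
    mean of the original prototype and the queries."""
    for _ in range(steps):
        d = np.linalg.norm(queries[:, None, :] - protos[None, :, :], axis=-1)
        conf = softmax(-d / temp, axis=1)        # per-query class confidence
        # Prototype keeps unit weight; each query contributes its confidence.
        num = protos + conf.T @ queries
        den = 1.0 + conf.sum(axis=0)[:, None]
        protos = num / den
    return protos

rng = np.random.default_rng(5)
protos = rng.normal(size=(5, 64))    # 5-way episode, one prototype per class
queries = rng.normal(size=(75, 64))  # 15 unlabeled queries per class
print(refine_prototypes(protos, queries).shape)  # (5, 64)
```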