Fair Hate Speech Detection through Evaluation of Social Group
Counterfactuals
- URL: http://arxiv.org/abs/2010.12779v1
- Date: Sat, 24 Oct 2020 04:51:47 GMT
- Title: Fair Hate Speech Detection through Evaluation of Social Group
Counterfactuals
- Authors: Aida Mostafazadeh Davani, Ali Omrani, Brendan Kennedy, Mohammad Atari,
Xiang Ren, Morteza Dehghani
- Abstract summary: Approaches for mitigating bias in supervised models are designed to reduce models' dependence on specific sensitive features of the input data.
In the case of hate speech detection, it is not always desirable to equalize the effects of social groups.
Counterfactual token fairness for a mentioned social group evaluates the model's predictions as to whether they are the same for (a) the actual sentence and (b) a counterfactual instance.
Our approach assures robust model predictions for counterfactuals that imply similar meaning as the actual sentence.
- Score: 21.375422346539004
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Approaches for mitigating bias in supervised models are designed to reduce
models' dependence on specific sensitive features of the input data, e.g.,
mentioned social groups. However, in the case of hate speech detection, it is
not always desirable to equalize the effects of social groups because of their
essential role in distinguishing outgroup-derogatory hate, such that particular
types of hateful rhetoric carry the intended meaning only when contextualized
around certain social group tokens. Counterfactual token fairness for a
mentioned social group evaluates the model's predictions as to whether they are
the same for (a) the actual sentence and (b) a counterfactual instance, which
is generated by changing the mentioned social group in the sentence. Our
approach assures robust model predictions for counterfactuals that imply
similar meaning as the actual sentence. To quantify the similarity of a
sentence and its counterfactual, we compare their likelihood score calculated
by generative language models. By equalizing model behaviors on each sentence
and its counterfactuals, we mitigate bias in the proposed model while
preserving the overall classification performance.
Related papers
- Counterfactual Generation from Language Models [64.55296662926919]
We show that counterfactual reasoning is conceptually distinct from interventions.
We propose a framework for generating true string counterfactuals.
Our experiments demonstrate that the approach produces meaningful counterfactuals.
arXiv Detail & Related papers (2024-11-11T17:57:30Z) - The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Pre-trained Language Models [78.69526166193236]
Pre-trained Language models (PLMs) have been acknowledged to contain harmful information, such as social biases.
We propose sc Social Bias Neurons to accurately pinpoint units (i.e., neurons) in a language model that can be attributed to undesirable behavior, such as social bias.
As measured by prior metrics from StereoSet, our model achieves a higher degree of fairness while maintaining language modeling ability with low cost.
arXiv Detail & Related papers (2024-06-14T15:41:06Z) - SocialStigmaQA: A Benchmark to Uncover Stigma Amplification in
Generative Language Models [8.211129045180636]
We introduce a benchmark meant to capture the amplification of social bias, via stigmas, in generative language models.
Our benchmark, SocialStigmaQA, contains roughly 10K prompts, with a variety of prompt styles, carefully constructed to test for both social bias and model robustness.
We find that the proportion of socially biased output ranges from 45% to 59% across a variety of decoding strategies and prompting styles.
arXiv Detail & Related papers (2023-12-12T18:27:44Z) - Social Bias Probing: Fairness Benchmarking for Language Models [38.180696489079985]
This paper proposes a novel framework for probing language models for social biases by assessing disparate treatment.
We curate SoFa, a large-scale benchmark designed to address the limitations of existing fairness collections.
We show that biases within language models are more nuanced than acknowledged, indicating a broader scope of encoded biases than previously recognized.
arXiv Detail & Related papers (2023-11-15T16:35:59Z) - Logic Against Bias: Textual Entailment Mitigates Stereotypical Sentence
Reasoning [8.990338162517086]
We describe several kinds of stereotypes concerning different communities that are present in popular sentence representation models.
By comparing strong pretrained models based on text similarity with textual entailment learning, we conclude that the explicit logic learning with textual entailment can significantly reduce bias.
arXiv Detail & Related papers (2023-03-10T02:52:13Z) - Estimating Structural Disparities for Face Models [54.062512989859265]
In machine learning, disparity metrics are often defined by measuring the difference in the performance or outcome of a model, across different sub-populations.
We explore performing such analysis on computer vision models trained on human faces, and on tasks such as face attribute prediction and affect estimation.
arXiv Detail & Related papers (2022-04-13T05:30:53Z) - Measuring Fairness of Text Classifiers via Prediction Sensitivity [63.56554964580627]
ACCUMULATED PREDICTION SENSITIVITY measures fairness in machine learning models based on the model's prediction sensitivity to perturbations in input features.
We show that the metric can be theoretically linked with a specific notion of group fairness (statistical parity) and individual fairness.
arXiv Detail & Related papers (2022-03-16T15:00:33Z) - Fair Group-Shared Representations with Normalizing Flows [68.29997072804537]
We develop a fair representation learning algorithm which is able to map individuals belonging to different groups in a single group.
We show experimentally that our methodology is competitive with other fair representation learning algorithms.
arXiv Detail & Related papers (2022-01-17T10:49:49Z) - Improving Counterfactual Generation for Fair Hate Speech Detection [26.79268141793483]
Bias mitigation approaches reduce models' dependence on sensitive features of data, such as social group tokens (SGTs)
In hate speech detection, however, equalizing model predictions may ignore important differences among targeted social groups.
Here, we rely on counterfactual fairness and equalize predictions among counterfactuals, generated by changing the SGTs.
arXiv Detail & Related papers (2021-08-03T19:47:27Z) - Fundamental Tradeoffs between Invariance and Sensitivity to Adversarial
Perturbations [65.05561023880351]
Adversarial examples are malicious inputs crafted to induce misclassification.
This paper studies a complementary failure mode, invariance-based adversarial examples.
We show that defenses against sensitivity-based attacks actively harm a model's accuracy on invariance-based attacks.
arXiv Detail & Related papers (2020-02-11T18:50:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.