Debiasing Text Safety Classifiers through a Fairness-Aware Ensemble
- URL: http://arxiv.org/abs/2409.13705v2
- Date: Tue, 22 Oct 2024 01:01:56 GMT
- Title: Debiasing Text Safety Classifiers through a Fairness-Aware Ensemble
- Authors: Olivia Sturman, Aparna Joshi, Bhaktipriya Radharapu, Piyush Kumar, Renee Shelby,
- Abstract summary: We present a light-weight, post-processing method for mitigating counterfactual fairness in closed-source text safety classifiers.
We introduce two threshold-agnostic metrics to assess the counterfactual fairness of a model, and demonstrate how combining these metrics with Fair Data Reweighting (FDW) helps mitigate biases.
Our results show that our approach improves counterfactual fairness with minimal impact on model performance.
- Score: 2.1450827490014865
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The increasing use of large language models (LLMs) demands performant guardrails to ensure the safety of LLM inputs and outputs. When these safeguards are trained on imbalanced data, they can learn societal biases. We present a lightweight, post-processing method for improving counterfactual fairness in closed-source text safety classifiers. Our approach involves building an ensemble that not only outperforms the input classifiers and aligns them with policy, but also acts as a debiasing regularizer. We introduce two threshold-agnostic metrics to assess the counterfactual fairness of a model, and demonstrate how combining these metrics with Fair Data Reweighting (FDW) helps mitigate biases. We create an expanded OpenAI dataset and a new templated, LLM-generated dataset based on user prompts, both of which are counterfactually balanced across identity groups and cover four key areas of safety; we will work towards publicly releasing these datasets. Our results show that our approach improves counterfactual fairness with minimal impact on model performance.
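The abstract does not spell out the two threshold-agnostic metrics, but the underlying idea, scoring identity-swapped counterfactual variants of the same prompt and comparing the classifier's raw scores rather than thresholded labels, can be sketched as below. The identity-term list, function names, and pairwise-averaging scheme are illustrative assumptions, not the paper's definitions.

```python
# Illustrative sketch (not the paper's exact metrics): a threshold-agnostic
# counterfactual fairness check that compares classifier scores across
# identity-swapped variants of the same templated prompt.
from itertools import combinations
from typing import Callable, List

# Hypothetical identity terms used to generate counterfactual variants.
IDENTITY_TERMS = ["women", "men", "Muslims", "Christians", "immigrants"]

def counterfactual_variants(template: str, slot: str = "{identity}") -> List[str]:
    """Fill an identity slot in a templated prompt with each identity term."""
    return [template.replace(slot, term) for term in IDENTITY_TERMS]

def counterfactual_score_gap(
    score_fn: Callable[[str], float],  # classifier returning an unsafe score in [0, 1]
    template: str,
) -> float:
    """Mean absolute score gap over all pairs of counterfactual variants.

    Because the metric operates on raw scores rather than thresholded labels,
    it is threshold-agnostic: it flags identity-dependent score shifts even
    when every variant falls on the same side of a decision cutoff.
    """
    scores = [score_fn(v) for v in counterfactual_variants(template)]
    pairs = list(combinations(scores, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

# Example usage with a deliberately biased stand-in classifier:
if __name__ == "__main__":
    def toy_classifier(text: str) -> float:
        return 0.9 if "immigrants" in text else 0.3

    gap = counterfactual_score_gap(toy_classifier, "I hate {identity}.")
    print(f"mean counterfactual score gap: {gap:.3f}")  # a large gap signals bias
```

Under these assumptions, an ensemble could penalize large gaps as a debiasing regularizer, and Fair Data Reweighting could up-weight training examples from the identity groups with the largest gaps; the paper's exact formulation may differ.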
Related papers
- Nob-MIAs: Non-biased Membership Inference Attacks Assessment on Large Language Models with Ex-Post Dataset Construction [37.69303106863453]
Membership Inference Attacks (MIAs) aim to detect whether specific documents were used in a given Large Language Model's (LLM's) pretraining.
This paper addresses the evaluation of MIAs on LLMs with partially inferable training sets.
We propose and validate algorithms to create "non-biased" and "non-classifiable" datasets for fairer MIA assessment.
arXiv Detail & Related papers (2024-08-12T07:49:28Z)
- Editable Fairness: Fine-Grained Bias Mitigation in Language Models [52.66450426729818]
We propose a novel debiasing approach, Fairness Stamp (FAST), which enables fine-grained calibration of individual social biases.
FAST surpasses state-of-the-art baselines with superior debiasing performance.
This highlights the potential of fine-grained debiasing strategies to achieve fairness in large language models.
arXiv Detail & Related papers (2024-08-07T17:14:58Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve model alignment across different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
- Fairness Without Harm: An Influence-Guided Active Sampling Approach [32.173195437797766]
We aim to train models that mitigate group fairness disparity without causing harm to model accuracy.
Current data acquisition methods, such as fair active learning approaches, typically require annotating sensitive attributes.
We propose a tractable active data sampling algorithm that does not rely on training group annotations.
arXiv Detail & Related papers (2024-02-20T07:57:38Z)
- ChatGPT Based Data Augmentation for Improved Parameter-Efficient Debiasing of LLMs [65.9625653425636]
Large language models (LLMs) exhibit harmful social biases.
This work introduces a novel approach utilizing ChatGPT to generate synthetic training data.
arXiv Detail & Related papers (2024-02-19T01:28:48Z)
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN).
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z)
- Group Robust Classification Without Any Group Information [5.053622900542495]
This study contends that current bias-unsupervised approaches to group robustness continue to rely on group information to achieve optimal performance.
In particular, bias labels are still crucial for effective model selection, which restricts the practicality of these methods in real-world scenarios.
We propose a revised methodology for training and validating debiased models in an entirely bias-unsupervised manner.
arXiv Detail & Related papers (2023-10-28T01:29:18Z)
- Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs [59.596335292426105]
This paper collects the first open-source dataset to evaluate safeguards in large language models.
We train several BERT-like classifiers to achieve results comparable with GPT-4 on automatic safety evaluation.
arXiv Detail & Related papers (2023-08-25T14:02:12Z)
- Fairness Reprogramming [42.65700878967251]
We propose a new generic fairness learning paradigm, called FairReprogram, which incorporates the model reprogramming technique.
Specifically, FairReprogram considers the case where models cannot be changed and instead appends to the input a set of perturbations, called the fairness trigger.
We show both theoretically and empirically that the fairness trigger can effectively obscure demographic biases in the output predictions of fixed ML models; a minimal sketch of this idea appears after this list.
arXiv Detail & Related papers (2022-09-21T09:37:00Z)
- Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)
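As a companion to the Fairness Reprogramming entry above, here is a minimal sketch of the fairness-trigger idea: the deployed classifier stays frozen and only a short sequence of trigger embeddings appended to every input is trained. The module below is a hypothetical PyTorch illustration; the trigger length, initialization, and the loss used to train it are assumptions, since the summary above does not specify them.

```python
# Minimal sketch (not the paper's implementation): a "fairness trigger" that is
# appended to every input while the underlying model stays frozen.
import torch
import torch.nn as nn

class TriggeredClassifier(nn.Module):
    def __init__(self, base_model: nn.Module, embed_dim: int, trigger_len: int = 8):
        super().__init__()
        self.base_model = base_model
        for p in self.base_model.parameters():
            p.requires_grad = False  # the fixed model cannot be changed
        # The only trainable parameters: a short sequence of trigger embeddings.
        self.trigger = nn.Parameter(0.02 * torch.randn(trigger_len, embed_dim))

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim); append the shared trigger
        # to every example before calling the frozen model.
        trig = self.trigger.unsqueeze(0).expand(input_embeds.size(0), -1, -1)
        return self.base_model(torch.cat([input_embeds, trig], dim=1))
```

In this framing, the trigger would then be optimized so that demographic information can no longer be recovered from the fixed model's predictions; the specific fairness loss is not given in the summary above.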