Debiasing Text Safety Classifiers through a Fairness-Aware Ensemble
- URL: http://arxiv.org/abs/2409.13705v2
- Date: Tue, 22 Oct 2024 01:01:56 GMT
- Title: Debiasing Text Safety Classifiers through a Fairness-Aware Ensemble
- Authors: Olivia Sturman, Aparna Joshi, Bhaktipriya Radharapu, Piyush Kumar, Renee Shelby,
- Abstract summary: We present a light-weight, post-processing method for mitigating counterfactual fairness in closed-source text safety classifiers.
We introduce two threshold-agnostic metrics to assess the counterfactual fairness of a model, and demonstrate how combining these metrics with Fair Data Reweighting (FDW) helps mitigate biases.
Our results show that our approach improves counterfactual fairness with minimal impact on model performance.
- Score: 2.1450827490014865
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The increasing use of large language models (LLMs) demands performant guardrails to ensure the safety of LLM inputs and outputs. When these safeguards are trained on imbalanced data, they can learn societal biases. We present a lightweight, post-processing method for improving counterfactual fairness in closed-source text safety classifiers. Our approach involves building an ensemble that not only outperforms the input classifiers and aligns them with policy, but also acts as a debiasing regularizer. We introduce two threshold-agnostic metrics to assess the counterfactual fairness of a model, and demonstrate how combining these metrics with Fair Data Reweighting (FDW) helps mitigate biases. We create an expanded OpenAI dataset and a new templated, LLM-generated dataset based on user prompts, both of which are counterfactually balanced across identity groups and cover four key areas of safety; we will work towards publicly releasing these datasets. Our results show that our approach improves counterfactual fairness with minimal impact on model performance.
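The abstract does not spell out the two threshold-agnostic metrics, but the underlying idea, scoring identity-swapped counterfactual variants of the same prompt and comparing the classifier's raw scores rather than thresholded labels, can be sketched as below. The identity-term list, function names, and pairwise-averaging scheme are illustrative assumptions, not the paper's definitions.

```python
# Illustrative sketch (not the paper's exact metrics): a threshold-agnostic
# counterfactual fairness check that compares classifier scores across
# identity-swapped variants of the same templated prompt.
from itertools import combinations
from typing import Callable, List

# Hypothetical identity terms used to generate counterfactual variants.
IDENTITY_TERMS = ["women", "men", "Muslims", "Christians", "immigrants"]

def counterfactual_variants(template: str, slot: str = "{identity}") -> List[str]:
    """Fill an identity slot in a templated prompt with each identity term."""
    return [template.replace(slot, term) for term in IDENTITY_TERMS]

def counterfactual_score_gap(
    score_fn: Callable[[str], float],  # classifier returning an unsafe score in [0, 1]
    template: str,
) -> float:
    """Mean absolute score gap over all pairs of counterfactual variants.

    Because the metric operates on raw scores rather than thresholded labels,
    it is threshold-agnostic: it flags identity-dependent score shifts even
    when every variant falls on the same side of a decision cutoff.
    """
    scores = [score_fn(v) for v in counterfactual_variants(template)]
    pairs = list(combinations(scores, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

# Example usage with a deliberately biased stand-in classifier:
if __name__ == "__main__":
    def toy_classifier(text: str) -> float:
        return 0.9 if "immigrants" in text else 0.3

    gap = counterfactual_score_gap(toy_classifier, "I hate {identity}.")
    print(f"mean counterfactual score gap: {gap:.3f}")  # a large gap signals bias
```

Under these assumptions, an ensemble could penalize large gaps as a debiasing regularizer, and Fair Data Reweighting could up-weight training examples from the identity groups with the largest gaps; the paper's exact formulation may differ.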
Related papers
- Nob-MIAs: Non-biased Membership Inference Attacks Assessment on Large Language Models with Ex-Post Dataset Construction [37.69303106863453]
Membership Inference Attacks (MIAs) aim to detect whether specific documents were used in a given Large Language Model's (LLM's) pretraining.
This paper addresses the evaluation of MIAs on LLMs with partially inferable training sets.
We propose and validate algorithms to create "non-biased" and "non-classifiable" datasets for fairer MIA assessment.
arXiv Detail & Related papers (2024-08-12T07:49:28Z)
- Editable Fairness: Fine-Grained Bias Mitigation in Language Models [52.66450426729818]
We propose a novel debiasing approach, Fairness Stamp (FAST), which enables fine-grained calibration of individual social biases.
FAST surpasses state-of-the-art baselines with superior debiasing performance.
This highlights the potential of fine-grained debiasing strategies to achieve fairness in large language models.
arXiv Detail & Related papers (2024-08-07T17:14:58Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve model alignment across different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
- Fairness Without Harm: An Influence-Guided Active Sampling Approach [32.173195437797766]
We aim to train models that mitigate group fairness disparity without causing harm to model accuracy.
Current data acquisition methods, such as fair active learning approaches, typically require annotating sensitive attributes.
We propose a tractable active data sampling algorithm that does not rely on training group annotations.
arXiv Detail & Related papers (2024-02-20T07:57:38Z)
- ChatGPT Based Data Augmentation for Improved Parameter-Efficient Debiasing of LLMs [65.9625653425636]
Large language models (LLMs) exhibit harmful social biases.
This work introduces a novel approach utilizing ChatGPT to generate synthetic training data.
arXiv Detail & Related papers (2024-02-19T01:28:48Z)
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN).
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z)
- Group Robust Classification Without Any Group Information [5.053622900542495]
This study contends that current bias-unsupervised approaches to group robustness continue to rely on group information to achieve optimal performance.
In particular, bias labels are still crucial for effective model selection, which restricts the practicality of these methods in real-world scenarios.
We propose a revised methodology for training and validating debiased models in an entirely bias-unsupervised manner.
arXiv Detail & Related papers (2023-10-28T01:29:18Z)
- Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs [59.596335292426105]
This paper collects the first open-source dataset to evaluate safeguards in large language models.
We train several BERT-like classifiers to achieve results comparable with GPT-4 on automatic safety evaluation.
arXiv Detail & Related papers (2023-08-25T14:02:12Z)
- Fairness Reprogramming [42.65700878967251]
We propose a new generic fairness learning paradigm, called FairReprogram, which incorporates the model reprogramming technique.
Specifically, FairReprogram considers the case where models cannot be changed and instead appends to the input a set of perturbations, called the fairness trigger.
We show both theoretically and empirically that the fairness trigger can effectively obscure demographic biases in the output predictions of fixed ML models; a minimal sketch of this idea appears after this list.
arXiv Detail & Related papers (2022-09-21T09:37:00Z)
- Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)
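As a companion to the Fairness Reprogramming entry above, here is a minimal sketch of the fairness-trigger idea: the deployed classifier stays frozen and only a short sequence of trigger embeddings appended to every input is trained. The module below is a hypothetical PyTorch illustration; the trigger length, initialization, and the loss used to train it are assumptions, since the summary above does not specify them.

```python
# Minimal sketch (not the paper's implementation): a "fairness trigger" that is
# appended to every input while the underlying model stays frozen.
import torch
import torch.nn as nn

class TriggeredClassifier(nn.Module):
    def __init__(self, base_model: nn.Module, embed_dim: int, trigger_len: int = 8):
        super().__init__()
        self.base_model = base_model
        for p in self.base_model.parameters():
            p.requires_grad = False  # the fixed model cannot be changed
        # The only trainable parameters: a short sequence of trigger embeddings.
        self.trigger = nn.Parameter(0.02 * torch.randn(trigger_len, embed_dim))

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim); append the shared trigger
        # to every example before calling the frozen model.
        trig = self.trigger.unsqueeze(0).expand(input_embeds.size(0), -1, -1)
        return self.base_model(torch.cat([input_embeds, trig], dim=1))
```

In this framing, the trigger would then be optimized so that demographic information can no longer be recovered from the fixed model's predictions; the specific fairness loss is not given in the summary above.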