Related papers: Towards Resource Efficient and Interpretable Bias Mitigation in Large Language Models

Towards Resource Efficient and Interpretable Bias Mitigation in Large Language Models

URL: http://arxiv.org/abs/2412.01711v1
Date: Mon, 02 Dec 2024 16:56:08 GMT
Title: Towards Resource Efficient and Interpretable Bias Mitigation in Large Language Models
Authors: Schrasing Tong, Eliott Zemour, Rawisara Lohanimit, Lalana Kagal,
Abstract summary: Large language models (LLMs) have been observed to perpetuate unwanted biases in training data.<n>In this paper, we mitigate bias by leveraging small biased and anti-biased expert models to obtain a debiasing signal.<n> Experiments on mitigating gender, race, and religion biases show a reduction in bias on several local and global bias metrics.
Score: 1.787433808079955
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Although large language models (LLMs) have demonstrated their effectiveness in a wide range of applications, they have also been observed to perpetuate unwanted biases present in the training data, potentially leading to harm for marginalized communities. In this paper, we mitigate bias by leveraging small biased and anti-biased expert models to obtain a debiasing signal that will be added to the LLM output at decoding-time. This approach combines resource efficiency with interpretability and can be optimized for mitigating specific types of bias, depending on the target use case. Experiments on mitigating gender, race, and religion biases show a reduction in bias on several local and global bias metrics while preserving language model performance.

Related papers

No Free Lunch in Language Model Bias Mitigation? Targeted Bias Reduction Can Exacerbate Unmitigated LLM Biases [6.184434080778806]
Large Language Models (LLMs) inherit societal biases from their training data, potentially leading to harmful or unfair outputs.<n>We study four bias mitigation techniques applied across ten models from seven model families, and we explore racial, religious, profession- and gender-related biases.<n>Our results consistently show that while targeted mitigation can sometimes reduce bias in the intended dimension, it frequently leads to unintended and often negative consequences in others.
arXiv Detail & Related papers (2025-11-23T22:21:18Z)
Detecting Prefix Bias in LLM-based Reward Models [4.596249232904721]
We introduce novel methods to detect and evaluate prefix bias in reward models trained on preference datasets.<n>We leverage these metrics to reveal significant biases in preference models across racial and gender dimensions.<n>Our findings highlight the critical need for bias-aware dataset design and evaluation in developing fair and reliable reward models.
arXiv Detail & Related papers (2025-05-13T21:50:03Z)
Beneath the Surface: How Large Language Models Reflect Hidden Bias [7.026605828163043]
We introduce the Hidden Bias Benchmark (HBB), a novel dataset designed to assess hidden bias that bias concepts are hidden within naturalistic, subtly framed contexts in real-world scenarios. We analyze six state-of-the-art Large Language Models, revealing that while models reduce bias in response to overt bias, they continue to reinforce biases in nuanced settings.
arXiv Detail & Related papers (2025-02-27T04:25:54Z)
Bias Similarity Across Large Language Models [32.0365189539138]
We analyze bias through output distribution across multiple dimensions using two datasets (4K and 1M questions) Our results show that fine-tuning has minimal impact on output distributions, and proprietary models tend to overly response as unknowns to minimize bias, compromising accuracy and utility. Open-source models like Llama3-Chat and Gemma2-it demonstrate fairness comparable to proprietary models like GPT-4, challenging the assumption that larger, closed-source models are inherently less biased.
arXiv Detail & Related papers (2024-10-15T19:21:14Z)
REFINE-LM: Mitigating Language Model Stereotypes via Reinforcement Learning [18.064064773660174]
We introduce REFINE-LM, a debiasing method that uses reinforcement learning to handle different types of biases without any fine-tuning. By training a simple model on top of the word probability distribution of a LM, our bias reinforcement learning method enables model debiasing without human annotations. Experiments conducted on a wide range of models, including several LMs, show that our method significantly reduces stereotypical biases while preserving LMs performance.
arXiv Detail & Related papers (2024-08-18T14:08:31Z)
BiasDPO: Mitigating Bias in Language Models through Direct Preference Optimization [0.0]
Large Language Models (LLMs) have become pivotal in advancing natural language processing, yet their potential to perpetuate biases poses significant concerns. This paper introduces a new framework employing Direct Preference Optimization (DPO) to mitigate gender, racial, and religious biases in English text. By developing a loss function that favors less biased over biased completions, our approach cultivates a preference for respectful and non-discriminatory language.
arXiv Detail & Related papers (2024-07-18T22:32:20Z)
ChatGPT Based Data Augmentation for Improved Parameter-Efficient Debiasing of LLMs [65.9625653425636]
Large Language models (LLMs) exhibit harmful social biases. This work introduces a novel approach utilizing ChatGPT to generate synthetic training data.
arXiv Detail & Related papers (2024-02-19T01:28:48Z)
GPTBIAS: A Comprehensive Framework for Evaluating Bias in Large Language Models [83.30078426829627]
Large language models (LLMs) have gained popularity and are being widely adopted by a large user community. The existing evaluation methods have many constraints, and their results exhibit a limited degree of interpretability. We propose a bias evaluation framework named GPTBIAS that leverages the high performance of LLMs to assess bias in models.
arXiv Detail & Related papers (2023-12-11T12:02:14Z)
Improving Bias Mitigation through Bias Experts in Natural Language Understanding [10.363406065066538]
We propose a new debiasing framework that introduces binary classifiers between the auxiliary model and the main model. Our proposed strategy improves the bias identification ability of the auxiliary model.
arXiv Detail & Related papers (2023-12-06T16:15:00Z)
Fast Model Debias with Machine Unlearning [54.32026474971696]
Deep neural networks might behave in a biased manner in many real-world scenarios. Existing debiasing methods suffer from high costs in bias labeling or model re-training. We propose a fast model debiasing framework (FMD) which offers an efficient approach to identify, evaluate and remove biases.
arXiv Detail & Related papers (2023-10-19T08:10:57Z)
Soft-prompt Tuning for Large Language Models to Evaluate Bias [0.03141085922386211]
Using soft-prompts to evaluate bias gives us the extra advantage of avoiding the human-bias injection. We check the model biases on different sensitive attributes using the group fairness (bias) and find interesting bias patterns.
arXiv Detail & Related papers (2023-06-07T19:11:25Z)
Debiasing Vision-Language Models via Biased Prompts [79.04467131711775]
We propose a general approach for debiasing vision-language foundation models by projecting out biased directions in the text embedding. We show that debiasing only the text embedding with a calibrated projection matrix suffices to yield robust classifiers and fair generative models.
arXiv Detail & Related papers (2023-01-31T20:09:33Z)
Feature-Level Debiased Natural Language Understanding [86.8751772146264]
Existing natural language understanding (NLU) models often rely on dataset biases to achieve high performance on specific datasets. We propose debiasing contrastive learning (DCT) to mitigate biased latent features and neglect the dynamic nature of bias. DCT outperforms state-of-the-art baselines on out-of-distribution datasets while maintaining in-distribution performance.
arXiv Detail & Related papers (2022-12-11T06:16:14Z)
Balancing out Bias: Achieving Fairness Through Training Reweighting [58.201275105195485]
Bias in natural language processing arises from models learning characteristics of the author such as gender and race. Existing methods for mitigating and measuring bias do not directly account for correlations between author demographics and linguistic variables. This paper introduces a very simple but highly effective method for countering bias using instance reweighting.
arXiv Detail & Related papers (2021-09-16T23:40:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.