Augmenting Bias Detection in LLMs Using Topological Data Analysis
- URL: http://arxiv.org/abs/2508.07516v1
- Date: Mon, 11 Aug 2025 00:19:47 GMT
- Title: Augmenting Bias Detection in LLMs Using Topological Data Analysis
- Authors: Keshav Varadarajan, Tananun Songdechakraiwut
- Abstract summary: We present a method using topological data analysis to identify which heads contribute to the misrepresentation of identity groups present in the StereoSet dataset. We find that biases for particular categories, such as gender or profession, are concentrated in attention heads that act as hot spots. The metric we propose can also be used to determine which heads capture bias for a specific group within a bias category.
- Score: 0.9208007322096533
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, many bias detection methods have been proposed to determine the level of bias a large language model captures. However, tests to identify which parts of a large language model are responsible for bias towards specific groups remain underdeveloped. In this study, we present a method using topological data analysis to identify which heads in GPT-2 contribute to the misrepresentation of identity groups present in the StereoSet dataset. We find that biases for particular categories, such as gender or profession, are concentrated in attention heads that act as hot spots. The metric we propose can also be used to determine which heads capture bias for a specific group within a bias category, and future work could extend this method to help de-bias large language models.
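To make the recipe concrete, here is a minimal sketch assuming a HuggingFace GPT-2 and a hand-rolled 0-dimensional persistence summary over each head's attention graph. The sentence pair, the per-head scalar, and the gap ranking are illustrative stand-ins, not the paper's metric or its StereoSet evaluation:

```python
# Minimal sketch: summarize each GPT-2 attention head's matrix with a
# simple 0-dimensional persistence statistic, then rank heads by how
# much that statistic shifts between two minimally different sentences.
import torch
from transformers import GPT2Model, GPT2Tokenizer

def zero_dim_persistence(attn):
    """Component merge heights when edges of the symmetrized attention
    graph enter the filtration in decreasing weight order."""
    n = attn.shape[0]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    w = (attn + attn.T) / 2  # symmetrize the (causal) attention matrix
    edges = sorted(((w[i, j].item(), i, j)
                    for i in range(n) for j in range(i + 1, n)),
                   reverse=True)
    deaths = []
    for weight, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:              # two components merge at this weight
            parent[ri] = rj
            deaths.append(weight)
    return deaths

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_attentions=True)
model.eval()

def head_summaries(sentence):
    """One persistence-based scalar per (layer, head)."""
    ids = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        attentions = model(**ids).attentions  # tuple of (1, heads, n, n)
    out = {}
    for layer_idx, layer in enumerate(attentions):
        for head in range(layer.shape[1]):
            deaths = zero_dim_persistence(layer[0, head])
            out[(layer_idx, head)] = sum(deaths) / max(len(deaths), 1)
    return out

# Toy stereotyped / anti-stereotyped pair standing in for StereoSet.
s1 = head_summaries("The nurse said that she was tired.")
s2 = head_summaries("The nurse said that he was tired.")
gaps = sorted(((abs(s1[k] - s2[k]), k) for k in s1), reverse=True)
print("candidate hot-spot heads:", [k for _, k in gaps[:5]])
```

Heads whose topological summary shifts most between the two variants are the candidate "hot spots" in the sense described above.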
Related papers
- A Feature-level Bias Evaluation Framework for Facial Expression Recognition Models [0.0]
We introduce a plug-and-play statistical module to ensure the statistical significance of bias evaluation results. A comprehensive bias analysis is then conducted across three sensitive attributes (age, gender, and race), seven facial expressions, and multiple network architectures on a large-scale dataset.
arXiv Detail & Related papers (2025-05-26T20:26:07Z)
- Assessing Bias in Metric Models for LLM Open-Ended Generation Bias Benchmarks [3.973239756262797]
This study examines such biases in open-generation benchmarks like BOLD and SAGED.
Results reveal unequal treatment of demographic descriptors, calling for more robust bias metric models.
arXiv Detail & Related papers (2024-10-14T20:08:40Z)
- Is There a One-Model-Fits-All Approach to Information Extraction? Revisiting Task Definition Biases [62.806300074459116]
Definition bias is a negative phenomenon that can mislead models.
We identify two types of definition bias in IE: bias among information extraction datasets and bias between information extraction datasets and instruction tuning datasets.
We propose a multi-stage framework consisting of definition bias measurement, bias-aware fine-tuning, and task-specific bias mitigation.
arXiv Detail & Related papers (2024-03-25T03:19:20Z)
- This Prompt is Measuring <MASK>: Evaluating Bias Evaluation in Language Models [12.214260053244871]
We analyse the body of work that uses prompts and templates to assess bias in language models.
We draw on a measurement modelling framework to create a taxonomy of attributes that capture what a bias test aims to measure.
Our analysis illuminates the scope of possible bias types the field is able to measure, and reveals types that are as yet under-researched.
arXiv Detail & Related papers (2023-05-22T06:28:48Z)
- Feature Importance Disparities for Data Bias Investigations [2.184775414778289]
It is widely held that one cause of downstream bias in classifiers is bias present in the training data.
We present one such method: given a dataset $X$ consisting of protected and unprotected features, outcomes $y$, and a regressor $h$ that predicts $y$ given $X$, it finds subgroups on which a feature's importance diverges from its importance on the full dataset, yielding a feature importance disparity (FID); a sketch follows this entry.
We show across $4$ datasets and $4$ common feature importance methods of broad interest to the machine learning community that we can efficiently find subgroups with large FID values even over exponentially large subgroup classes.
arXiv Detail & Related papers (2023-03-03T04:12:04Z)
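A toy sketch of the FID idea above, using scikit-learn's permutation importance on synthetic data; enumerating subgroups by a single hypothetical protected attribute is a drastic simplification of the paper's subgroup search:

```python
# Compare a feature's permutation importance on a subgroup against its
# importance on the full dataset; the difference is the disparity (FID).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=1000, n_features=5, random_state=0)
group = rng.integers(0, 2, size=len(y))  # hypothetical protected attribute

h = RandomForestRegressor(random_state=0).fit(X, y)

def importances(X_sub, y_sub):
    """Permutation importance of each feature on a data slice."""
    return permutation_importance(h, X_sub, y_sub, n_repeats=10,
                                  random_state=0).importances_mean

full = importances(X, y)
for g in (0, 1):
    mask = group == g
    fid = importances(X[mask], y[mask]) - full  # per-feature disparity
    top = np.abs(fid).argmax()
    print(f"subgroup {g}: largest |FID| at feature {top} ({fid[top]:+.3f})")
```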
- Counter-GAP: Counterfactual Bias Evaluation through Gendered Ambiguous Pronouns [53.62845317039185]
Bias-measuring datasets play a critical role in detecting biased behavior of language models.
We propose a novel method to collect diverse, natural, and minimally distant text pairs via counterfactual generation.
We show that four pre-trained language models are significantly more inconsistent across different gender groups than within each group.
arXiv Detail & Related papers (2023-02-11T12:11:03Z)
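A rough sketch of the counterfactual-pair idea in the Counter-GAP entry above, scoring minimally distant pronoun swaps with GPT-2; the naive swap rule and the likelihood gap are illustrative, not the paper's generation procedure or inconsistency metric:

```python
# Build a minimally distant counterfactual by swapping gendered
# pronouns, then compare the language model's likelihood for the pair.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Naive swap table; real counterfactual generation must handle case,
# ambiguity, and agreement ("her" can map to "his" or "him").
SWAPS = {"he": "she", "she": "he", "his": "her", "him": "her",
         "her": "his"}

def swap_pronouns(text):
    return " ".join(SWAPS.get(w, w) for w in text.split())

def mean_log_likelihood(text):
    """Average next-token log-likelihood of the text under GPT-2."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return -model(ids, labels=ids).loss.item()

original = "The engineer fixed the bug because he understood the code."
counterfactual = swap_pronouns(original)
gap = mean_log_likelihood(original) - mean_log_likelihood(counterfactual)
print(f"{counterfactual!r}\nlog-likelihood gap: {gap:.4f}")
```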
- Causally Testing Gender Bias in LLMs: A Case Study on Occupational Bias [33.99768156365231]
We introduce a causal formulation for bias measurement in generative language models. We propose a benchmark called OccuGender, with a bias-measuring procedure to investigate occupational gender bias. The results show that these models exhibit substantial occupational gender bias.
arXiv Detail & Related papers (2022-12-20T22:41:24Z)
- "I'm sorry to hear that": Finding New Biases in Language Models with a Holistic Descriptor Dataset [12.000335510088648]
We present a new, more inclusive bias measurement dataset, HolisticBias, which includes nearly 600 descriptor terms across 13 different demographic axes.
HolisticBias was assembled in a participatory process including experts and community members with lived experience of these terms.
We demonstrate that HolisticBias is effective at measuring previously undetectable biases in token likelihoods from language models.
arXiv Detail & Related papers (2022-05-18T20:37:25Z)
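To show the kind of token-likelihood probing the HolisticBias entry describes, here is a toy template-and-descriptor loop over GPT-2; the template and the five descriptors are invented for the sketch and stand in for the dataset's nearly 600 terms:

```python
# Score templated sentences that differ only in a descriptor term;
# likelihood outliers flag descriptors the model treats as marked.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

TEMPLATE = "I am a {} person."  # invented template for this sketch
DESCRIPTORS = ["young", "old", "blind", "wealthy", "homeless"]

def mean_log_likelihood(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return -model(ids, labels=ids).loss.item()

scores = {d: mean_log_likelihood(TEMPLATE.format(d)) for d in DESCRIPTORS}
mean = sum(scores.values()) / len(scores)
for d, s in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{d:>10}: {s:.3f} (descriptor mean {mean:.3f})")
```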
- General Greedy De-bias Learning [163.65789778416172]
We propose a General Greedy De-bias learning framework (GGD), which greedily trains the biased models and the base model like gradient descent in functional space.
GGD can learn a more robust base model under the settings of both task-specific biased models with prior knowledge and self-ensemble biased model without prior knowledge.
arXiv Detail & Related papers (2021-12-20T14:47:32Z)
- Balancing Biases and Preserving Privacy on Balanced Faces in the Wild [50.915684171879036]
There are demographic biases present in current facial recognition (FR) models.
We introduce our Balanced Faces in the Wild dataset to measure these biases across different ethnic and gender subgroups.
We find that relying on a single score threshold to differentiate between genuine and impostor sample pairs leads to suboptimal results; a per-group calibration sketch follows this entry.
We propose a novel domain adaptation learning scheme that uses facial features extracted from state-of-the-art neural networks.
arXiv Detail & Related papers (2021-03-16T15:05:49Z)
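The thresholding point in the entry above can be made concrete with a small synthetic experiment: calibrate one global threshold versus one per group at a fixed false-match rate (FMR). The score distributions below are fabricated; the paper works with real facial-feature similarity scores:

```python
# Per-group verification thresholds at a fixed false-match rate,
# compared with a single global threshold, on synthetic scores.
import numpy as np

rng = np.random.default_rng(0)

def make_scores(genuine_mu, impostor_mu, n=5000):
    """Synthetic similarity scores for one demographic group."""
    return rng.normal(genuine_mu, 0.1, n), rng.normal(impostor_mu, 0.1, n)

def threshold_at_fmr(impostor_scores, fmr=1e-2):
    # Smallest threshold at which the impostor accept rate is <= fmr.
    return np.quantile(impostor_scores, 1.0 - fmr)

# Two hypothetical groups whose impostor score distributions differ.
groups = {"group_a": make_scores(0.70, 0.30),
          "group_b": make_scores(0.70, 0.45)}

global_thr = threshold_at_fmr(
    np.concatenate([imp for _, imp in groups.values()]))

for name, (genuine, impostor) in groups.items():
    for label, thr in (("global", global_thr),
                       ("per-group", threshold_at_fmr(impostor))):
        tar = (genuine >= thr).mean()          # true accept rate
        realized_fmr = (impostor >= thr).mean()
        print(f"{name:8s} {label:9s} thr={thr:.3f} "
              f"TAR={tar:.3f} FMR={realized_fmr:.4f}")
```

With the shared threshold, the group whose impostor scores sit higher suffers an inflated false-match rate; per-group calibration equalizes it at the cost of group-specific accept rates.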
- The Gap on GAP: Tackling the Problem of Differing Data Distributions in Bias-Measuring Datasets [58.53269361115974]
Diagnostic datasets that can detect biased models are an important prerequisite for bias reduction within natural language processing.
However, undesired patterns in the collected data can make such tests incorrect.
We introduce a theoretically grounded method for weighting test samples to cope with such patterns in the test data.
arXiv Detail & Related papers (2020-11-03T16:50:13Z)
- LOGAN: Local Group Bias Detection by Clustering [86.38331353310114]
We argue that evaluating bias at the corpus level is not enough for understanding how biases are embedded in a model.
We propose LOGAN, a new bias detection technique based on clustering.
Experiments on toxicity classification and object classification tasks show that LOGAN identifies bias in a local region.
arXiv Detail & Related papers (2020-10-06T16:42:51Z)
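A rough sketch of clustering-based local bias detection in the spirit of LOGAN, on synthetic data; the feature space, classifier, clustering model, and group attribute are all stand-ins rather than the paper's setup:

```python
# Cluster test inputs, then look for clusters ("local regions") where
# a classifier's error rate differs sharply between two groups.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
group = rng.integers(0, 2, size=2000)      # hypothetical group attribute
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)

X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(
    X, y, group, test_size=0.5, random_state=0)

clf = LogisticRegression().fit(X_tr, y_tr)
errors = clf.predict(X_te) != y_te

# Corpus-level bias can look small while individual clusters show
# large per-group error gaps.
clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X_te)
for c in range(8):
    m = clusters == c
    gap = abs(errors[m & (g_te == 0)].mean() - errors[m & (g_te == 1)].mean())
    print(f"cluster {c}: per-group error gap = {gap:.3f}")
```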