Interpretable Debiasing of Vision-Language Models for Social Fairness
- URL: http://arxiv.org/abs/2602.24014v1
- Date: Fri, 27 Feb 2026 13:37:11 GMT
- Title: Interpretable Debiasing of Vision-Language Models for Social Fairness
- Authors: Na Min An, Yoonna Jang, Yusuke Hirota, Ryo Hachiuma, Isabelle Augenstein, Hyunjung Shim
- Abstract summary: We introduce an interpretable, model-agnostic bias mitigation framework, DeBiasLens, that localizes social attribute neurons in Vision-Language models. We train SAEs on facial image or caption datasets without corresponding social attribute labels to uncover neurons highly responsive to specific demographics. Our research lays the groundwork for future auditing tools, prioritizing social fairness in emerging real-world AI systems.
- Score: 55.85977929985967
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid advancement of Vision-Language models (VLMs) has raised growing concerns that their black-box reasoning processes could lead to unintended forms of social bias. Current debiasing approaches focus on mitigating surface-level bias signals through post-hoc learning or test-time algorithms, while leaving the internal dynamics of the model largely unexplored. In this work, we introduce an interpretable, model-agnostic bias mitigation framework, DeBiasLens, that localizes social attribute neurons in VLMs through sparse autoencoders (SAEs) applied to multimodal encoders. Building upon the disentanglement ability of SAEs, we train them on facial image or caption datasets without corresponding social attribute labels to uncover neurons highly responsive to specific demographics, including those that are underrepresented. By selectively deactivating the social neurons most strongly tied to bias for each group, we effectively mitigate socially biased behaviors of VLMs without degrading their semantic knowledge. Our research lays the groundwork for future auditing tools, prioritizing social fairness in emerging real-world AI systems.
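To make the recipe concrete, below is a minimal, hypothetical PyTorch sketch of the general approach the abstract describes: train a sparse autoencoder (SAE) on frozen encoder activations without attribute labels, afterwards rank latent units by how strongly they respond to a given demographic group, and zero those units before decoding back into the model's representation space. This is not the authors' released implementation; every name here (`SparseAutoencoder`, `rank_social_units`, `debias_embedding`) and the mean-activation-gap ranking heuristic are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete SAE with a ReLU latent; a generic stand-in, not the
    paper's architecture."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        z = torch.relu(self.encoder(x))   # sparse latent code
        return self.decoder(z), z

def sae_loss(x, x_hat, z, l1_coeff: float = 1e-3):
    # Standard SAE objective: reconstruction error plus L1 sparsity penalty.
    return ((x - x_hat) ** 2).mean() + l1_coeff * z.abs().mean()

@torch.no_grad()
def rank_social_units(sae, group_acts, background_acts, top_k: int = 16):
    """Assumed heuristic: pick units whose mean activation on one demographic
    group's activations most exceeds their mean on a background set."""
    _, z_group = sae(group_acts)
    _, z_bg = sae(background_acts)
    gap = z_group.mean(dim=0) - z_bg.mean(dim=0)
    return torch.topk(gap, top_k).indices

@torch.no_grad()
def debias_embedding(sae, x, social_units):
    """Selective deactivation: zero the identified units, then decode."""
    z = torch.relu(sae.encoder(x))
    z[:, social_units] = 0.0
    return sae.decoder(z)
```

In use, `debias_embedding` would stand in for the raw encoder output fed to downstream heads. Note that, consistent with the abstract, group information enters only after SAE training, when choosing which units to silence.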
Related papers
- SocialFusion: Addressing Social Degradation in Pre-trained Vision-Language Models [34.928133808112925]
We show that pre-trained vision-language models (VLMs) struggle to unify and learn multiple social perception tasks simultaneously. We propose SocialFusion, a unified framework that learns a minimal connection between a frozen visual encoder and a language model. Our findings suggest that current VLM pre-training strategies may be detrimental to acquiring general social competence.
arXiv Detail & Related papers (2025-11-30T23:54:54Z) - Addressing Bias in LLMs: Strategies and Application to Fair AI-based Recruitment [49.81946749379338]
This work seeks to analyze the capacity of Transformer-based systems to learn demographic biases present in the data. We propose a privacy-enhancing framework to reduce gender information from the learning pipeline as a way to mitigate biased behaviors in the final tools; a hedged sketch of one standard way to implement this idea appears after the list below.
arXiv Detail & Related papers (2025-06-13T15:29:43Z) - Interpreting Social Bias in LVLMs via Information Flow Analysis and Multi-Round Dialogue Evaluation [1.7997395646080083]
Large Vision-Language Models (LVLMs) have achieved remarkable progress in multimodal tasks, yet they also exhibit notable social biases. We propose an explanatory framework that combines information flow analysis with multi-round dialogue evaluation. Experiments reveal that LVLMs exhibit systematic disparities in information usage when processing images of different demographic groups.
arXiv Detail & Related papers (2025-05-27T12:28:44Z) - Social Debiasing for Fair Multi-modal LLMs [59.61512883471714]
Multi-modal Large Language Models (MLLMs) have dramatically advanced the research field and delivered powerful vision-language understanding capabilities. These models often inherit deep-rooted social biases from their training data, leading to uncomfortable responses with respect to attributes such as race and gender. This paper addresses the issue of social biases in MLLMs by introducing a comprehensive counterfactual dataset with multiple social concepts.
arXiv Detail & Related papers (2024-08-13T02:08:32Z) - The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Pre-trained Language Models [78.69526166193236]
Pre-trained Language models (PLMs) have been acknowledged to contain harmful information, such as social biases.
We propose Social Bias Neurons to accurately pinpoint units (i.e., neurons) in a language model that can be attributed to undesirable behavior, such as social bias.
As measured by prior metrics from StereoSet, our model achieves a higher degree of fairness while maintaining language modeling ability at low cost.
arXiv Detail & Related papers (2024-06-14T15:41:06Z) - Survey of Social Bias in Vision-Language Models [65.44579542312489]
This survey aims to provide researchers with high-level insight into the similarities and differences of social bias studies in pre-trained models across NLP, CV, and VL.
The findings and recommendations presented here can benefit the ML community, fostering the development of fairer and non-biased AI models.
arXiv Detail & Related papers (2023-09-24T15:34:56Z) - Bias and Fairness in Large Language Models: A Survey [73.87651986156006]
We present a comprehensive survey of bias evaluation and mitigation techniques for large language models (LLMs).
We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing.
We then unify the literature by proposing three intuitive taxonomies: two for bias evaluation and one for mitigation.
arXiv Detail & Related papers (2023-09-02T00:32:55Z) - Social Processes: Self-Supervised Forecasting of Nonverbal Cues in Social Conversations [22.302509912465077]
We take the first step in the direction of a bottom-up self-supervised approach in the domain of social human interactions.
We formulate the task of Social Cue Forecasting to leverage the larger amount of unlabeled low-level behavior cues.
We propose the Social Process (SP) models: socially aware sequence-to-sequence (Seq2Seq) models within the Neural Process (NP) family.
arXiv Detail & Related papers (2021-07-28T18:01:08Z)
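One entry above, "Addressing Bias in LLMs: Strategies and Application to Fair AI-based Recruitment", proposes reducing gender information in the learning pipeline. Below is a minimal sketch of one common way to realize that idea, assuming adversarial feature removal with a gradient-reversal layer; the cited paper may use a different mechanism, and all names here (`GradReverse`, `FairClassifier`, the probe head) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign on the backward
    pass, so the encoder is trained to remove what the adversary predicts."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class FairClassifier(nn.Module):
    def __init__(self, d_in: int, d_hid: int, n_classes: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hid), nn.ReLU())
        self.head = nn.Linear(d_hid, n_classes)   # main task head
        self.adversary = nn.Linear(d_hid, 2)      # gender probe

    def forward(self, x, lambd: float = 1.0):
        h = self.encoder(x)
        return self.head(h), self.adversary(GradReverse.apply(h, lambd))

# Joint objective: minimize the task loss while the reversed gradient pushes
# the encoder to make gender unpredictable from its features.
model = FairClassifier(d_in=512, d_hid=128, n_classes=2)
x = torch.randn(8, 512)
task_logits, gender_logits = model(x)
loss = nn.functional.cross_entropy(task_logits, torch.randint(0, 2, (8,))) \
     + nn.functional.cross_entropy(gender_logits, torch.randint(0, 2, (8,)))
loss.backward()
```

The reversed gradient makes the encoder and the probe adversaries: the probe learns to predict gender from the features, while the encoder receives the opposite gradient and learns to strip that signal out.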