Semantic and Structural Analysis of Implicit Biases in Large Language Models: An Interpretable Approach
- URL: http://arxiv.org/abs/2508.06155v1
- Date: Fri, 08 Aug 2025 09:21:10 GMT
- Title: Semantic and Structural Analysis of Implicit Biases in Large Language Models: An Interpretable Approach
- Authors: Renhan Zhang, Lian Lian, Zhen Qi, Guiran Liu,
- Abstract summary: The paper proposes an interpretable bias detection method aimed at identifying hidden social biases in model outputs. The method combines nested semantic representation with a contextual contrast mechanism. The evaluation focuses on several key metrics, such as bias detection accuracy, semantic consistency, and contextual sensitivity.
- Score: 1.5749416770494704
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses the issue of implicit stereotypes that may arise during the generation process of large language models. It proposes an interpretable bias detection method aimed at identifying hidden social biases in model outputs, especially those semantic tendencies that are not easily captured through explicit linguistic features. The method combines nested semantic representation with a contextual contrast mechanism. It extracts latent bias features from the vector space structure of model outputs. Using attention weight perturbation, it analyzes the model's sensitivity to specific social attribute terms, thereby revealing the semantic pathways through which bias is formed. To validate the effectiveness of the method, this study uses the StereoSet dataset, which covers multiple stereotype dimensions including gender, profession, religion, and race. The evaluation focuses on several key metrics, such as bias detection accuracy, semantic consistency, and contextual sensitivity. Experimental results show that the proposed method achieves strong detection performance across various dimensions. It can accurately identify bias differences between semantically similar texts while maintaining high semantic alignment and output stability. The method also demonstrates high interpretability in its structural design. It helps uncover the internal bias association mechanisms within language models. This provides a more transparent and reliable technical foundation for bias detection. The approach is suitable for real-world applications where high trustworthiness of generated content is required.
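To make the attention-weight perturbation step concrete, here is a minimal, self-contained sketch of the idea: damp the attention a toy single-head layer pays to one social-attribute token, renormalize, and score sensitivity as the shift in the contextual representations. All weights are random stand-ins, and every name (`attend`, `attribute_idx`, the damping factor) is hypothetical; this illustrates the technique the abstract names, not the authors' implementation.

```python
# Toy sketch of attention-weight perturbation (random weights, single head).
# NOT the paper's implementation; names and the damping rule are assumptions.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16                      # toy sequence and hidden size
X = rng.normal(size=(seq_len, d_model))       # stand-in token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

def attend(X, damp_idx=None, damp=0.1):
    """Single-head attention; optionally damp attention to one key position."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d_model)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)        # row-wise softmax
    if damp_idx is not None:
        w[:, damp_idx] *= damp                # perturb weight on one token
        w /= w.sum(axis=-1, keepdims=True)    # renormalize each row
    return w @ V

attribute_idx = 1                             # position of a social-attribute term
base = attend(X)
pert = attend(X, damp_idx=attribute_idx)

# Sensitivity score: mean cosine distance between baseline and perturbed states.
cos = (base * pert).sum(axis=1) / (
    np.linalg.norm(base, axis=1) * np.linalg.norm(pert, axis=1))
print("perturbation sensitivity:", float((1 - cos).mean()))
```

Applied to a trained model, the same damping would hook each layer's attention matrices; attribute terms whose damping shifts the output most would mark the semantic pathways of bias formation the abstract describes.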
Related papers
- Explaining News Bias Detection: A Comparative SHAP Analysis of Transformer Model Decision Mechanisms [0.2538209532048867]
We present a comparative interpretability study of two bias detection models: a bias detector fine-tuned on the BABE dataset and a domain-adapted pre-trained RoBERTa model fine-tuned on the BABE dataset.
We analyze word-level attributions across correct and incorrect predictions to characterize how different model architectures operationalize linguistic bias.
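The word-level attribution analysis described here follows the shap library's standard text-explanation pattern. Below is a hedged sketch: the study's BABE-fine-tuned checkpoints are not named here, so a public sentiment model stands in, and the example sentence is invented.

```python
# Hedged sketch of word-level SHAP attributions for a text classifier.
# The checkpoint is a public stand-in for the paper's BABE-fine-tuned detectors.
import shap
from transformers import pipeline

clf = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english",
               top_k=None)                    # return scores for every class
explainer = shap.Explainer(clf)               # shap picks a text masker itself
shap_values = explainer(["The senator's reckless scheme wasted public money."])

# Per-token attributions toward class 0; large magnitudes mark the words that
# drive the prediction, which is what the comparative analysis inspects.
print(list(zip(shap_values.data[0], shap_values.values[0][:, 0])))
```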
arXiv Detail & Related papers (2025-12-29T19:58:11Z)
- SCALEX: Scalable Concept and Latent Exploration for Diffusion Models [59.86284983662119]
Image generation models frequently encode social biases, including stereotypes tied to gender, race, and profession.
We introduce SCALEX, a framework for scalable and automated exploration of diffusion model latent spaces.
It extracts semantically meaningful directions from H-space using only natural language prompts, enabling zero-shot interpretation without retraining or labelling.
arXiv Detail & Related papers (2025-11-13T22:02:44Z)
- Reliable Cross-modal Alignment via Prototype Iterative Construction [40.09297916971621]
Cross-modal alignment is an important multi-modal task, aiming to bridge the semantic gap between different modalities.
Conventional methods implicitly assume embeddings contain solely semantic information, ignoring the impact of non-semantic information during alignment.
We propose PICO, a novel framework for suppressing style interference during embedding interaction.
arXiv Detail & Related papers (2025-10-13T09:08:27Z)
- Target-oriented Multimodal Sentiment Classification with Counterfactual-enhanced Debiasing [5.0175188046562385]
Target-oriented multimodal sentiment classification seeks to predict sentiment polarity for specific targets from image-text pairs.
Existing works often over-rely on textual content and fail to consider dataset biases.
We introduce a novel counterfactual-enhanced debiasing framework to reduce such spurious correlations.
arXiv Detail & Related papers (2025-09-11T05:40:53Z)
- NBIAS: A Natural Language Processing Framework for Bias Identification in Text [9.486702261615166]
Bias in textual data can lead to skewed interpretations and outcomes when the data is used.
An algorithm trained on biased data may end up making decisions that disproportionately impact a certain group of people.
We develop a comprehensive framework, NBIAS, that consists of four main layers: data, corpus construction, model development, and evaluation.
arXiv Detail & Related papers (2023-08-03T10:48:30Z)
- Fixing confirmation bias in feature attribution methods via semantic match [4.733072355085082]
We argue that a structured approach is required to test whether our hypotheses about the model are confirmed by the feature attributions.
This is what we call the "semantic match" between human concepts and (sub-symbolic) explanations.
arXiv Detail & Related papers (2023-07-03T09:50:08Z)
- Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing such interactions.
This suggests LMs may potentially serve as more useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z)
- Mind Your Bias: A Critical Review of Bias Detection Methods for Contextual Language Models [2.170169149901781]
We conduct a rigorous analysis and comparison of bias detection methods for contextual language models.
Our results show that minor design and implementation decisions (or errors) have a substantial and often significant impact on the derived bias scores.
arXiv Detail & Related papers (2022-11-15T19:27:54Z)
- A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes.
We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z)
- Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlap frequently occurs between paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions at their positions.
Experiments on Semantic Textual Similarity show the resulting neighboring distribution divergence (NDD) to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
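A hedged sketch of that mask-and-predict strategy: mask a shared word in each of two overlapping texts, read the MLM's distribution at the masked position, and accumulate a divergence between the paired distributions. The model choice, sentence pair, and symmetrised divergence below are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch of mask-and-predict: mask a shared word in two highly
# overlapping texts and compare the MLM's predicted distributions at that slot.
# Model, sentence pair, and the symmetrised divergence are illustrative choices.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def masked_dist(text, target):
    """Distribution the MLM predicts at the position of `target` when masked."""
    ids = tok(text.replace(target, tok.mask_token), return_tensors="pt")
    pos = (ids.input_ids[0] == tok.mask_token_id).nonzero()[0, 0]
    with torch.no_grad():
        logits = mlm(**ids).logits[0, pos]
    return torch.softmax(logits, dim=-1)

a = "the chef cooked a wonderful meal"
b = "the chef burned a wonderful meal"
shared = ["chef", "meal"]        # two words on the common sequence, for brevity

divergence = 0.0
for w in shared:
    p, q = masked_dist(a, w), masked_dist(b, w)
    m = 0.5 * (p + q)            # Jensen-Shannon style symmetrised comparison
    divergence += 0.5 * ((p * (p / m).log()).sum() + (q * (q / m).log()).sum())
print("neighboring distribution divergence (toy):", float(divergence))
```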
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
- Balancing out Bias: Achieving Fairness Through Training Reweighting [58.201275105195485]
Bias in natural language processing arises from models learning characteristics of the author such as gender and race.
Existing methods for mitigating and measuring bias do not directly account for correlations between author demographics and linguistic variables.
This paper introduces a very simple but highly effective method for countering bias using instance reweighting.
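One common form of instance reweighting, sketched below under assumed details: weight each example by P(label) * P(demographic) / P(label, demographic), so over-represented (label, demographic) pairs are down-weighted in the training loss. The toy data and variable names are invented, and the paper's exact weighting scheme may differ.

```python
# Hedged sketch of instance reweighting: weight each training example by
# P(label) * P(demographic) / P(label, demographic), so examples whose
# (label, demographic) pair is over-represented are down-weighted in the loss.
# Toy data and the exact rule are assumptions; see the paper for its scheme.
from collections import Counter

# (sentiment label, author demographic) pairs with a skewed correlation
examples = [("pos", "F"), ("pos", "F"), ("pos", "M"),
            ("neg", "M"), ("neg", "M"), ("neg", "F")]

n = len(examples)
pair_freq = Counter(examples)
label_freq = Counter(l for l, _ in examples)
demo_freq = Counter(d for _, d in examples)

weights = [(label_freq[l] / n) * (demo_freq[d] / n) / (pair_freq[(l, d)] / n)
           for l, d in examples]
print(weights)  # over-represented pairs get weight < 1, rare pairs get > 1
```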
arXiv Detail & Related papers (2021-09-16T23:40:28Z)
- Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task.
The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them.
By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z)
- Temporal Embeddings and Transformer Models for Narrative Text Understanding [72.88083067388155]
We present two approaches to narrative text understanding for character relationship modelling.
The temporal evolution of these relations is described by dynamic word embeddings, which are designed to learn semantic changes over time.
A supervised learning approach based on the state-of-the-art transformer model BERT is used instead to detect static relations between characters.
arXiv Detail & Related papers (2020-03-19T14:23:12Z)