Uncovering Implicit Bias in Large Language Models with Concept Learning Dataset
- URL: http://arxiv.org/abs/2510.01219v1
- Date: Sun, 21 Sep 2025 09:04:31 GMT
- Title: Uncovering Implicit Bias in Large Language Models with Concept Learning Dataset
- Authors: Leroy Z. Wang
- Abstract summary: Using in-context concept learning experiments, we found that language models may have a bias toward upward monotonicity in quantifiers. This demonstrates that in-context concept learning can be an effective way to discover hidden biases in language models.
- Score: 0.038073142980733
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a dataset of concept learning tasks that helps uncover implicit biases in large language models. Using in-context concept learning experiments, we found that language models may have a bias toward upward monotonicity in quantifiers; such bias is less apparent when the model is tested by direct prompting without concept learning components. This demonstrates that in-context concept learning can be an effective way to discover hidden biases in language models.
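The contrast the abstract draws, between an in-context concept learning probe and direct prompting, can be sketched as two prompt constructions. The nonce concept label ("blick"), the example sentences, and the monotonicity items below are illustrative assumptions, not items from the paper's dataset:

```python
# Two probing setups: (1) a few-shot concept learning prompt, where the model
# must induce a hidden quantifier-like concept from labeled examples, and
# (2) a direct prompt that asks about the inference pattern outright.
# All sentences and labels here are invented for illustration.

def concept_learning_prompt(examples, query):
    """Few-shot prompt: the model infers the concept from labeled examples."""
    lines = ["Each sentence is labeled 'blick' or 'not blick'."]
    for sentence, is_positive in examples:
        label = "blick" if is_positive else "not blick"
        lines.append(f"Sentence: {sentence}\nLabel: {label}")
    lines.append(f"Sentence: {query}\nLabel:")
    return "\n\n".join(lines)

def direct_prompt(query):
    """Direct probe: ask about the property with no learning phase."""
    return f"Does the following inference hold? {query} Answer yes or no."

# Hypothetical upward-monotonicity item: if "some dogs bark" is true,
# does "some animals bark" follow (upward-entailing position)?
examples = [("Some dogs bark.", True), ("No cats bark.", False)]
print(concept_learning_prompt(examples, "Some birds sing."))
print(direct_prompt("If some dogs bark, do some animals bark?"))
```

Comparing the model's behavior across these two setups is the sense in which a bias can be "less apparent" under direct prompting.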
Related papers
- Mitigating Biases in Language Models via Bias Unlearning [27.565946855618368]
We propose BiasUnlearn, a novel model debiasing framework which achieves targeted debiasing via dual-pathway unlearning mechanisms. The results show that BiasUnlearn outperforms existing methods in mitigating bias in language models while retaining language modeling capabilities.
arXiv Detail & Related papers (2025-09-30T02:15:12Z)
- Collapsed Language Models Promote Fairness [88.48232731113306]
We find that debiased language models exhibit collapsed alignment between token representations and word embeddings. We design a principled fine-tuning method that can effectively improve fairness in a wide range of debiasing methods.
arXiv Detail & Related papers (2024-10-06T13:09:48Z)
- Interpreting Pretrained Language Models via Concept Bottlenecks [55.47515772358389]
Pretrained language models (PLMs) have made significant strides in various natural language processing tasks.
The lack of interpretability due to their "black-box" nature poses challenges for responsible implementation.
We propose a novel approach to interpreting PLMs by employing high-level, meaningful concepts that are easily understandable for humans.
arXiv Detail & Related papers (2023-11-08T20:41:18Z)
- Towards Understanding What Code Language Models Learned [10.989953856458996]
Pre-trained language models are effective in a variety of natural language tasks.
It has been argued that their capabilities fall short of fully learning meaning or understanding language.
We investigate their ability to capture semantics of code beyond superficial frequency and co-occurrence.
arXiv Detail & Related papers (2023-06-20T23:42:14Z)
- Commonsense Knowledge Transfer for Pre-trained Language Models [83.01121484432801]
We introduce commonsense knowledge transfer, a framework to transfer the commonsense knowledge stored in a neural commonsense knowledge model to a general-purpose pre-trained language model.
It first exploits general texts to form queries for extracting commonsense knowledge from the neural commonsense knowledge model.
It then refines the language model with two self-supervised objectives: commonsense mask infilling and commonsense relation prediction.
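The first of the two objectives can be illustrated with a minimal masking step. The sentence, concept span, and mask token below are invented for illustration and are not taken from the paper:

```python
# Illustrative sketch of a "commonsense mask infilling" objective: mask a
# commonsense concept span in a sentence so the model can be trained to
# recover it. The example sentence and masking scheme are assumptions.

def mask_concept(sentence, concept, mask_token="[MASK]"):
    """Replace a commonsense concept span with a mask token; return both."""
    if concept not in sentence:
        raise ValueError("concept not found in sentence")
    return sentence.replace(concept, mask_token), concept

masked, target = mask_concept("You need an oven to bake bread.", "oven")
print(masked)   # You need an [MASK] to bake bread.
print(target)   # oven
```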
arXiv Detail & Related papers (2023-06-04T15:44:51Z)
- Debiasing Vision-Language Models via Biased Prompts [79.04467131711775]
We propose a general approach for debiasing vision-language foundation models by projecting out biased directions in the text embedding.
We show that debiasing only the text embedding with a calibrated projection matrix suffices to yield robust classifiers and fair generative models.
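The core projection idea can be sketched in a few lines. This is not the paper's calibrated projection matrix, only the basic step of removing one bias direction from an embedding, with toy vectors invented for illustration:

```python
# Minimal sketch: project an embedding onto the orthogonal complement of a
# single bias direction, e' = e - ((e.v)/(v.v)) v. The paper's method uses
# a calibrated projection matrix over biased directions; this shows only
# the underlying projection, with toy values.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_out(embedding, bias_direction):
    """Remove the component of the embedding along one bias direction."""
    scale = dot(embedding, bias_direction) / dot(bias_direction, bias_direction)
    return [e - scale * b for e, b in zip(embedding, bias_direction)]

embedding = [0.9, 0.1, 0.4]
bias_direction = [1.0, 0.0, 0.0]   # toy bias axis
debiased = project_out(embedding, bias_direction)
print(debiased)  # [0.0, 0.1, 0.4]: no component left along the bias axis
```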
arXiv Detail & Related papers (2023-01-31T20:09:33Z)
- Counteracts: Testing Stereotypical Representation in Pre-trained Language Models [4.211128681972148]
We use counterexamples to examine the internal stereotypical knowledge in pre-trained language models (PLMs).
We evaluate 7 PLMs on 9 types of cloze-style prompts with different information and base knowledge.
arXiv Detail & Related papers (2023-01-11T07:52:59Z)
- Debiasing Methods in Natural Language Understanding Make Bias More Accessible [28.877572447481683]
Recent debiasing methods in natural language understanding (NLU) improve performance on such datasets by pressuring models into making unbiased predictions.
We propose a general probing-based framework that allows for post-hoc interpretation of biases in language models.
We show that, counter-intuitively, the more a language model is pushed towards a debiased regime, the more bias is actually encoded in its inner representations.
arXiv Detail & Related papers (2021-09-09T08:28:22Z)
- Learning from others' mistakes: Avoiding dataset biases without modeling them [111.17078939377313]
State-of-the-art natural language processing (NLP) models often learn to model dataset biases and surface form correlations instead of features that target the intended task.
Previous work has demonstrated effective methods to circumvent these issues when knowledge of the bias is available.
We show a method for training models that learn to ignore these problematic correlations.
arXiv Detail & Related papers (2020-12-02T16:10:54Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.