Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs
- URL: http://arxiv.org/abs/2507.07186v2
- Date: Sat, 12 Jul 2025 10:00:59 GMT
- Title: Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs
- Authors: Itay Itzhak, Yonatan Belinkov, Gabriel Stanovsky
- Abstract summary: Large language models (LLMs) exhibit cognitive biases. These biases vary across models and can be amplified by instruction tuning. It remains unclear if these differences in biases stem from pretraining, finetuning, or even random noise.
- Score: 51.00909549291524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) exhibit cognitive biases -- systematic tendencies of irrational decision-making, similar to those seen in humans. Prior work has found that these biases vary across models and can be amplified by instruction tuning. However, it remains unclear if these differences in biases stem from pretraining, finetuning, or even random noise due to training stochasticity. We propose a two-step causal experimental approach to disentangle these factors. First, we finetune models multiple times using different random seeds to study how training randomness affects over 30 cognitive biases. Second, we introduce *cross-tuning* -- swapping instruction datasets between models to isolate bias sources. This swap uses datasets that led to different bias patterns, directly testing whether biases are dataset-dependent. Our findings reveal that while training randomness introduces some variability, biases are mainly shaped by pretraining: models with the same pretrained backbone exhibit more similar bias patterns than those sharing only finetuning data. These insights suggest that understanding biases in finetuned models requires considering their pretraining origins beyond finetuning effects. This perspective can guide future efforts to develop principled strategies for evaluating and mitigating bias in LLMs.
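As a rough illustration of the two-step design in the abstract, the sketch below re-finetunes two backbones across seeds and then swaps their instruction datasets (cross-tuning). `finetune` and `bias_scores` are hypothetical stubs standing in for the real training and evaluation pipelines, not the authors' code.

```python
# Hypothetical sketch: `finetune` and `bias_scores` are stub placeholders.
import itertools
import random

BACKBONES = ["backbone_A", "backbone_B"]      # two pretrained models
INSTRUCTION_SETS = {"backbone_A": "data_A",   # each backbone's original
                    "backbone_B": "data_B"}   # instruction dataset
SEEDS = [0, 1, 2]

def finetune(backbone, dataset, seed):
    """Stub: instruction-tune `backbone` on `dataset` with a fixed seed."""
    return (backbone, dataset, seed)

def bias_scores(model):
    """Stub: score the model on 30+ cognitive-bias benchmarks."""
    rng = random.Random(hash(model))
    return [rng.random() for _ in range(30)]

# Step 1: repeated finetuning across seeds isolates the variability that
# comes purely from training stochasticity.
for backbone in BACKBONES:
    for seed in SEEDS:
        model = finetune(backbone, INSTRUCTION_SETS[backbone], seed)
        print("seed-run", backbone, seed, bias_scores(model)[:3])

# Step 2: cross-tuning -- every backbone is tuned on every instruction
# dataset, so a bias pattern can be attributed either to the pretrained
# backbone or to the finetuning data.
for backbone, dataset in itertools.product(BACKBONES, INSTRUCTION_SETS.values()):
    model = finetune(backbone, dataset, seed=0)
    print("cross-tuned", backbone, "on", dataset, bias_scores(model)[:3])
```

Under the paper's finding, bias profiles from the same backbone should cluster together regardless of which instruction dataset was used.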
Related papers
- Challenges learning from imbalanced data using tree-based models: Prevalence estimates systematically depend on hyperparameters and can be upwardly biased [0.0]
Imbalanced binary classification problems arise in many fields of study. It is common to subsample the majority class to create a (more) balanced dataset for model training. One way of accounting for this bias is to analytically map the resulting predictions to new values based on the sampling rate for the majority class. We show that calibrating a random forest this way has unintended negative consequences, including prevalence estimates that can be upwardly biased.
arXiv Detail & Related papers (2024-12-17T19:38:29Z)
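The "analytic map" this summary mentions is commonly the undersampling calibration formula (Elkan/Dal Pozzolo style); here is a minimal sketch, assuming each majority-class (negative) example was kept with probability `beta`:

```python
def correct_probability(p_sub: float, beta: float) -> float:
    """Map a score from a model trained on data whose negatives were
    kept with probability `beta` back to the original prevalence:
    p = beta*p_s / (beta*p_s - p_s + 1)."""
    return beta * p_sub / (beta * p_sub - p_sub + 1.0)

# Negatives kept at 10%: an inflated score of 0.5 maps back to ~0.09
# under the original class balance.
print(correct_probability(0.5, beta=0.1))  # 0.0909...
```

The entry's caution is that applying this map to random forests can still leave prevalence estimates upwardly biased.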
- How far can bias go? -- Tracing bias from pretraining data to alignment [54.51310112013655]
This study examines the correlation between gender-occupation bias in pre-training data and its manifestation in LLMs. Our findings reveal that biases present in pre-training data are amplified in model outputs.
arXiv Detail & Related papers (2024-11-28T16:20:25Z)
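The summary does not say how the pretraining-data bias is quantified; a simple co-occurrence log-ratio, shown below purely as an illustration (the word lists and corpus are made up, and this is not the paper's metric), captures the flavor:

```python
# Toy gender-occupation bias score from corpus co-occurrence counts.
from collections import Counter
import math

MALE = {"he", "him", "his", "man"}
FEMALE = {"she", "her", "hers", "woman"}
OCCUPATIONS = {"nurse", "engineer", "doctor"}

def cooccurrence_bias(sentences):
    """Per-occupation log-ratio of male vs. female co-occurrences."""
    counts = Counter()
    for s in sentences:
        toks = set(s.lower().split())
        for occ in OCCUPATIONS & toks:
            counts[(occ, "m")] += len(MALE & toks)
            counts[(occ, "f")] += len(FEMALE & toks)
    return {occ: math.log((counts[(occ, "m")] + 1) / (counts[(occ, "f")] + 1))
            for occ in OCCUPATIONS}

corpus = ["She works as a nurse", "He is an engineer", "He met a nurse"]
print(cooccurrence_bias(corpus))  # >0 leans male, <0 leans female
```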
- CosFairNet: A Parameter-Space based Approach for Bias Free Learning [1.9116784879310025]
Deep neural networks trained on biased data often inadvertently learn unintended inference rules.
We introduce a novel approach to address bias directly in the model's parameter space, preventing its propagation across layers.
We show enhanced classification accuracy and debiasing effectiveness across various synthetic and real-world datasets.
arXiv Detail & Related papers (2024-10-19T13:06:40Z)
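The abstract gives little detail about the parameter-space mechanism. One speculative reading, hinted at by the name, is to penalize cosine similarity between the target model's layer weights and those of a frozen, deliberately biased model; the sketch below is an assumption-heavy illustration, not the authors' method.

```python
# Speculative parameter-space debiasing penalty (not the CosFairNet code):
# discourage the target model's weights from aligning with a biased model's.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_mlp():
    return nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))

biased_model = make_mlp()   # assume pretrained so that it absorbs the bias
target_model = make_mlp()
for p in biased_model.parameters():
    p.requires_grad_(False)  # the biased expert stays frozen

def parameter_alignment_penalty(target, biased):
    """Mean cosine similarity between corresponding layer weights."""
    sims = [F.cosine_similarity(pt.flatten(), pb.flatten(), dim=0)
            for pt, pb in zip(target.parameters(), biased.parameters())]
    return torch.stack(sims).mean()

x, y = torch.randn(4, 8), torch.randint(0, 2, (4,))
loss = F.cross_entropy(target_model(x), y) \
       + 0.1 * parameter_alignment_penalty(target_model, biased_model)
loss.backward()
```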
- Echoes: Unsupervised Debiasing via Pseudo-bias Labeling in an Echo Chamber [17.034228910493056]
This paper presents experimental analyses revealing that existing biased models overfit to bias-conflicting samples in the training data.
We propose a straightforward and effective method called Echoes, which trains a biased model and a target model with different strategies.
Our approach achieves superior debiasing results compared to the existing baselines on both synthetic and real-world datasets.
arXiv Detail & Related papers (2023-05-06T13:13:18Z)
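A generic version of pseudo-bias labeling can be sketched as follows: samples that a briefly trained biased model already classifies correctly are treated as bias-aligned and down-weighted when training the target model. This is a common heuristic in this line of work, not the exact Echoes recipe.

```python
# Heuristic pseudo-bias labeling sketch (illustrative, not the paper's recipe).
import torch
import torch.nn as nn
import torch.nn.functional as F

x, y = torch.randn(64, 8), torch.randint(0, 2, (64,))
biased = nn.Linear(8, 2)
opt = torch.optim.SGD(biased.parameters(), lr=0.1)

for _ in range(20):                       # briefly fit the biased model
    opt.zero_grad()
    F.cross_entropy(biased(x), y).backward()
    opt.step()

with torch.no_grad():
    aligned = biased(x).argmax(1) == y    # pseudo-label: bias-aligned?
weights = torch.where(aligned, torch.tensor(0.2), torch.tensor(1.0))

target = nn.Linear(8, 2)                  # emphasize bias-conflicting samples
per_sample = F.cross_entropy(target(x), y, reduction="none")
(weights * per_sample).mean().backward()
```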
- Feature-Level Debiased Natural Language Understanding [86.8751772146264]
Existing natural language understanding (NLU) models often rely on dataset biases to achieve high performance on specific datasets.
We propose debiasing contrastive learning (DCT) to mitigate biased latent features while accounting for the dynamic nature of bias.
DCT outperforms state-of-the-art baselines on out-of-distribution datasets while maintaining in-distribution performance.
arXiv Detail & Related papers (2022-12-11T06:16:14Z)
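In the spirit of the summary (the exact DCT loss differs), a bias-aware contrastive objective can treat as positives the samples that share a task label but differ in a (pseudo) bias attribute, pushing the encoder away from bias features; the loss below is an illustrative simplification.

```python
# Illustrative bias-aware contrastive loss, not the DCT paper's objective.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def debiased_contrastive_loss(z, labels, bias, temp=0.1):
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / temp
    loss, n = z.new_zeros(()), 0
    for i in range(len(z)):
        pos = (labels == labels[i]) & (bias != bias[i])  # same label, other bias group
        neg = labels != labels[i]
        if pos.any():
            denom = torch.logsumexp(sim[i][pos | neg], dim=0)
            loss = loss + denom - sim[i][pos].mean()
            n += 1
    return loss / max(n, 1)

z = torch.randn(16, 32, requires_grad=True)
labels, bias = torch.randint(0, 2, (16,)), torch.randint(0, 2, (16,))
debiased_contrastive_loss(z, labels, bias).backward()
```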
- Pseudo Bias-Balanced Learning for Debiased Chest X-ray Classification [57.53567756716656]
We study the problem of developing debiased chest X-ray diagnosis models without knowing the bias labels exactly.
We propose a novel algorithm, pseudo bias-balanced learning, which first captures and predicts per-sample bias labels.
Our proposed method achieved consistent improvements over other state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-18T11:02:18Z)
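Once per-sample pseudo bias labels exist, one plausible simplification of "bias-balanced" training is to reweight the loss so each (class, bias-group) cell contributes equally; the scheme below is an assumption, not the paper's algorithm.

```python
# Bias-balanced reweighting given pseudo bias labels (illustrative).
import torch
import torch.nn.functional as F

def balanced_weights(labels, pseudo_bias):
    """Inverse-frequency weight for each (label, bias-group) cell."""
    weights = torch.empty(len(labels))
    for c in labels.unique():
        for b in pseudo_bias.unique():
            cell = (labels == c) & (pseudo_bias == b)
            if cell.any():
                weights[cell] = 1.0 / cell.sum()
    return weights / weights.sum()

logits = torch.randn(8, 2, requires_grad=True)
labels = torch.randint(0, 2, (8,))
pseudo_bias = torch.randint(0, 2, (8,))   # e.g. produced by a bias model
per_sample = F.cross_entropy(logits, labels, reduction="none")
(balanced_weights(labels, pseudo_bias) * per_sample).sum().backward()
```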
- Learning Debiased Models with Dynamic Gradient Alignment and Bias-conflicting Sample Mining [39.00256193731365]
Deep neural networks notoriously suffer from dataset biases which are detrimental to model robustness, generalization and fairness.
We propose a two-stage debiasing scheme to combat intractable unknown biases.
arXiv Detail & Related papers (2021-11-25T14:50:10Z)
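The abstract names two ingredients: mining bias-conflicting samples and aligning gradients. The sketch below mines hard samples by loss and removes the component of the easy-sample gradient that opposes them (a PCGrad-style projection chosen here for illustration; the paper's alignment mechanism may differ).

```python
# Loss-based sample mining plus a PCGrad-style gradient projection (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(8, 2)
x, y = torch.randn(32, 8), torch.randint(0, 2, (32,))

losses = F.cross_entropy(model(x), y, reduction="none")
hard = losses > losses.median()           # mined as bias-conflicting

def flat_grad(loss):
    g = torch.autograd.grad(loss, model.parameters(), retain_graph=True)
    return torch.cat([p.flatten() for p in g])

g_hard = flat_grad(losses[hard].mean())
g_easy = flat_grad(losses[~hard].mean())
dot = g_easy @ g_hard
if dot < 0:                               # drop the conflicting component
    g_easy = g_easy - dot / g_hard.pow(2).sum() * g_hard
update = g_hard + g_easy                  # combined, aligned update direction
```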
- Bayesian analysis of the prevalence bias: learning and predicting from imbalanced data [10.659348599372944]
This paper lays the theoretical and computational framework for training models, and for prediction, in the presence of prevalence bias.
It offers an alternative to principled training losses and complements test-time procedures based on selecting an operating point from summary curves.
It integrates seamlessly in the current paradigm of (deep) learning using backpropagation and naturally with Bayesian models.
arXiv Detail & Related papers (2021-07-31T14:36:33Z)
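A small worked example of the Bayes-rule correction underlying prevalence bias: since p(y|x) is proportional to p(x|y)·π(y), shifting class logits by the log-ratio of deployment to training priors re-targets the posterior. This generic adjustment is much simpler than the paper's full Bayesian framework.

```python
# Logit adjustment for a shift in class prevalence (generic illustration).
import torch

train_prior = torch.tensor([0.9, 0.1])    # imbalanced training data
deploy_prior = torch.tensor([0.5, 0.5])   # prevalence expected at test time

def adjust(logits):
    return logits + (deploy_prior / train_prior).log()

logits = torch.tensor([[2.0, 0.0]])
print(torch.softmax(logits, dim=1))           # posterior under train prior
print(torch.softmax(adjust(logits), dim=1))   # re-targeted posterior
```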
- Learning from others' mistakes: Avoiding dataset biases without modeling them [111.17078939377313]
State-of-the-art natural language processing (NLP) models often learn to model dataset biases and surface form correlations instead of features that target the intended task.
Previous work has demonstrated effective methods to circumvent these issues when knowledge of the bias is available.
We show a method for training models that learn to ignore these problematic correlations.
arXiv Detail & Related papers (2020-12-02T16:10:54Z)
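Methods in this lineage are often implemented as a product of experts: a frozen weak learner absorbs shortcut correlations, and the main model is trained through the combined distribution so it need not model them itself. A minimal sketch, with placeholder model sizes and data:

```python
# Product-of-experts debiasing sketch: train `main` through the combined
# distribution; gradients never reach the frozen weak expert.
import torch
import torch.nn as nn
import torch.nn.functional as F

weak = nn.Linear(8, 2)    # low-capacity model: learns easy correlations
main = nn.Linear(8, 2)    # main model being debiased
for p in weak.parameters():
    p.requires_grad_(False)  # weak expert is pretrained and frozen

x, y = torch.randn(16, 8), torch.randint(0, 2, (16,))
log_weak = F.log_softmax(weak(x), dim=1)
log_main = F.log_softmax(main(x), dim=1)
poe_loss = F.nll_loss(F.log_softmax(log_weak + log_main, dim=1), y)
poe_loss.backward()       # gradients flow only into `main`
```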
- Learning from Failure: Training Debiased Classifier from Biased Classifier [76.52804102765931]
We show that neural networks learn to rely on spurious correlation only when it is "easier" to learn than the desired knowledge.
We propose a failure-based debiasing scheme by training a pair of neural networks simultaneously.
Our method significantly improves the training of the network against various types of biases in both synthetic and real-world datasets.
arXiv Detail & Related papers (2020-07-06T07:20:29Z)
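The failure-based pairing can be sketched in a few lines: the biased model is trained with generalized cross-entropy (GCE), which amplifies samples it already fits, while the debiased model up-weights exactly the samples the biased model fails on via a relative-difficulty weight. The single-step loop and hyperparameters below are illustrative.

```python
# Failure-based debiasing sketch: GCE-trained biased model plus
# relative-difficulty reweighting for the debiased model.
import torch
import torch.nn as nn
import torch.nn.functional as F

def gce_loss(logits, y, q=0.7):
    """Generalized cross-entropy: emphasizes samples the model fits easily."""
    p = F.softmax(logits, dim=1).gather(1, y[:, None]).squeeze(1)
    return ((1 - p.pow(q)) / q).mean()

biased, debiased = nn.Linear(8, 2), nn.Linear(8, 2)
x, y = torch.randn(32, 8), torch.randint(0, 2, (32,))

gce_loss(biased(x), y).backward()          # biased model's update signal

with torch.no_grad():
    ce_b = F.cross_entropy(biased(x), y, reduction="none")
ce_d = F.cross_entropy(debiased(x), y, reduction="none")
w = ce_b / (ce_b + ce_d.detach() + 1e-8)   # relative difficulty weight
(w * ce_d).mean().backward()               # debiased model's update signal
```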
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.