Benchmarking Debiasing Methods for LLM-based Parameter Estimates
- URL: http://arxiv.org/abs/2506.09627v1
- Date: Wed, 11 Jun 2025 11:37:02 GMT
- Title: Benchmarking Debiasing Methods for LLM-based Parameter Estimates
- Authors: Nicolas Audinet de Pieuchon, Adel Daoud, Connor T. Jerzak, Moa Johansson, Richard Johansson
- Abstract summary: Large language models (LLMs) offer an inexpensive yet powerful way to annotate text, but are often inconsistent when compared with experts. To mitigate this bias, researchers have developed debiasing methods such as Design-based Supervised Learning (DSL) and Prediction-Powered Inference (PPI). We compare DSL and PPI across a range of tasks, finding that although both achieve low bias with large datasets, DSL often outperforms PPI on bias reduction and empirical efficiency.
- Score: 7.790904593265873
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) offer an inexpensive yet powerful way to annotate text, but are often inconsistent when compared with experts. These errors can bias downstream estimates of population parameters such as regression coefficients and causal effects. To mitigate this bias, researchers have developed debiasing methods such as Design-based Supervised Learning (DSL) and Prediction-Powered Inference (PPI), which promise valid estimation by combining LLM annotations with a limited number of expensive expert annotations. Although these methods produce consistent estimates under theoretical assumptions, it is unknown how they compare in finite samples of sizes encountered in applied research. We make two contributions: First, we study how each method's performance scales with the number of expert annotations, highlighting regimes where LLM bias or limited expert labels significantly affect results. Second, we compare DSL and PPI across a range of tasks, finding that although both achieve low bias with large datasets, DSL often outperforms PPI on bias reduction and empirical efficiency, but its performance is less consistent across datasets. Our findings indicate that there is a bias-variance tradeoff at the level of debiasing methods, calling for more research on developing metrics for quantifying their efficiency in finite samples.
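To make the setup concrete, here is a minimal sketch of a PPI-style correction for the simplest case, estimating a population mean: the LLM annotates the full corpus cheaply, and a small expert-labeled subset estimates and subtracts the LLM's bias (the "rectifier"). The toy data and variable names are illustrative, not taken from the paper; DSL follows the same combine-cheap-and-expensive-labels logic with a different, design-based correction.
```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: the true annotation is binary; the LLM is biased upward.
N, n = 10_000, 200                      # N cheap LLM labels, n expert labels
true_labels = rng.binomial(1, 0.30, N)  # unobserved ground truth
llm_labels = np.clip(true_labels + rng.binomial(1, 0.10, N), 0, 1)  # noisy LLM

# Expert annotations are only available for a small random subset.
expert_idx = rng.choice(N, size=n, replace=False)
expert_labels = true_labels[expert_idx]  # experts assumed correct

# Naive estimate: average the LLM labels directly (biased).
naive = llm_labels.mean()

# PPI-style estimate: LLM mean plus a rectifier estimated on the
# expert-labeled subset (mean expert-minus-LLM discrepancy).
rectifier = (expert_labels - llm_labels[expert_idx]).mean()
ppi = llm_labels.mean() + rectifier

# Classical estimate: experts only (unbiased but high variance).
classical = expert_labels.mean()

print(f"truth={true_labels.mean():.3f}  naive={naive:.3f}  "
      f"ppi={ppi:.3f}  experts-only={classical:.3f}")
```
The paper's finite-sample question is visible even in this toy: with small n, the rectifier itself is noisy, so the debiased estimate trades the naive estimator's bias for added variance.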
Related papers
- Mitigating Spurious Correlations in LLMs via Causality-Aware Post-Training [57.03005244917803]
Large language models (LLMs) often fail on out-of-distribution (OOD) samples due to spurious correlations acquired during pre-training. Here, we aim to mitigate such spurious correlations through causality-aware post-training (CAPT). Experiments on the formal causal inference benchmark CLadder and the logical reasoning dataset PrOntoQA show that 3B-scale language models fine-tuned with CAPT can outperform both traditional SFT and larger LLMs on in-distribution (ID) and OOD tasks.
arXiv Detail & Related papers (2025-06-11T06:30:28Z)
- RAZOR: Sharpening Knowledge by Cutting Bias with Unsupervised Text Rewriting [16.633948320306832]
Biases prevalent in manually constructed datasets can introduce spurious correlations between tokens and labels. Existing debiasing methods often rely on prior knowledge of specific dataset biases. We propose RAZOR, a novel, unsupervised, and data-focused debiasing approach based on text rewriting for shortcut mitigation.
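RAZOR's rewriting procedure is not reproduced here, but the kind of token-label shortcut it targets can be made concrete with a simple pointwise-mutual-information scan; this is an illustrative diagnostic, not RAZOR itself:
```python
from collections import Counter
import math

def shortcut_tokens(texts, labels, min_count=5):
    """Rank tokens by PMI with a label; high-PMI tokens are shortcut suspects."""
    token_counts, joint_counts = Counter(), Counter()
    label_counts = Counter(labels)
    for text, label in zip(texts, labels):
        for tok in set(text.lower().split()):
            token_counts[tok] += 1
            joint_counts[(tok, label)] += 1
    n = len(texts)
    scores = {}
    for (tok, label), c in joint_counts.items():
        if token_counts[tok] >= min_count:
            # PMI(tok, label) = log P(tok, label) / (P(tok) P(label))
            pmi = math.log((c / n) /
                           ((token_counts[tok] / n) * (label_counts[label] / n)))
            scores[(tok, label)] = pmi
    return sorted(scores.items(), key=lambda kv: -kv[1])

# A token like "amazing" co-occurring almost exclusively with the positive
# label would rank near the top, flagging a candidate shortcut.
```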
arXiv Detail & Related papers (2024-12-10T17:02:58Z)
- The Mismeasure of Man and Models: Evaluating Allocational Harms in Large Language Models [22.75594773147521]
We introduce the Rank-Allocation-Based Bias Index (RABBI), a model-agnostic bias measure that assesses potential allocational harms arising from biases in large language models (LLMs).
Our results reveal that commonly used bias metrics based on average performance gaps and distribution distances fail to reliably capture group disparities in allocation outcomes.
Our work highlights the need to account for how models are used in resource-constrained contexts.
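To see why allocation outcomes can diverge from average score gaps (a hedged illustration of the phenomenon, not RABBI's formula), consider a top-k selection over model scores for two groups with identical means:
```python
import numpy as np

rng = np.random.default_rng(1)

# Two groups with identical mean scores but different spread.
scores_a = rng.normal(0.5, 0.05, 1000)   # group A: tight around 0.5
scores_b = rng.normal(0.5, 0.20, 1000)   # group B: same mean, wider spread

avg_gap = scores_a.mean() - scores_b.mean()   # ~0: looks "fair" on average

# Allocation: select the global top 10% by score.
all_scores = np.concatenate([scores_a, scores_b])
groups = np.array(["A"] * 1000 + ["B"] * 1000)
k = int(0.10 * len(all_scores))
selected = groups[np.argsort(all_scores)[-k:]]

rate_a = (selected == "A").mean()
rate_b = (selected == "B").mean()
print(f"avg gap={avg_gap:+.3f}  selected share A={rate_a:.2f}  B={rate_b:.2f}")
# Despite a near-zero average gap, group B dominates the top-k allocation,
# which is exactly the disparity an average-gap metric misses.
```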
arXiv Detail & Related papers (2024-08-02T14:13:06Z)
- Beyond Performance: Quantifying and Mitigating Label Bias in LLMs [8.77694178599322]
We evaluate different approaches to quantifying label bias in a model's predictions.
Our investigation reveals substantial label bias in models both before and after debiasing attempts.
We propose a novel label bias calibration method tailored for few-shot prompting.
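The paper's calibration method is not reproduced here; as a hedged sketch, a common baseline in this space (contextual-calibration-style rescaling, where a content-free probe input estimates the model's label prior) looks roughly like this:
```python
import numpy as np

def calibrate(label_probs, prior_probs):
    """Rescale predicted label probabilities by an estimated label prior.

    label_probs: model's probabilities over labels for a real input
    prior_probs: probabilities the model assigns to each label for a
                 content-free input (e.g. "N/A"), estimating its label bias
    """
    rescaled = np.asarray(label_probs) / np.asarray(prior_probs)
    return rescaled / rescaled.sum()

# A model biased toward label 0 (prior 0.7 vs 0.3) has its scores corrected:
print(calibrate([0.60, 0.40], prior_probs=[0.70, 0.30]))  # -> ~[0.39, 0.61]
```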
arXiv Detail & Related papers (2024-05-04T19:53:03Z)
- ChatGPT Based Data Augmentation for Improved Parameter-Efficient Debiasing of LLMs [65.9625653425636]
Large language models (LLMs) exhibit harmful social biases.
This work introduces a novel approach utilizing ChatGPT to generate synthetic training data.
arXiv Detail & Related papers (2024-02-19T01:28:48Z)
- Querying Easily Flip-flopped Samples for Deep Active Learning [63.62397322172216]
Active learning is a machine learning paradigm that aims to improve the performance of a model by strategically selecting and querying unlabeled data.
One effective selection strategy is to base it on the model's predictive uncertainty, which can be interpreted as a measure of how informative a sample is.
This paper proposes the least disagree metric (LDM), defined as the smallest probability of disagreement of the predicted label.
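The paper's own estimator for the LDM is not reproduced here; a minimal sketch of the underlying idea, estimating how easily each sample's predicted label flips under small parameter perturbations and querying the least stable samples, might look like this for a linear classifier (all names illustrative):
```python
import numpy as np

def flip_rates(X, w, n_perturb=100, sigma=0.1, rng=None):
    """Estimate each sample's probability of predicted-label disagreement.

    Perturb the weights w with Gaussian noise of scale sigma and count how
    often each sample's predicted label flips. Samples flipping under the
    smallest perturbations sit closest to the decision boundary.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    base_pred = X @ w > 0
    flips = np.zeros(len(X))
    for _ in range(n_perturb):
        w_pert = w + sigma * rng.standard_normal(w.shape)
        flips += (X @ w_pert > 0) != base_pred
    return flips / n_perturb

# Active learning step (sketch): query the unlabeled samples with the
# highest flip rate, i.e. those easily flip-flopped by small perturbations.
```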
arXiv Detail & Related papers (2024-01-18T08:12:23Z)
- The Gaps between Pre-train and Downstream Settings in Bias Evaluation and Debiasing [74.7319697510621]
In-context learning (ICL) induces smaller changes to pre-trained language models (PLMs) than fine-tuning (FT)-based debiasing methods.
ICL-based debiasing methods show a higher correlation between intrinsic and extrinsic bias scores compared to FT-based methods.
arXiv Detail & Related papers (2024-01-16T17:15:08Z)
- Benchmarking Causal Study to Interpret Large Language Models for Source Code [6.301373791541809]
This paper introduces a benchmarking strategy named Galeras, comprising curated testbeds for three software engineering (SE) tasks.
We illustrate the insights of our benchmarking strategy by conducting a case study on the performance of ChatGPT under distinct prompt engineering methods.
arXiv Detail & Related papers (2023-08-23T20:32:12Z)
- Learning to Estimate Without Bias [57.82628598276623]
The Gauss-Markov theorem states that the weighted least squares estimator is a linear minimum variance unbiased estimator (MVUE) in linear models.
In this paper, we take a first step towards extending this result to nonlinear settings via deep learning with bias constraints, yielding the bias-constrained estimator (BCE).
A second motivation for the BCE arises in applications where multiple estimates of the same unknown are averaged for improved performance.
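The paper's exact objective is not reproduced here; a hedged sketch of the core idea, adding a squared-bias penalty to a standard MSE loss by averaging errors within groups of samples that share the same true parameter value, could be written as:
```python
import numpy as np

def bias_constrained_loss(pred, target, group_ids, lam=1.0):
    """MSE plus a squared-bias penalty, in the spirit of bias-constrained
    estimation: within each group of samples sharing the same true parameter
    value, the *mean* error (the empirical bias) is pushed toward zero.
    """
    mse = np.mean((pred - target) ** 2)
    group_list = np.unique(group_ids)
    bias_sq = 0.0
    for g in group_list:
        mask = group_ids == g
        bias_sq += np.mean(pred[mask] - target[mask]) ** 2
    bias_sq /= len(group_list)
    return mse + lam * bias_sq

# Usage with an autodiff framework: swap np for the framework's ops and
# minimize this loss over the estimator network's parameters.
```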
arXiv Detail & Related papers (2021-10-24T10:23:51Z)
- Accelerated Convergence for Counterfactual Learning to Rank [65.63997193915257]
We show that the convergence rate of SGD approaches with inverse propensity scoring (IPS)-weighted gradients suffers from the large variance introduced by the IPS weights.
We propose a novel learning algorithm, called CounterSample, that has provably better convergence than standard IPS-weighted gradient descent methods.
We prove that CounterSample converges faster and complement our theoretical findings with empirical results.
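CounterSample's full algorithm is in the paper; the essential variance-reduction idea, sampling examples in proportion to their IPS weights instead of multiplying uniform-sample gradients by those weights, can be sketched as follows (the toy squared-loss objective and all names are illustrative):
```python
import numpy as np

rng = np.random.default_rng(2)

# Toy logged data: features, labels, and heavy-tailed IPS weights (1/p).
n, d = 5000, 10
X = rng.standard_normal((n, d))
y = rng.binomial(1, 0.5, n).astype(float)
weights = 1.0 / rng.uniform(0.05, 1.0, n)

def grad(w, i, scale):
    """Gradient of a squared loss on example i, scaled by `scale`."""
    return scale * (X[i] @ w - y[i]) * X[i]

w_ips, w_cs = np.zeros(d), np.zeros(d)
p_sample = weights / weights.sum()   # weight-proportional distribution
lr = 0.001
for _ in range(20_000):
    # Standard IPS-weighted SGD: uniform sample, weight-scaled gradient
    # (unbiased, but variance grows with the spread of the weights).
    i = rng.integers(n)
    w_ips -= lr * grad(w_ips, i, scale=weights[i])
    # CounterSample-style: weight-proportional sample, constant scale
    # (mean weight), giving the same expected gradient with lower variance.
    j = rng.choice(n, p=p_sample)
    w_cs -= lr * grad(w_cs, j, scale=weights.mean())

def weighted_loss(w):
    return np.mean(weights * (X @ w - y) ** 2)

print(f"IPS-SGD loss={weighted_loss(w_ips):.4f}  "
      f"CounterSample loss={weighted_loss(w_cs):.4f}")
```
Both updates estimate the same expected gradient; moving the weights from the gradient scale into the sampling distribution is what removes the variance blow-up.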
arXiv Detail & Related papers (2020-05-21T12:53:36Z)