Test-Time Fairness and Robustness in Large Language Models
- URL: http://arxiv.org/abs/2406.07685v2
- Date: Fri, 04 Oct 2024 21:10:33 GMT
- Title: Test-Time Fairness and Robustness in Large Language Models
- Authors: Leonardo Cotta, Chris J. Maddison
- Abstract summary: Frontier Large Language Models (LLMs) can be socially discriminatory or sensitive to spurious features of their inputs.
Existing solutions, which instruct the LLM to be fair or robust, rely on the model's implicit understanding of bias.
We show that our prompting strategy, unlike implicit instructions, consistently reduces the bias of frontier LLMs.
- Score: 17.758735680493917
- Abstract: Frontier Large Language Models (LLMs) can be socially discriminatory or sensitive to spurious features of their inputs. Because only well-resourced corporations can train frontier LLMs, we need robust test-time strategies to control such biases. Existing solutions, which instruct the LLM to be fair or robust, rely on the model's implicit understanding of bias. Causality provides a rich formalism through which we can be explicit about our debiasing requirements. Yet, as we show, a naive application of the standard causal debiasing strategy, counterfactual data augmentation, fails under standard assumptions to debias predictions at an individual level at test time. To address this, we develop a stratified notion of debiasing called stratified invariance, which can capture a range of debiasing requirements from population level to individual level through an additional measurement that stratifies the predictions. We present a complete observational test for stratified invariance. Finally, we introduce a data augmentation strategy that guarantees stratified invariance at test time under suitable assumptions, together with a prompting strategy that encourages stratified invariance in LLMs. We show that our prompting strategy, unlike implicit instructions, consistently reduces the bias of frontier LLMs across a suite of synthetic and real-world benchmarks without requiring additional data, finetuning or pre-training.
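The abstract states the target property (stratified invariance) rather than the full recipe, but the test-time mechanism it builds on, counterfactual data augmentation of the input, can be sketched as follows. This is a minimal sketch, not the authors' exact prompting strategy: `make_counterfactual`, `query_llm`, and the majority-vote aggregation are assumptions supplied for illustration, and the stratified version would additionally condition the procedure on the extra stratifying measurement.

```python
# Minimal sketch of test-time counterfactual augmentation (illustrative, not
# the paper's exact algorithm): query the LLM on copies of the input with the
# sensitive attribute set to each possible value, then aggregate the answers.
from collections import Counter
from typing import Callable, Iterable

def debiased_predict(
    text: str,
    attribute_values: Iterable[str],
    make_counterfactual: Callable[[str, str], str],  # (text, attribute value) -> rewritten text
    query_llm: Callable[[str], str],                 # prompt -> predicted label
) -> str:
    """Majority vote over counterfactual copies of `text`."""
    votes = Counter(
        query_llm(make_counterfactual(text, value)) for value in attribute_values
    )
    # Aggregating over all counterfactual copies makes the answer insensitive
    # to the attribute value encoded in the original input. As the abstract
    # notes, this naive form is not enough for individual-level guarantees,
    # which is what the stratified construction addresses.
    return votes.most_common(1)[0][0]
```

The rewriter could be rule-based or itself an LLM call; the paper's own strategy works purely through prompting, without additional data, finetuning or pre-training, which this external-aggregation sketch does not capture.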
Related papers
- Investigating Implicit Bias in Large Language Models: A Large-Scale Study of Over 50 LLMs [0.0]
Large Language Models (LLMs) are being adopted across a wide range of tasks.
Recent research indicates that LLMs can harbor implicit biases even when they pass explicit bias evaluations.
This study highlights that newer or larger language models do not automatically exhibit reduced bias.
arXiv Detail & Related papers (2024-10-13T03:43:18Z)
- Editable Fairness: Fine-Grained Bias Mitigation in Language Models [52.66450426729818]
We propose a novel debiasing approach, Fairness Stamp (FAST), which enables fine-grained calibration of individual social biases.
FAST surpasses state-of-the-art baselines in debiasing performance.
This highlights the potential of fine-grained debiasing strategies to achieve fairness in large language models.
arXiv Detail & Related papers (2024-08-07T17:14:58Z)
- Quantifying Prediction Consistency Under Model Multiplicity in Tabular LLMs [10.494477811252034]
Fine-tuning large language models can lead to fine-tuning multiplicity, where equally well-performing models make conflicting predictions on the same inputs.
This raises critical concerns about the robustness and reliability of Tabular LLMs.
This work proposes a novel metric to quantify the robustness of individual predictions without expensive model retraining.
arXiv Detail & Related papers (2024-07-04T22:22:09Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve model alignment across different task scenarios.
We implement UAL in a simple fashion: adaptively setting the label smoothing value during training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
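The UAL entry above is concrete enough for a sketch: the label-smoothing strength is set per sample from an uncertainty estimate. The linear mapping from a normalized uncertainty in [0, 1] to the smoothing value, and the generic classification setup, are illustrative assumptions rather than the paper's exact formulation.

```python
# Sketch of uncertainty-aware label smoothing: higher per-sample uncertainty
# leads to a softer target distribution. The linear uncertainty -> smoothing
# mapping is an assumption for illustration.
import torch
import torch.nn.functional as F

def uncertainty_aware_loss(
    logits: torch.Tensor,       # (batch, num_classes) model outputs
    targets: torch.Tensor,      # (batch,) gold class indices
    uncertainty: torch.Tensor,  # (batch,) uncertainty scores normalized to [0, 1]
    max_smoothing: float = 0.2,
) -> torch.Tensor:
    num_classes = logits.size(-1)
    eps = max_smoothing * uncertainty  # per-sample smoothing value
    one_hot = F.one_hot(targets, num_classes).float()
    # Standard label smoothing, but with a per-sample epsilon.
    smoothed = (1.0 - eps).unsqueeze(-1) * one_hot + (eps / num_classes).unsqueeze(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(smoothed * log_probs).sum(dim=-1).mean()
```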
- Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation [62.2436697657307]
Prediction-powered inference (PPI) is a method that improves statistical estimates based on limited human-labeled data.
We propose a method called Stratified Prediction-Powered Inference (StratPPI).
We show that the basic PPI estimates can be considerably improved by employing simple data stratification strategies.
arXiv Detail & Related papers (2024-06-06T17:37:39Z)
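The summary above gives the shape of StratPPI without formulas. For estimating a mean, the basic PPI point estimate and a stratified variant can be sketched as below; weighting strata by their frequency in the unlabeled sample is an assumption on my part, and the paper's actual estimators, including confidence intervals, may differ.

```python
# Sketch of prediction-powered inference (PPI) for a mean, plus a stratified
# variant in the spirit of StratPPI. All arguments are NumPy arrays, and every
# stratum in the unlabeled data is assumed to appear in the labeled data too.
import numpy as np

def ppi_mean(y_labeled, yhat_labeled, yhat_unlabeled):
    """Basic PPI point estimate of E[Y]: model mean plus a label-based rectifier."""
    return np.mean(yhat_unlabeled) + np.mean(y_labeled - yhat_labeled)

def stratified_ppi_mean(y, yhat, strata, yhat_u, strata_u):
    """Combine per-stratum PPI estimates, weighted by how often each stratum
    occurs in the (large) unlabeled sample."""
    estimate = 0.0
    for s in np.unique(strata_u):
        weight = np.mean(strata_u == s)
        estimate += weight * ppi_mean(
            y[strata == s], yhat[strata == s], yhat_u[strata_u == s]
        )
    return estimate
```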
- Towards Understanding Task-agnostic Debiasing Through the Lenses of Intrinsic Bias and Forgetfulness [10.081447621656523]
The impact of debiasing on language modeling ability can be alleviated given a high-quality and long-contextualized debiasing corpus.
The effectiveness of task-agnostic debiasing hinges on the quantitative bias level of both the task-specific data used for downstream applications and the debiased model.
We propose ProSocialTuning, a novel framework that can Propagate Socially-fair Debiasing to Downstream Fine-tuning.
arXiv Detail & Related papers (2024-06-06T15:11:11Z)
- Language Model Cascades: Token-level uncertainty and beyond [65.38515344964647]
Recent advances in language models (LMs) have led to significant improvements in quality on complex NLP tasks.
Cascading offers a simple strategy to achieve more favorable cost-quality tradeoffs.
We show that incorporating token-level uncertainty through learned post-hoc deferral rules can significantly outperform simple aggregation strategies.
arXiv Detail & Related papers (2024-04-15T21:02:48Z)
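The cascade entry above can be sketched compactly: summarize the small model's per-token log-probabilities into a few features and let a learned post-hoc rule decide whether to defer to the larger model. The quantile features and the callable interfaces are assumptions for illustration, not the paper's exact design.

```python
# Sketch of a two-model cascade with a learned, post-hoc deferral rule based on
# token-level uncertainty. Interfaces and features are illustrative.
from typing import Callable, Sequence, Tuple
import numpy as np

def uncertainty_features(token_logprobs: Sequence[float]) -> np.ndarray:
    # Quantiles of per-token log-probabilities retain more information than a
    # single aggregate such as the mean sequence log-probability.
    return np.quantile(np.asarray(token_logprobs), [0.1, 0.25, 0.5, 0.75, 0.9])

def cascade_answer(
    prompt: str,
    small_model: Callable[[str], Tuple[str, Sequence[float]]],  # -> (answer, token log-probs)
    large_model: Callable[[str], str],
    should_defer: Callable[[np.ndarray], bool],  # learned post-hoc deferral rule
) -> str:
    answer, token_logprobs = small_model(prompt)
    if should_defer(uncertainty_features(token_logprobs)):
        return large_model(prompt)  # escalate only when the rule says to
    return answer
```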
- Selective Learning: Towards Robust Calibration with Dynamic Regularization [79.92633587914659]
Miscalibration in deep learning refers to a discrepancy between a model's predicted confidence and its actual performance.
We introduce Dynamic Regularization (DReg), which aims to learn what should be learned during training, thereby circumventing the confidence-adjustment trade-off.
arXiv Detail & Related papers (2024-02-13T11:25:20Z)
- Delving into Identify-Emphasize Paradigm for Combating Unknown Bias [52.76758938921129]
We propose an effective bias-conflicting scoring method (ECS) to boost the identification accuracy.
We also propose gradient alignment (GA) to balance the contributions of the mined bias-aligned and bias-conflicting samples.
Experiments are conducted on multiple datasets in various settings, demonstrating that the proposed solution can mitigate the impact of unknown biases.
arXiv Detail & Related papers (2023-02-22T14:50:24Z)
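The last entry's gradient alignment (GA) balances how mined bias-aligned and bias-conflicting samples drive the parameter update. A minimal sketch of that balancing idea is below; the equal weighting of the two group means is a simplification, not necessarily the weighting used in the paper, and the mask is assumed to come from the bias-conflicting scoring step.

```python
# Sketch of balancing bias-aligned vs. bias-conflicting contributions to the
# loss. The equal weighting of group means is an illustrative simplification.
import torch

def balanced_loss(
    per_sample_loss: torch.Tensor,  # (batch,) unreduced losses
    is_conflicting: torch.Tensor,   # (batch,) bool mask from the scoring method
) -> torch.Tensor:
    groups = [per_sample_loss[~is_conflicting], per_sample_loss[is_conflicting]]
    means = [g.mean() for g in groups if g.numel() > 0]
    # Averaging group means keeps the (usually rare) bias-conflicting samples
    # from being drowned out by the bias-aligned majority.
    return sum(means) / len(means)
```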
This list is automatically generated from the titles and abstracts of the papers on this site.