Steering LLMs Towards Unbiased Responses: A Causality-Guided Debiasing Framework
- URL: http://arxiv.org/abs/2403.08743v1
- Date: Wed, 13 Mar 2024 17:46:28 GMT
- Title: Steering LLMs Towards Unbiased Responses: A Causality-Guided Debiasing Framework
- Authors: Jingling Li, Zeyu Tang, Xiaoyu Liu, Peter Spirtes, Kun Zhang, Liu Leqi, Yang Liu
- Abstract summary: Large language models (LLMs) can easily generate biased and discriminative responses.
This paper focuses on social bias, tackling the association between demographic information and LLM outputs.
- Score: 20.753141804841
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models (LLMs) can easily generate biased and discriminative
responses. As LLMs tap into consequential decision-making (e.g., hiring and
healthcare), it is of crucial importance to develop strategies to mitigate
these biases. This paper focuses on social bias, tackling the association
between demographic information and LLM outputs. We propose a causality-guided
debiasing framework that utilizes causal understandings of (1) the
data-generating process of the training corpus fed to LLMs, and (2) the
internal reasoning process of LLM inference, to guide the design of prompts for
debiasing LLM outputs through selection mechanisms. Our framework unifies
existing debiasing prompting approaches such as inhibitive instructions and
in-context contrastive examples, and sheds light on new ways of debiasing by
encouraging bias-free reasoning. Our strong empirical performance on real-world
datasets demonstrates that our framework provides principled guidelines on
debiasing LLM outputs even with only black-box access.
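As a concrete illustration of the two prompting strategies the framework unifies, here is a minimal sketch of composing an inhibitive instruction with in-context contrastive examples into a single debiasing prompt. This is not the authors' code: the helper names, example texts, and prompt wording are all hypothetical, and any black-box LLM API could consume the resulting string.

```python
# Minimal sketch (not the paper's implementation): composing an inhibitive
# instruction with in-context contrastive examples. All names and texts
# below are hypothetical.

INHIBITIVE_INSTRUCTION = (
    "Do not let demographic attributes (e.g., gender, race, age) influence "
    "your answer; base your reasoning only on task-relevant facts."
)

# Contrastive pairs: identical task content with different demographic
# surface forms, each mapped to the same bias-free target answer.
CONTRASTIVE_EXAMPLES = [
    ("The applicant, Mr. Smith, has 5 years of Python experience. Qualified?",
     "Yes - five years of relevant experience meets the requirement."),
    ("The applicant, Ms. Smith, has 5 years of Python experience. Qualified?",
     "Yes - five years of relevant experience meets the requirement."),
]

def build_debiasing_prompt(query: str) -> str:
    """Assemble a prompt that discourages demographic associations."""
    demos = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in CONTRASTIVE_EXAMPLES)
    return f"{INHIBITIVE_INSTRUCTION}\n\n{demos}\n\nQ: {query}\nA:"

print(build_debiasing_prompt(
    "The applicant, Mx. Jordan, has 5 years of Python experience. Qualified?"
))
```

The contrastive pair acts as a selection mechanism in miniature: the two demonstrations differ only in the demographic token yet share one answer, signaling that the demographic attribute should not enter the model's reasoning.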
Related papers
- Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We show that prompting-based rationales align better with human-annotated rationales than attribution-based rationales.
We additionally find that the faithfulness limitations of prompting-based methods, which are identified in previous work, may be linked to their collapsed predictions.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
- UniBias: Unveiling and Mitigating LLM Bias through Internal Attention and FFN Manipulation [12.04811490937078]
We investigate how feedforward neural networks (FFNs) and attention heads give rise to bias in large language models (LLMs).
To mitigate these biases, we introduce UniBias, an inference-only method that effectively identifies and eliminates biased FFN vectors and attention heads (a generic head-masking sketch follows this entry).
arXiv Detail & Related papers (2024-05-31T03:59:15Z)
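The summary above does not say how biased components are located, so the following is only a generic sketch of the manipulation half, zeroing selected attention heads at inference time; the tensor layout, head count, and head indices are assumptions, not the UniBias procedure.

```python
import torch

# Generic head-masking sketch (not the UniBias algorithm: identifying
# which heads are biased is the paper's contribution and is not shown).
# Assumes an attention output of shape [batch, seq, n_heads * head_dim].

def mask_heads(attn_out: torch.Tensor, n_heads: int, biased: list) -> torch.Tensor:
    """Zero the contribution of the listed attention heads."""
    b, s, d_model = attn_out.shape
    head_dim = d_model // n_heads
    out = attn_out.view(b, s, n_heads, head_dim).clone()
    out[:, :, biased, :] = 0.0  # eliminate the selected heads
    return out.view(b, s, d_model)

x = torch.randn(2, 4, 8 * 64)  # toy batch: 8 heads of width 64
y = mask_heads(x, n_heads=8, biased=[1, 5])
```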
- A Causal Explainable Guardrails for Large Language Models [29.441292837667415]
Large Language Models (LLMs) have shown impressive performance in natural language tasks, but their outputs can exhibit undesirable attributes or biases.
Existing methods for steering LLMs towards desired attributes often assume unbiased representations and rely solely on steering prompts.
We propose LLMGuardaril, a novel framework that incorporates causal analysis and adversarial learning to obtain unbiased steering representations.
arXiv Detail & Related papers (2024-05-07T09:55:05Z)
- Causal Prompting: Debiasing Large Language Model Prompting based on Front-Door Adjustment [32.12998469814097]
A novel causal prompting method based on front-door adjustment is proposed to effectively mitigate biases in Large Language Models (LLMs); the textbook front-door formula it builds on is sketched after this entry.
Experimental results show that the proposed causal prompting approach achieves excellent performance across seven natural language processing datasets.
arXiv Detail & Related papers (2024-03-05T07:47:34Z)
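For context, the textbook front-door adjustment identity from causal inference is shown below; the summary does not spell out the paper's estimand, so take the treatment X, outcome Y, and mediator M (e.g., a generated chain of thought) as illustrative rather than the paper's exact notation.

```latex
% Textbook front-door adjustment with treatment X, outcome Y, mediator M.
P(Y \mid \mathrm{do}(X=x)) = \sum_{m} P(m \mid x) \sum_{x'} P(Y \mid x', m)\, P(x')
```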
- Exploring Value Biases: How LLMs Deviate Towards the Ideal [57.99044181599786]
Large language models (LLMs) are deployed in a wide range of applications, and their responses have an increasing social impact.
We show that value bias is strong in LLMs across different categories, similar to the results found in human studies.
arXiv Detail & Related papers (2024-02-16T18:28:43Z)
- The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning [61.68787689234622]
A recent study, LIMA, shows that alignment tuning with merely 1K examples can achieve significant alignment performance.
This raises questions about how exactly the alignment tuning transforms a base LLM.
We show that the gap between tuning-free and tuning-based alignment methods can be significantly reduced through strategic prompting.
arXiv Detail & Related papers (2023-12-04T00:46:11Z)
- Exploring the Jungle of Bias: Political Bias Attribution in Language Models via Dependency Analysis [86.49858739347412]
Large Language Models (LLMs) have sparked intense debate regarding the prevalence of bias in these models and its mitigation.
We propose a prompt-based method for extracting the confounding and mediating attributes that contribute to the decision process (a hypothetical template is sketched after this entry).
We find that the observed disparate treatment can at least in part be attributed to confounding and mediating attributes and model misalignment.
arXiv Detail & Related papers (2023-11-15T00:02:25Z)
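The summary gives no template, but a prompt-based attribute-extraction step might look roughly like the sketch below; the wording, field names, and example question are invented for illustration and are not the paper's prompts.

```python
# Hypothetical sketch of a prompt-based confounder/mediator extraction step.

EXTRACTION_TEMPLATE = """Before answering, analyze the question below.
1. List any CONFOUNDING attributes: background factors that could
   influence both the subject and your judgment (e.g., party affiliation).
2. List any MEDIATING attributes: factors through which the subject's
   properties affect the outcome (e.g., stated policy positions).
3. Answer using only the mediating attributes.

Question: {question}"""

def extraction_prompt(question: str) -> str:
    """Fill the hypothetical template with a concrete question."""
    return EXTRACTION_TEMPLATE.format(question=question)

print(extraction_prompt("Is this politician's statement factually accurate?"))
```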
- Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)
- Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity [61.54815512469125]
This survey addresses the crucial issue of factuality in Large Language Models (LLMs).
As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital.
arXiv Detail & Related papers (2023-10-11T14:18:03Z)
- A Survey on Fairness in Large Language Models [28.05516809190299]
Large Language Models (LLMs) have shown powerful performance and development prospects.
LLMs can capture social biases from unprocessed training data and propagate the biases to downstream tasks.
Unfair LLM systems have undesirable social impacts and potential harms.
arXiv Detail & Related papers (2023-08-20T03:30:22Z)