Related papers: Learning from Red Teaming: Gender Bias Provocation and Mitigation in Large Language Models

Learning from Red Teaming: Gender Bias Provocation and Mitigation in Large Language Models

URL: http://arxiv.org/abs/2310.11079v1
Date: Tue, 17 Oct 2023 08:56:04 GMT
Title: Learning from Red Teaming: Gender Bias Provocation and Mitigation in Large Language Models
Authors: Hsuan Su, Cheng-Chu Cheng, Hua Farn, Shachi H Kumar, Saurav Sahay, Shang-Tse Chen, Hung-yi Lee
Abstract summary: Large language models (LLMs) encode potential biases while retaining disparities that can harm humans during interactions. We propose a first-of-its-kind method that automatically generates test cases to detect LLMs' potential gender bias. To address the biases identified, we propose a mitigation strategy that uses the generated test cases as demonstrations for in-context learning.
Score: 43.44112117935541
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recently, researchers have made considerable improvements in dialogue systems with the progress of large language models (LLMs) such as ChatGPT and GPT-4. These LLM-based chatbots encode the potential biases while retaining disparities that can harm humans during interactions. The traditional biases investigation methods often rely on human-written test cases. However, these test cases are usually expensive and limited. In this work, we propose a first-of-its-kind method that automatically generates test cases to detect LLMs' potential gender bias. We apply our method to three well-known LLMs and find that the generated test cases effectively identify the presence of biases. To address the biases identified, we propose a mitigation strategy that uses the generated test cases as demonstrations for in-context learning to circumvent the need for parameter fine-tuning. The experimental results show that LLMs generate fairer responses with the proposed approach.

Related papers

Mitigating Spurious Correlations in LLMs via Causality-Aware Post-Training [57.03005244917803]
Large language models (LLMs) often fail on out-of-distribution (OOD) samples due to spurious correlations acquired during pre-training.<n>Here, we aim to mitigate such spurious correlations through causality-aware post-training (CAPT)<n> Experiments on the formal causal inference benchmark CLadder and the logical reasoning dataset PrOntoQA show that 3B-scale language models fine-tuned with CAPT can outperform both traditional SFT and larger LLMs on in-distribution (ID) and OOD tasks.
arXiv Detail & Related papers (2025-06-11T06:30:28Z)
Cognitive Biases in Large Language Models: A Survey and Mitigation Experiments [24.15688619889342]
Large Language Models (LLMs) are trained on large corpora written by humans and demonstrate high performance on various tasks. As humans are susceptible to cognitive biases, LLMs can also be influenced by these biases, leading to irrational decision-making.
arXiv Detail & Related papers (2024-11-30T02:37:59Z)
Using Large Language Models for Expert Prior Elicitation in Predictive Modelling [53.54623137152208]
This study proposes using large language models (LLMs) to elicit expert prior distributions for predictive models. We compare LLM-elicited and uninformative priors, evaluate whether LLMs truthfully generate parameter distributions, and propose a model selection strategy for in-context learning and prior elicitation. Our findings show that LLM-elicited prior parameter distributions significantly reduce predictive error compared to uninformative priors in low-data settings.
arXiv Detail & Related papers (2024-11-26T10:13:39Z)
Causal-Guided Active Learning for Debiasing Large Language Models [40.853803921563596]
Current generative large language models (LLMs) may still capture dataset biases and utilize them for generation. Previous prior-knowledge-based debiasing methods and fine-tuning-based debiasing methods may not be suitable for current LLMs. We propose a casual-guided active learning framework, which utilizes LLMs itself to automatically and autonomously identify informative biased samples and induce the bias patterns.
arXiv Detail & Related papers (2024-08-23T09:46:15Z)
Decoding Biases: Automated Methods and LLM Judges for Gender Bias Detection in Language Models [47.545382591646565]
Large Language Models (LLMs) have excelled at language understanding and generating human-level text. LLMs are susceptible to adversarial attacks where malicious users prompt the model to generate undesirable text. In this work, we train models to automatically create adversarial prompts to elicit biased responses from target LLMs.
arXiv Detail & Related papers (2024-08-07T17:11:34Z)
The African Woman is Rhythmic and Soulful: An Investigation of Implicit Biases in LLM Open-ended Text Generation [3.9945212716333063]
Implicit biases are significant because they influence the decisions made by Large Language Models (LLMs) Traditionally, explicit bias tests or embedding-based methods are employed to detect bias, but these approaches can overlook more nuanced, implicit forms of bias. We introduce two novel psychological-inspired methodologies to reveal and measure implicit biases through prompt-based and decision-making tasks.
arXiv Detail & Related papers (2024-07-01T13:21:33Z)
Curiosity-driven Red-teaming for Large Language Models [43.448044721642916]
Large language models (LLMs) hold great potential for many natural language applications but risk generating incorrect or toxic content. relying solely on human testers is expensive and time-consuming. Our method of curiosity-driven red teaming (CRT) achieves greater coverage of test cases while mantaining or increasing their effectiveness compared to existing methods.
arXiv Detail & Related papers (2024-02-29T18:55:03Z)
Likelihood-based Mitigation of Evaluation Bias in Large Language Models [37.07596663793111]
Large Language Models (LLMs) are widely used to evaluate natural language generation tasks as automated metrics. It is possible that there might be a likelihood bias if LLMs are used for evaluation.
arXiv Detail & Related papers (2024-02-25T04:52:02Z)
Disclosure and Mitigation of Gender Bias in LLMs [64.79319733514266]
Large Language Models (LLMs) can generate biased responses. We propose an indirect probing framework based on conditional generation. We explore three distinct strategies to disclose explicit and implicit gender bias in LLMs.
arXiv Detail & Related papers (2024-02-17T04:48:55Z)
ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases. We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets. Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
A New Benchmark and Reverse Validation Method for Passage-level Hallucination Detection [63.56136319976554]
Large Language Models (LLMs) generate hallucinations, which can cause significant damage when deployed for mission-critical tasks. We propose a self-check approach based on reverse validation to detect factual errors automatically in a zero-resource fashion. We empirically evaluate our method and existing zero-resource detection methods on two datasets.
arXiv Detail & Related papers (2023-10-10T10:14:59Z)
BiasTestGPT: Using ChatGPT for Social Bias Testing of Language Models [73.29106813131818]
bias testing is currently cumbersome since the test sentences are generated from a limited set of manual templates or need expensive crowd-sourcing. We propose using ChatGPT for the controllable generation of test sentences, given any arbitrary user-specified combination of social groups and attributes. We present an open-source comprehensive bias testing framework (BiasTestGPT), hosted on HuggingFace, that can be plugged into any open-source PLM for bias testing.
arXiv Detail & Related papers (2023-02-14T22:07:57Z)
Red Teaming Language Models with Language Models [8.237872606555383]
Language Models (LMs) often cannot be deployed because of their potential to harm users in hard-to-predict ways. Prior work identifies harmful behaviors before deployment by using human annotators to hand-write test cases. In this work, we automatically find cases where a target LM behaves in a harmful way, by generating test cases ("red teaming") using another LM.
arXiv Detail & Related papers (2022-02-07T15:22:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.