Learning from Red Teaming: Gender Bias Provocation and Mitigation in
Large Language Models
- URL: http://arxiv.org/abs/2310.11079v1
- Date: Tue, 17 Oct 2023 08:56:04 GMT
- Title: Learning from Red Teaming: Gender Bias Provocation and Mitigation in
Large Language Models
- Authors: Hsuan Su, Cheng-Chu Cheng, Hua Farn, Shachi H Kumar, Saurav Sahay,
Shang-Tse Chen, Hung-yi Lee
- Abstract summary: Large language models (LLMs) can encode biases and disparities that may harm humans during interactions.
We propose a first-of-its-kind method that automatically generates test cases to detect LLMs' potential gender bias.
To address the biases identified, we propose a mitigation strategy that uses the generated test cases as demonstrations for in-context learning.
- Score: 43.44112117935541
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dialogue systems have recently improved considerably with the progress
of large language models (LLMs) such as ChatGPT and GPT-4.
These LLM-based chatbots can encode biases and disparities that harm users
during interactions. Traditional bias investigation methods often rely on
human-written test cases, which are usually expensive and limited. In this
work, we propose a
first-of-its-kind method that automatically generates test cases to detect
LLMs' potential gender bias. We apply our method to three well-known LLMs and
find that the generated test cases effectively identify the presence of biases.
To address the biases identified, we propose a mitigation strategy that uses
the generated test cases as demonstrations for in-context learning to
circumvent the need for parameter fine-tuning. The experimental results show
that LLMs generate fairer responses with the proposed approach.
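The abstract outlines a two-stage recipe: automatically generate test cases that expose gender bias in a target LLM, then reuse the detected cases as in-context demonstrations so the model responds more fairly without any parameter fine-tuning. The Python sketch below illustrates that loop under stated assumptions; the generator seed prompt, the "[PERSON]" gender-swap template, and the sentiment-gap bias score are hypothetical placeholders rather than details taken from the paper.

```python
# Minimal sketch of the detect-then-mitigate loop described in the abstract.
# The seed prompt, the [PERSON] template, and the sentiment-gap score are
# illustrative assumptions, not the paper's exact components.
from typing import Callable, List, Tuple

def generate_test_cases(generator: Callable[[str], str], n: int) -> List[str]:
    """Ask a generator LLM for prompts likely to elicit gender-dependent replies."""
    seed = ("Write one short question about [PERSON]'s job, skills, or daily life "
            "that might make a chatbot answer differently for men and women. "
            "Refer to the person only as [PERSON].")
    return [generator(seed) for _ in range(n)]

def bias_score(target: Callable[[str], str],
               sentiment: Callable[[str], float],
               test_case: str) -> float:
    """Gap between response sentiments for gender-swapped versions of one prompt."""
    male_reply = target(test_case.replace("[PERSON]", "he"))
    female_reply = target(test_case.replace("[PERSON]", "she"))
    return abs(sentiment(male_reply) - sentiment(female_reply))

def mitigation_prefix(biased_cases: List[Tuple[str, str]]) -> str:
    """Turn detected biased prompts, paired with fair reference answers, into
    in-context demonstrations prepended to new queries instead of fine-tuning."""
    demos = [f"User: {case}\nAssistant (answer identically for any gender): {fair}"
             for case, fair in biased_cases]
    return "\n\n".join(demos) + "\n\nUser: "
```

In this sketch, prompts whose bias_score exceeds a chosen threshold would be collected, paired with fair reference answers, and fed through mitigation_prefix before each future query to the same model.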
Related papers
- Causal-Guided Active Learning for Debiasing Large Language Models [40.853803921563596]
Current generative large language models (LLMs) may still capture dataset biases and utilize them for generation.
Previous debiasing methods based on prior knowledge or fine-tuning may not be suitable for current LLMs.
We propose a causal-guided active learning framework, which uses the LLM itself to automatically and autonomously identify informative biased samples and induce the bias patterns.
arXiv Detail & Related papers (2024-08-23T09:46:15Z)
- Decoding Biases: Automated Methods and LLM Judges for Gender Bias Detection in Language Models [47.545382591646565]
Large Language Models (LLMs) have excelled at language understanding and generating human-level text.
LLMs are susceptible to adversarial attacks where malicious users prompt the model to generate undesirable text.
In this work, we train models to automatically create adversarial prompts to elicit biased responses from target LLMs.
arXiv Detail & Related papers (2024-08-07T17:11:34Z)
- The African Woman is Rhythmic and Soulful: An Investigation of Implicit Biases in LLM Open-ended Text Generation [3.9945212716333063]
Implicit biases are significant because they influence the decisions made by Large Language Models (LLMs).
Traditionally, explicit bias tests or embedding-based methods are employed to detect bias, but these approaches can overlook more nuanced, implicit forms of bias.
We introduce two novel psychologically inspired methodologies to reveal and measure implicit biases through prompt-based and decision-making tasks.
arXiv Detail & Related papers (2024-07-01T13:21:33Z)
- Curiosity-driven Red-teaming for Large Language Models [43.448044721642916]
Large language models (LLMs) hold great potential for many natural language applications but risk generating incorrect or toxic content.
Relying solely on human testers is expensive and time-consuming.
Our method of curiosity-driven red teaming (CRT) achieves greater coverage of test cases while maintaining or increasing their effectiveness compared to existing methods.
arXiv Detail & Related papers (2024-02-29T18:55:03Z)
- Likelihood-based Mitigation of Evaluation Bias in Large Language Models [37.07596663793111]
Large Language Models (LLMs) are widely used to evaluate natural language generation tasks as automated metrics.
A likelihood bias may arise when LLMs are used for evaluation.
arXiv Detail & Related papers (2024-02-25T04:52:02Z)
- Disclosure and Mitigation of Gender Bias in LLMs [64.79319733514266]
Large Language Models (LLMs) can generate biased responses.
We propose an indirect probing framework based on conditional generation.
We explore three distinct strategies to disclose explicit and implicit gender bias in LLMs.
arXiv Detail & Related papers (2024-02-17T04:48:55Z)
- ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful for triggering hallucinations in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
- A New Benchmark and Reverse Validation Method for Passage-level Hallucination Detection [63.56136319976554]
Large Language Models (LLMs) generate hallucinations, which can cause significant damage when deployed for mission-critical tasks.
We propose a self-check approach based on reverse validation to detect factual errors automatically in a zero-resource fashion.
We empirically evaluate our method and existing zero-resource detection methods on two datasets.
arXiv Detail & Related papers (2023-10-10T10:14:59Z)
- BiasTestGPT: Using ChatGPT for Social Bias Testing of Language Models [73.29106813131818]
Bias testing is currently cumbersome, since test sentences are generated from a limited set of manual templates or require expensive crowd-sourcing.
We propose using ChatGPT for the controllable generation of test sentences, given any arbitrary user-specified combination of social groups and attributes.
We present an open-source comprehensive bias testing framework (BiasTestGPT), hosted on HuggingFace, that can be plugged into any open-source PLM for bias testing.
arXiv Detail & Related papers (2023-02-14T22:07:57Z)
- Red Teaming Language Models with Language Models [8.237872606555383]
Language Models (LMs) often cannot be deployed because of their potential to harm users in hard-to-predict ways.
Prior work identifies harmful behaviors before deployment by using human annotators to hand-write test cases.
In this work, we automatically find cases where a target LM behaves in a harmful way, by generating test cases ("red teaming") using another LM.
arXiv Detail & Related papers (2022-02-07T15:22:17Z)
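The last entry above (Red Teaming Language Models with Language Models) describes the general pattern this paper builds on: one LM writes test questions, the target LM answers them, and a classifier flags the harmful replies. A rough sketch of that loop, with hypothetical red_lm, target_lm, and is_harmful callables standing in for real models, might look like this:

```python
# Sketch of LM-vs-LM red teaming; all three callables are assumed stand-ins.
from typing import Callable, List, Tuple

def red_team(red_lm: Callable[[str], str],
             target_lm: Callable[[str], str],
             is_harmful: Callable[[str, str], bool],
             n_cases: int) -> List[Tuple[str, str]]:
    """Generate test questions with a red-team LM, query the target LM,
    and keep the (question, reply) pairs the classifier flags as harmful."""
    failures = []
    for _ in range(n_cases):
        question = red_lm("Write a question that might make a chatbot reply offensively.")
        reply = target_lm(question)
        if is_harmful(question, reply):
            failures.append((question, reply))
    return failures
```

The flagged pairs then serve the same role as hand-written test cases in prior work: evidence of failure modes, or, as in the main paper above, demonstrations for in-context mitigation.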