Learning from Red Teaming: Gender Bias Provocation and Mitigation in
Large Language Models
- URL: http://arxiv.org/abs/2310.11079v1
- Date: Tue, 17 Oct 2023 08:56:04 GMT
- Title: Learning from Red Teaming: Gender Bias Provocation and Mitigation in
Large Language Models
- Authors: Hsuan Su, Cheng-Chu Cheng, Hua Farn, Shachi H Kumar, Saurav Sahay,
Shang-Tse Chen, Hung-yi Lee
- Abstract summary: Large language models (LLMs) encode potential biases while retaining disparities that can harm humans during interactions.
We propose a first-of-its-kind method that automatically generates test cases to detect LLMs' potential gender bias.
To address the biases identified, we propose a mitigation strategy that uses the generated test cases as demonstrations for in-context learning.
- Score: 43.44112117935541
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, researchers have made considerable improvements in dialogue systems
with the progress of large language models (LLMs) such as ChatGPT and GPT-4.
These LLM-based chatbots encode the potential biases while retaining
disparities that can harm humans during interactions. The traditional biases
investigation methods often rely on human-written test cases. However, these
test cases are usually expensive and limited. In this work, we propose a
first-of-its-kind method that automatically generates test cases to detect
LLMs' potential gender bias. We apply our method to three well-known LLMs and
find that the generated test cases effectively identify the presence of biases.
To address the biases identified, we propose a mitigation strategy that uses
the generated test cases as demonstrations for in-context learning to
circumvent the need for parameter fine-tuning. The experimental results show
that LLMs generate fairer responses with the proposed approach.
Related papers
- Curiosity-driven Red-teaming for Large Language Models [43.448044721642916]
Large language models (LLMs) hold great potential for many natural language applications but risk generating incorrect or toxic content.
relying solely on human testers is expensive and time-consuming.
Our method of curiosity-driven red teaming (CRT) achieves greater coverage of test cases while mantaining or increasing their effectiveness compared to existing methods.
arXiv Detail & Related papers (2024-02-29T18:55:03Z) - Likelihood-based Mitigation of Evaluation Bias in Large Language Models [39.77680080235204]
Large Language Models (LLMs) are widely used to evaluate natural language generation tasks as automated metrics.
It is possible that there might be a likelihood bias if LLMs are used for evaluation.
arXiv Detail & Related papers (2024-02-25T04:52:02Z) - Disclosure and Mitigation of Gender Bias in LLMs [64.79319733514266]
Large Language Models (LLMs) can generate biased responses.
We propose an indirect probing framework based on conditional generation.
We explore three distinct strategies to disclose explicit and implicit gender bias in LLMs.
arXiv Detail & Related papers (2024-02-17T04:48:55Z) - ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z) - A New Benchmark and Reverse Validation Method for Passage-level
Hallucination Detection [63.56136319976554]
Large Language Models (LLMs) generate hallucinations, which can cause significant damage when deployed for mission-critical tasks.
We propose a self-check approach based on reverse validation to detect factual errors automatically in a zero-resource fashion.
We empirically evaluate our method and existing zero-resource detection methods on two datasets.
arXiv Detail & Related papers (2023-10-10T10:14:59Z) - Efficiently Measuring the Cognitive Ability of LLMs: An Adaptive Testing
Perspective [63.92197404447808]
Large language models (LLMs) have shown some human-like cognitive abilities.
We propose an adaptive testing framework for LLM evaluation.
This approach dynamically adjusts the characteristics of the test questions, such as difficulty, based on the model's performance.
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - Sources of Hallucination by Large Language Models on Inference Tasks [16.644096408742325]
Large Language Models (LLMs) are claimed to be capable of Natural Language Inference (NLI)
We present a series of behavioral studies on several LLM families which probe their behavior using controlled experiments.
arXiv Detail & Related papers (2023-05-23T22:24:44Z) - Testing Occupational Gender Bias in Language Models: Towards Robust Measurement and Zero-Shot Debiasing [98.07536837448293]
Large language models (LLMs) have been shown to exhibit a variety of harmful, human-like biases against various demographics.
We introduce a list of desiderata for robustly measuring biases in generative language models.
We then use this benchmark to test several state-of-the-art open-source LLMs, including Llama, Mistral, and their instruction-tuned versions.
arXiv Detail & Related papers (2022-12-20T22:41:24Z) - Red Teaming Language Models with Language Models [8.237872606555383]
Language Models (LMs) often cannot be deployed because of their potential to harm users in hard-to-predict ways.
Prior work identifies harmful behaviors before deployment by using human annotators to hand-write test cases.
In this work, we automatically find cases where a target LM behaves in a harmful way, by generating test cases ("red teaming") using another LM.
arXiv Detail & Related papers (2022-02-07T15:22:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.