Evaluating Gender Bias in Large Language Models via Chain-of-Thought
Prompting
- URL: http://arxiv.org/abs/2401.15585v1
- Date: Sun, 28 Jan 2024 06:50:10 GMT
- Title: Evaluating Gender Bias in Large Language Models via Chain-of-Thought
Prompting
- Authors: Masahiro Kaneko, Danushka Bollegala, Naoaki Okazaki, Timothy Baldwin
- Abstract summary: Large language models (LLMs) equipped with Chain-of-Thought (CoT) prompting are able to make accurate incremental predictions even on unscalable tasks.
This study examines the impact of LLMs' step-by-step predictions on gender bias in unscalable tasks.
- Score: 87.30837365008931
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There exist both scalable tasks, like reading comprehension and
fact-checking, where model performance improves with model size, and unscalable
tasks, like arithmetic reasoning and symbolic reasoning, where model
performance does not necessarily improve with model size. Large language models
(LLMs) equipped with Chain-of-Thought (CoT) prompting are able to make accurate
incremental predictions even on unscalable tasks. Unfortunately, despite their
exceptional reasoning abilities, LLMs tend to internalize and reproduce
discriminatory societal biases. Whether CoT can provide discriminatory or
egalitarian rationalizations for the implicit information in unscalable tasks
remains an open question.
In this study, we examine the impact of LLMs' step-by-step predictions on
gender bias in unscalable tasks. For this purpose, we construct a benchmark for
an unscalable task where the LLM is given a list of words comprising feminine,
masculine, and gendered occupational words, and is required to count the number
of feminine and masculine words. In our CoT prompts, we require the LLM to
explicitly indicate whether each word in the word list is feminine or
masculine before making its final prediction. Because it involves both
counting and handling the meanings of words, this benchmark has
characteristics of both arithmetic reasoning and symbolic reasoning.
Experimental results in English show that
without step-by-step prediction, most LLMs make socially biased predictions,
despite the task being as simple as counting words. Interestingly, CoT
prompting reduces this unconscious social bias in LLMs and encourages fair
predictions.
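To make the task concrete, here is a minimal sketch of how a benchmark
instance and the two prompting conditions might look. The word pools, list
sizes, and prompt wording below are illustrative assumptions, not the paper's
actual data or templates.

```python
import random

# Hypothetical word pools; the paper's actual lists are not reproduced here.
FEMININE = ["she", "her", "mother", "queen"]
MASCULINE = ["he", "him", "father", "king"]
OCCUPATIONS = ["nurse", "doctor", "secretary", "engineer"]  # stereotyped, not lexically gendered

def make_instance(n_fem=3, n_masc=2, n_occ=3, seed=0):
    """Build one shuffled word list plus its gold feminine/masculine counts."""
    rng = random.Random(seed)
    words = (rng.sample(FEMININE, n_fem)
             + rng.sample(MASCULINE, n_masc)
             + rng.sample(OCCUPATIONS, n_occ))
    rng.shuffle(words)
    return words, {"feminine": n_fem, "masculine": n_masc}

# Direct prompt: ask for the counts only.
DIRECT_PROMPT = ("Count the number of feminine words and masculine words in "
                 "the following list. Answer with the two counts only.\n"
                 "List: {words}")

# CoT prompt: force a per-word gender judgment before the final counts.
COT_PROMPT = ("For each word in the following list, state step by step "
              "whether it is a feminine word, a masculine word, or neither. "
              "Then report the total number of feminine and masculine words.\n"
              "List: {words}")

words, gold = make_instance()
print(COT_PROMPT.format(words=", ".join(words)))
print(gold)  # e.g. {'feminine': 3, 'masculine': 2}
```

Under the direct prompt, a biased model can silently count a stereotyped
occupation such as "nurse" toward the feminine total; the CoT prompt makes it
commit to an explicit judgment for every word, which the experiments above
find encourages fairer predictions.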
Related papers
- Learning to Generate Explainable Stock Predictions using Self-Reflective Large Language Models [54.21695754082441]
We propose a framework to teach Large Language Models (LLMs) to generate explainable stock predictions.
A reflective agent learns how to explain past stock movements through self-reasoning, while the PPO trainer trains the model to generate the most likely explanations.
Our framework can outperform both traditional deep-learning and LLM methods in prediction accuracy and Matthews correlation coefficient.
arXiv Detail & Related papers (2024-02-06T03:18:58Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- Probing Explicit and Implicit Gender Bias through LLM Conditional Text Generation [64.79319733514266]
Large Language Models (LLMs) can generate biased and toxic responses.
We propose a conditional text generation mechanism without the need for predefined gender phrases and stereotypes.
arXiv Detail & Related papers (2023-11-01T05:31:46Z)
- "I'd Like to Have an Argument, Please": Argumentative Reasoning in Large Language Models [0.0]
We evaluate the ability of two large language models (LLMs) to perform argumentative reasoning.
We find that, scoring-wise, the LLMs match or surpass the SOTA in argument mining (AM) and argument pair extraction (APE).
However, statistical analysis of the LLMs' outputs when subjected to small, yet still human-readable, alterations in the I/O representations showed that the models are not performing reasoning.
arXiv Detail & Related papers (2023-09-29T02:41:38Z)
- Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds [59.71218039095155]
We evaluate language understanding capacities on simple inference tasks that most humans find trivial.
We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments.
The models exhibit moderate to low performance on these evaluation sets.
arXiv Detail & Related papers (2023-05-24T06:41:09Z)
- ThinkSum: Probabilistic reasoning over sets using large language models [18.123895485602244]
We propose a two-stage probabilistic inference paradigm, ThinkSum, which reasons over sets of objects or facts in a structured manner.
We demonstrate the possibilities and advantages of ThinkSum on the BIG-bench suite of LLM evaluation tasks (a minimal sketch of the two-stage pattern follows this list).
arXiv Detail & Related papers (2022-10-04T00:34:01Z)
- Underspecification in Language Modeling Tasks: A Causality-Informed Study of Gendered Pronoun Resolution [0.0]
We introduce a simple causal mechanism to describe the role underspecification plays in the generation of spurious correlations.
Despite its simplicity, our causal model directly informs the development of two lightweight black-box evaluation methods.
arXiv Detail & Related papers (2022-09-30T23:10:11Z)
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: pre-trained models succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
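As referenced in the ThinkSum entry above, here is a minimal sketch of what a
two-stage think-then-sum loop could look like. The `query_lm` interface, the
label set, and the additive aggregation rule are assumptions made for
illustration, not the paper's exact formulation.

```python
from typing import Callable, Dict, List

def think_sum(items: List[str],
              query_lm: Callable[[str], Dict[str, float]],
              labels: List[str]) -> str:
    """Two-stage set reasoning in the spirit of ThinkSum (assumed interface).

    Think: query the language model once per set element for label probabilities.
    Sum:   aggregate the per-element distributions and return the best label.
    """
    totals = {label: 0.0 for label in labels}
    for item in items:
        probs = query_lm(item)                       # "Think": one LM call per element
        for label in labels:
            totals[label] += probs.get(label, 0.0)   # "Sum": probabilistic aggregation
    return max(totals, key=totals.get)

# Usage with a stub standing in for a real LM: decide whether a set of words
# is mostly animals or mostly tools.
def stub_lm(word: str) -> Dict[str, float]:
    animals = {"cat", "dog", "horse"}
    return {"animal": 0.9, "tool": 0.1} if word in animals else {"animal": 0.1, "tool": 0.9}

print(think_sum(["cat", "dog", "hammer"], stub_lm, ["animal", "tool"]))  # animal
```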
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.