Capturing Failures of Large Language Models via Human Cognitive Biases
- URL: http://arxiv.org/abs/2202.12299v1
- Date: Thu, 24 Feb 2022 18:58:52 GMT
- Title: Capturing Failures of Large Language Models via Human Cognitive Biases
- Authors: Erik Jones, Jacob Steinhardt
- Abstract summary: We show that OpenAI's Codex errs based on how the input prompt is framed, adjusts outputs towards anchors, and is biased towards outputs that mimic frequent training examples.
Our experiments suggest that cognitive science can be a useful jumping-off point to better understand how contemporary machine learning systems behave.
- Score: 18.397404180932373
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models generate complex, open-ended outputs: instead of
outputting a single class, they can write summaries, generate dialogue, and
produce working code. In order to study the reliability of these open-ended
systems, we must understand not just when they fail, but also how they fail. To
approach this, we draw inspiration from human cognitive biases -- systematic
patterns of deviation from rational judgement. Specifically, we use cognitive
biases to (i) identify inputs that models are likely to err on, and (ii)
develop tests to qualitatively characterize their errors on these inputs. Using
code generation as a case study, we find that OpenAI's Codex errs predictably
based on how the input prompt is framed, adjusts outputs towards anchors, and
is biased towards outputs that mimic frequent training examples. We then use
our framework to uncover high-impact errors such as incorrectly deleting files.
Our experiments suggest that cognitive science can be a useful jumping-off
point to better understand how contemporary machine learning systems behave.
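
The abstract describes the probing approach only at a high level. As a rough illustration of how one might test for the anchoring effect it mentions, the sketch below compares a code-generation model's completion of a task prompt with and without an irrelevant "anchor" solution prepended, and flags whether the completion copies lines from the anchor. This is a minimal sketch, not the paper's actual experimental setup: the `generate` wrapper, the example task, the anchor snippet, and the copy-detection heuristic are all hypothetical placeholders.

```python
from typing import Callable

# Hypothetical function-completion task (not taken from the paper).
TASK_PROMPT = '''def sum_of_squares(nums):
    """Return the sum of the squares of the numbers in nums."""
'''

# Hypothetical anchor: an unrelated solution prepended to the prompt.
# The anchoring hypothesis is that the model drifts toward reusing it.
ANCHOR_SNIPPET = '''def product_of_cubes(nums):
    total = 1
    for x in nums:
        total *= x ** 3
    return total

'''


def anchored_prompt(task: str, anchor: str) -> str:
    """Prepend an irrelevant anchor solution to the task prompt."""
    return anchor + task


def copies_anchor_lines(completion: str, anchor: str) -> bool:
    """Crude heuristic: does the completion reuse any non-trivial anchor line?"""
    anchor_lines = {line.strip() for line in anchor.splitlines() if len(line.strip()) > 10}
    return any(line.strip() in anchor_lines for line in completion.splitlines())


def probe_anchoring(generate: Callable[[str], str]) -> dict:
    """Compare completions with and without the anchor.

    `generate` is any text-completion function (e.g., a wrapper around a
    code-generation model); it is deliberately left abstract here.
    """
    baseline = generate(TASK_PROMPT)
    anchored = generate(anchored_prompt(TASK_PROMPT, ANCHOR_SNIPPET))
    return {
        "baseline_copies_anchor": copies_anchor_lines(baseline, ANCHOR_SNIPPET),
        "anchored_copies_anchor": copies_anchor_lines(anchored, ANCHOR_SNIPPET),
    }
```

In use, one would pass any completion function, e.g. `probe_anchoring(lambda p: my_model.complete(p))`; a completion that reuses anchor lines only in the anchored condition would be consistent with the anchoring behavior the paper reports.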
Related papers
- DISCERN: Decoding Systematic Errors in Natural Language for Text Classifiers [18.279429202248632]
We introduce DISCERN, a framework for interpreting systematic biases in text classifiers using language explanations.
DISCERN iteratively generates precise natural language descriptions of systematic errors by employing an interactive loop between two large language models.
We show that users interpret systematic biases more effectively (by over 25% relative) and more efficiently when the biases are described through language explanations rather than cluster exemplars.
arXiv Detail & Related papers (2024-10-29T17:04:55Z)
- Collapsed Language Models Promote Fairness [88.48232731113306]
We find that debiased language models exhibit collapsed alignment between token representations and word embeddings.
We design a principled fine-tuning method that can effectively improve fairness in a wide range of debiasing methods.
arXiv Detail & Related papers (2024-10-06T13:09:48Z)
- Self-Debiasing Large Language Models: Zero-Shot Recognition and Reduction of Stereotypes [73.12947922129261]
We leverage the zero-shot capabilities of large language models to reduce stereotyping.
We show that self-debiasing can significantly reduce the degree of stereotyping across nine different social groups.
We hope this work opens inquiry into other zero-shot techniques for bias mitigation.
arXiv Detail & Related papers (2024-02-03T01:40:11Z)
- Explore, Establish, Exploit: Red Teaming Language Models from Scratch [7.949645304649025]
We consider red-teaming "from scratch," in which the adversary does not begin with a way to classify failures.
We use this approach to red-team GPT-3 to discover classes of inputs that elicit false statements.
arXiv Detail & Related papers (2023-06-15T18:49:50Z)
- Fairness-guided Few-shot Prompting for Large Language Models [93.05624064699965]
In-context learning can suffer from high instability due to variations in training examples, example order, and prompt formats.
We introduce a metric to evaluate the predictive bias of a fixed prompt against labels or given attributes.
We propose a novel greedy search strategy to identify a near-optimal prompt for improving the performance of in-context learning.
arXiv Detail & Related papers (2023-03-23T12:28:25Z)
- Debiasing Vision-Language Models via Biased Prompts [79.04467131711775]
We propose a general approach for debiasing vision-language foundation models by projecting out biased directions in the text embedding.
We show that debiasing only the text embedding with a calibrated projection matrix suffices to yield robust classifiers and fair generative models.
arXiv Detail & Related papers (2023-01-31T20:09:33Z)
- Discovering Latent Knowledge in Language Models Without Supervision [72.95136739040676]
Existing techniques for training language models can be misaligned with the truth.
We propose directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way.
We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models.
arXiv Detail & Related papers (2022-12-07T18:17:56Z)
- A Simple, Yet Effective Approach to Finding Biases in Code Generation [16.094062131137722]
This work shows that current code generation systems exhibit undesired biases inherited from their large language model backbones.
We propose the "block of influence" concept, which enables a modular decomposition and analysis of the coding challenges.
arXiv Detail & Related papers (2022-10-31T15:06:15Z)
- Interpretable Data-Based Explanations for Fairness Debugging [7.266116143672294]
Gopher is a system that produces compact, interpretable, and causal explanations for bias or unexpected model behavior.
We introduce the concept of causal responsibility that quantifies the extent to which intervening on training data by removing or updating subsets of it can resolve the bias.
Building on this concept, we develop an efficient approach for generating the top-k patterns that explain model bias.
arXiv Detail & Related papers (2021-12-17T20:10:00Z)
- Discovering and Validating AI Errors With Crowdsourced Failure Reports [10.4818618376202]
We introduce crowdsourced failure reports, end-user descriptions of how or why a model failed, and show how developers can use them to detect AI errors.
We also design and implement Deblinder, a visual analytics system for synthesizing failure reports.
In semi-structured interviews and think-aloud studies with 10 AI practitioners, we explore the affordances of the Deblinder system and the applicability of failure reports in real-world settings.
arXiv Detail & Related papers (2021-09-23T23:26:59Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.