Capturing Failures of Large Language Models via Human Cognitive Biases
- URL: http://arxiv.org/abs/2202.12299v1
- Date: Thu, 24 Feb 2022 18:58:52 GMT
- Title: Capturing Failures of Large Language Models via Human Cognitive Biases
- Authors: Erik Jones, Jacob Steinhardt
- Abstract summary: We show that OpenAI's Codex errs based on how the input prompt is framed, adjusts outputs towards anchors, and is biased towards outputs that mimic frequent training examples.
Our experiments suggest that cognitive science can be a useful jumping-off point to better understand how contemporary machine learning systems behave.
- Score: 18.397404180932373
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models generate complex, open-ended outputs: instead of
outputting a single class, they can write summaries, generate dialogue, and
produce working code. In order to study the reliability of these open-ended
systems, we must understand not just when they fail, but also how they fail. To
approach this, we draw inspiration from human cognitive biases -- systematic
patterns of deviation from rational judgement. Specifically, we use cognitive
biases to (i) identify inputs that models are likely to err on, and (ii)
develop tests to qualitatively characterize their errors on these inputs. Using
code generation as a case study, we find that OpenAI's Codex errs predictably
based on how the input prompt is framed, adjusts outputs towards anchors, and
is biased towards outputs that mimic frequent training examples. We then use
our framework to uncover high-impact errors such as incorrectly deleting files.
Our experiments suggest that cognitive science can be a useful jumping-off
point to better understand how contemporary machine learning systems behave.
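As a rough illustration of the anchoring experiments the abstract describes, the sketch below prepends a misleading "anchor" solution to a code-generation prompt and checks whether the model's output copies lines from it. The `generate` function is a hypothetical stand-in for any code-completion API, not the authors' actual setup.

```python
# Minimal sketch of an anchoring probe for a code generation model.
# NOTE: `generate` is a hypothetical stand-in; replace it with a real model call.

def generate(prompt: str) -> str:
    # Dummy completion so the sketch runs end to end.
    return "    return sorted(lst)\n"

TASK = 'def sort_descending(lst):\n    """Return lst sorted in descending order."""\n'

# A plausible-looking but wrong "anchor" solution shown before the real task.
ANCHOR = TASK + "    return sorted(lst)  # wrong: ascending\n"

def copies_anchor(output: str, anchor: str) -> bool:
    """Flag outputs that reproduce any non-empty line of the anchor."""
    anchor_lines = {l.strip() for l in anchor.splitlines() if l.strip()}
    return any(l.strip() in anchor_lines for l in output.splitlines() if l.strip())

baseline = generate(TASK)
anchored = generate(ANCHOR + "\n" + TASK)
print("baseline copies anchor:", copies_anchor(baseline, ANCHOR))
print("anchored copies anchor:", copies_anchor(anchored, ANCHOR))
```

Comparing the copy rate with and without the anchor is the kind of qualitative error characterization the framework proposes.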
Related papers
- Self-Debiasing Large Language Models: Zero-Shot Recognition and Reduction of Stereotypes [73.12947922129261]
We leverage the zero-shot capabilities of large language models to reduce stereotyping.
We show that self-debiasing can significantly reduce the degree of stereotyping across nine different social groups.
We hope this work opens inquiry into other zero-shot techniques for bias mitigation.
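A minimal sketch of what zero-shot self-debiasing by reprompting might look like; the instruction text and the `ask` interface are illustrative assumptions, not the paper's exact prompts.

```python
# Sketch of zero-shot self-debiasing via reprompting.
# `ask` is any chat/completion callable; the instruction wording is illustrative.

DEBIAS_INSTRUCTION = (
    "Answer again, removing any reliance on stereotypes about social groups.\n\n"
)

def self_debias(ask, question: str) -> str:
    """Two passes: answer normally, then reprompt with a debiasing instruction."""
    first = ask(question)
    return ask(DEBIAS_INSTRUCTION + question + "\nFirst answer: " + first)

# Usage with a dummy model:
print(self_debias(lambda p: f"[model output for: {p[:40]}...]",
                  "Who is likely to be a nurse?"))
```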
arXiv Detail & Related papers (2024-02-03T01:40:11Z)
- Explore, Establish, Exploit: Red Teaming Language Models from Scratch [7.949645304649025]
We consider red-teaming "from scratch," in which the adversary does not begin with a way to classify failures.
We use this approach to red-team GPT-3 to discover classes of inputs that elicit false statements.
arXiv Detail & Related papers (2023-06-15T18:49:50Z)
- Fairness-guided Few-shot Prompting for Large Language Models [93.05624064699965]
In-context learning can suffer from high instability due to variations in training examples, example order, and prompt formats.
We introduce a metric to evaluate the predictive bias of a fixed prompt against labels or given attributes.
We propose a novel greedy search strategy to identify the near-optimal prompt for improving the performance of in-context learning.
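As a loose sketch of this idea, the snippet below greedily assembles a k-shot prompt by repeatedly adding whichever demonstration minimizes a predictive-bias score; `bias_score` is an assumed helper, and the paper's actual metric and search details differ.

```python
# Greedy selection of in-context demonstrations guided by a bias score.
# `bias_score(prompt)` is an assumed helper: lower means fairer predictions.

def greedy_prompt_search(candidates, bias_score, k=4):
    """Build a k-shot prompt one demonstration at a time, always adding the
    candidate that yields the lowest predictive bias so far."""
    chosen, pool = [], list(candidates)
    for _ in range(min(k, len(pool))):
        best = min(pool, key=lambda d: bias_score("\n".join(chosen + [d])))
        chosen.append(best)
        pool.remove(best)
    return "\n".join(chosen)

# Usage with a toy score (here: prefer shorter prompts):
demos = ["Q: 2+2? A: 4", "Q: sky color? A: blue", "Q: 1+1? A: 2"]
print(greedy_prompt_search(demos, bias_score=len, k=2))
```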
arXiv Detail & Related papers (2023-03-23T12:28:25Z)
- Debiasing Vision-Language Models via Biased Prompts [79.04467131711775]
We propose a general approach for debiasing vision-language foundation models by projecting out biased directions in the text embedding.
We show that debiasing only the text embedding with a calibrated projection matrix suffices to yield robust classifiers and fair generative models.
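The sketch below shows plain orthogonal projection of embeddings away from biased directions; the paper's calibrated projection matrix adds a calibration step not reproduced here, and the dimensions are illustrative.

```python
import numpy as np

def orthogonal_debias(directions: np.ndarray) -> np.ndarray:
    """Return P = I - V (V^T V)^{-1} V^T, which maps embeddings onto the
    subspace orthogonal to the biased directions (rows of `directions`)."""
    V = directions.T                      # columns span the biased subspace
    gram_inv = np.linalg.inv(V.T @ V)
    return np.eye(V.shape[0]) - V @ gram_inv @ V.T

# Usage: project a text embedding z away from one (made-up) bias direction.
rng = np.random.default_rng(0)
bias_dirs = rng.normal(size=(1, 512))     # e.g., a "gender" direction
z = rng.normal(size=512)
z_debiased = orthogonal_debias(bias_dirs) @ z
print(float(bias_dirs[0] @ z_debiased))   # ~0: component along the bias removed
```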
arXiv Detail & Related papers (2023-01-31T20:09:33Z)
- Discovering Latent Knowledge in Language Models Without Supervision [72.95136739040676]
Existing techniques for training language models can produce outputs that are misaligned with the truth.
We propose directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way.
We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models.
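A compact sketch of the contrast-consistency idea behind this method: train a probe on internal activations so a statement and its negation get complementary probabilities, with a confidence term ruling out the trivial 0.5 answer. The hidden size, batch, and training-loop details are illustrative, not the authors' code.

```python
import torch

# Probe over hidden activations; 768 is an illustrative hidden size.
probe = torch.nn.Sequential(torch.nn.Linear(768, 1), torch.nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

def contrast_consistency_loss(acts_pos, acts_neg):
    """acts_pos / acts_neg: activations for a statement and its negation.
    Consistency pushes p(x+) toward 1 - p(x-); confidence penalizes the
    degenerate solution where both probabilities sit at 0.5."""
    p_pos, p_neg = probe(acts_pos), probe(acts_neg)
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

# One illustrative optimization step on random activations:
acts_pos, acts_neg = torch.randn(32, 768), torch.randn(32, 768)
loss = contrast_consistency_loss(acts_pos, acts_neg)
loss.backward()
opt.step()
print(float(loss))
```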
arXiv Detail & Related papers (2022-12-07T18:17:56Z)
- A Simple, Yet Effective Approach to Finding Biases in Code Generation [16.094062131137722]
This work shows that current code generation systems exhibit undesired biases inherited from their large language model backbones.
We propose the "block of influence" concept, which enables a modular decomposition and analysis of the coding challenges.
arXiv Detail & Related papers (2022-10-31T15:06:15Z)
- Interpretable Data-Based Explanations for Fairness Debugging [7.266116143672294]
Gopher is a system that produces compact, interpretable, and causal explanations for bias or unexpected model behavior.
We introduce the concept of causal responsibility that quantifies the extent to which intervening on training data by removing or updating subsets of it can resolve the bias.
Building on this concept, we develop an efficient approach for generating the top-k patterns that explain model bias.
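The sketch below captures only the core intuition of causal responsibility: retrain without a candidate subset and measure how much of the bias disappears. The paper adds normalization and efficiency machinery omitted here; `train` and `bias` are assumed callables.

```python
def responsibility(subset, data, train, bias):
    """Fraction of the original bias resolved by removing `subset` (indices)
    from the training data (0 = no effect, 1 = bias fully resolved)."""
    base = bias(train(data))
    remaining = [x for i, x in enumerate(data) if i not in set(subset)]
    resolved = base - bias(train(remaining))
    return resolved / base if base else 0.0

# Usage with toy stand-ins: the "model" is the data mean, "bias" its distance from 0.
data = [5.0, 5.0, -1.0, 0.5]
train = lambda d: sum(d) / len(d)
bias = lambda model: abs(model)
print(responsibility({0, 1}, data, train, bias))  # removing the 5.0s resolves most bias
```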
arXiv Detail & Related papers (2021-12-17T20:10:00Z)
- AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can detect oversensitivity- and overstability-causing samples with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
- Discovering and Validating AI Errors With Crowdsourced Failure Reports [10.4818618376202]
We introduce crowdsourced failure reports, end-user descriptions of how or why a model failed, and show how developers can use them to detect AI errors.
We also design and implement Deblinder, a visual analytics system for synthesizing failure reports.
In semi-structured interviews and think-aloud studies with 10 AI practitioners, we explore the affordances of the Deblinder system and the applicability of failure reports in real-world settings.
arXiv Detail & Related papers (2021-09-23T23:26:59Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected, but the degree of impact varies.
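As a toy illustration of simulating such errors on clean text (the paper harvests real learner errors and uses adversarial search; the rules below are illustrative only):

```python
import random

random.seed(0)

def drop_articles(tokens):
    """Randomly drop articles, a common non-native error."""
    return [t for t in tokens
            if t.lower() not in {"a", "an", "the"} or random.random() < 0.5]

def break_agreement(tokens):
    """Randomly flip subject-verb agreement on common verbs."""
    swaps = {"is": "are", "are": "is", "has": "have", "have": "has"}
    return [swaps[t] if t in swaps and random.random() < 0.3 else t
            for t in tokens]

def perturb(sentence: str) -> str:
    return " ".join(drop_articles(break_agreement(sentence.split())))

print(perturb("The results are strong and the model has improved."))
```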
arXiv Detail & Related papers (2020-05-12T11:01:44Z)
- A Hierarchy of Limitations in Machine Learning [0.0]
This paper attempts a comprehensive, structured overview of the specific conceptual, procedural, and statistical limitations of models in machine learning when applied to society.
Modelers themselves can use the described hierarchy to identify possible failure points and think through how to address them.
Consumers of machine learning models can know what to question when deciding whether, where, and how to apply machine learning.
arXiv Detail & Related papers (2020-02-12T19:39:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.