Overthinking the Truth: Understanding how Language Models Process False Demonstrations
- URL: http://arxiv.org/abs/2307.09476v3
- Date: Tue, 12 Mar 2024 07:00:02 GMT
- Title: Overthinking the Truth: Understanding how Language Models Process False Demonstrations
- Authors: Danny Halawi, Jean-Stanislas Denain, Jacob Steinhardt
- Abstract summary: We study harmful imitation through the lens of a model's internal representations.
We identify two related phenomena: "overthinking" and "false induction heads".
- Score: 32.29658741345911
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern language models can imitate complex patterns through few-shot
learning, enabling them to complete challenging tasks without fine-tuning.
However, imitation can also lead models to reproduce inaccuracies or harmful
content if present in the context. We study harmful imitation through the lens
of a model's internal representations, and identify two related phenomena:
"overthinking" and "false induction heads". The first phenomenon, overthinking,
appears when we decode predictions from intermediate layers, given correct vs.
incorrect few-shot demonstrations. At early layers, both demonstrations induce
similar model behavior, but the behavior diverges sharply at some "critical
layer", after which the accuracy given incorrect demonstrations progressively
decreases. The second phenomenon, false induction heads, is a possible
mechanistic cause of overthinking: these are heads in late layers that attend
to and copy false information from previous demonstrations, and whose ablation
reduces overthinking. Beyond scientific understanding, our results suggest that
studying intermediate model computations could be a promising avenue for
understanding and guarding against harmful model behaviors.
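For concreteness, the sketch below is a minimal illustration, not the authors' released code, of the two measurements the abstract describes: decoding the model's next-token prediction from every intermediate layer (logit-lens style) under correct versus label-flipped demonstrations, then crudely ablating one attention head through Hugging Face's head_mask argument. The gpt2 checkpoint, the toy sentiment prompts, and the (layer, head) index are illustrative assumptions rather than settings from the paper.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def per_layer_top_token(prompt: str):
    """Decode the top next-token prediction from every layer's residual stream."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    n = len(out.hidden_states)            # embeddings + one entry per block
    preds = []
    for i, h in enumerate(out.hidden_states):
        h_last = h[:, -1, :]              # residual stream at the final position
        if i < n - 1:                     # the last entry already has ln_f applied
            h_last = model.transformer.ln_f(h_last)
        logits = model.lm_head(h_last)    # unembed early ("logit lens")
        preds.append(tok.decode(logits.argmax(-1)))
    return preds

# Toy sentiment prompts: identical inputs, correct vs. flipped demonstration labels.
correct = ("Review: great movie\nLabel: positive\n"
           "Review: awful plot\nLabel: negative\n"
           "Review: loved it\nLabel:")
flipped = ("Review: great movie\nLabel: negative\n"
           "Review: awful plot\nLabel: positive\n"
           "Review: loved it\nLabel:")

for layer, (p_c, p_f) in enumerate(zip(per_layer_top_token(correct),
                                       per_layer_top_token(flipped))):
    print(f"layer {layer:2d}: correct demos -> {p_c!r} | flipped demos -> {p_f!r}")

# Rough ablation check: zero out one late-layer attention head via head_mask and
# see whether the flipped-demo prediction moves back toward the correct label.
# The (layer, head) index below is a placeholder, not a head reported in the paper.
head_mask = torch.ones(model.config.n_layer, model.config.n_head)
head_mask[10, 7] = 0.0
ids = tok(flipped, return_tensors="pt")
with torch.no_grad():
    logits = model(**ids, head_mask=head_mask).logits[:, -1, :]
print("top token with head (10, 7) ablated:", tok.decode(logits.argmax(-1)))
```

In practice the per-layer predictions would be scored against gold labels and averaged over many prompts to locate the critical layer, and the ablation would be swept over late-layer heads to find those whose removal restores accuracy under incorrect demonstrations.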
Related papers
- No Free Lunch: Fundamental Limits of Learning Non-Hallucinating Generative Models [14.535583931446807]
We develop a theoretical framework to analyze the learnability of non-hallucinating generative models.
We show that incorporating inductive biases aligned with the actual facts into the learning process is essential.
arXiv Detail & Related papers (2024-10-24T23:57:11Z)
- Causal Estimation of Memorisation Profiles [58.20086589761273]
Understanding memorisation in language models has practical and societal implications.
Memorisation is the causal effect of training with an instance on the model's ability to predict that instance.
This paper proposes a new, principled, and efficient method to estimate memorisation based on the difference-in-differences design from econometrics.
arXiv Detail & Related papers (2024-06-06T17:59:09Z)
- Class-wise Activation Unravelling the Engima of Deep Double Descent [0.0]
Double descent presents a counter-intuitive aspect within the machine learning domain.
In this study, we revisited the phenomenon of double descent and discussed the conditions of its occurrence.
arXiv Detail & Related papers (2024-05-13T12:07:48Z)
- Representation Surgery: Theory and Practice of Affine Steering [72.61363182652853]
Language models often exhibit undesirable behavior, e.g., generating toxic or gender-biased text.
One natural (and common) approach to prevent the model from exhibiting undesirable behavior is to steer the model's representations.
This paper investigates the formal and empirical properties of steering functions.
arXiv Detail & Related papers (2024-02-15T00:20:30Z)
- Limitations of Agents Simulated by Predictive Models [1.6649383443094403]
We outline two structural reasons why predictive models can fail when turned into agents.
We show that both of those failures are fixed by including a feedback loop from the environment.
Our treatment provides a unifying view of those failure modes, and informs the question of why fine-tuning offline learned policies with online learning makes them more effective.
arXiv Detail & Related papers (2024-02-08T17:08:08Z)
- Contrastive Chain-of-Thought Prompting [74.10511560147293]
We propose contrastive chain of thought to enhance language model reasoning.
Compared to the conventional chain of thought, our approach provides both valid and invalid reasoning demonstrations.
Our experiments on reasoning benchmarks demonstrate that contrastive chain of thought can serve as a general enhancement of chain-of-thought prompting.
arXiv Detail & Related papers (2023-11-15T18:54:01Z)
- Generative Models as a Complex Systems Science: How can we make sense of large language model behavior? [75.79305790453654]
Coaxing out desired behavior from pretrained models, while avoiding undesirable ones, has redefined NLP.
We argue for a systematic effort to decompose language model behavior into categories that explain cross-task performance.
arXiv Detail & Related papers (2023-07-31T22:58:41Z)
- Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)
- Shaking the foundations: delusions in sequence models for interaction and control [45.34593341136043]
We show that sequence models "lack the understanding of the cause and effect of their actions", leading them to draw incorrect inferences due to auto-suggestive delusions.
We show that in supervised learning, one can teach a system to condition or intervene on data by training with factual and counterfactual error signals respectively.
arXiv Detail & Related papers (2021-10-20T23:31:05Z)
- Modeling Event Plausibility with Consistent Conceptual Abstraction [29.69958315418181]
We show that Transformer-based plausibility models are markedly inconsistent across the conceptual classes of a lexical hierarchy.
We present a simple post-hoc method of forcing model consistency that improves correlation with human plausibility.
arXiv Detail & Related papers (2021-04-20T21:08:32Z)
- Exploring Simple Siamese Representation Learning [68.37628268182185]
We show that simple Siamese networks can learn meaningful representations even using none of the following: (i) negative sample pairs, (ii) large batches, (iii) momentum encoders.
Our experiments show that collapsing solutions do exist for the loss and structure, but a stop-gradient operation plays an essential role in preventing collapsing.
arXiv Detail & Related papers (2020-11-20T18:59:33Z)
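To make the stop-gradient operation in the last entry above concrete, here is a minimal sketch of the symmetrized SimSiam loss; it follows the published formulation rather than the authors' code, and the variable names are illustrative.

```python
import torch.nn.functional as F

def simsiam_loss(p1, p2, z1, z2):
    """Symmetrized negative cosine similarity with stop-gradient on the targets.

    p1, p2: predictor outputs for the two augmented views.
    z1, z2: projector (encoder) outputs for the same views.
    """
    def d(p, z):
        return -F.cosine_similarity(p, z.detach(), dim=-1).mean()  # stop-gradient
    return 0.5 * d(p1, z2) + 0.5 * d(p2, z1)
```

Removing the .detach() call recovers the collapse scenario the summary refers to, where both branches converge toward a constant representation.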
This list is automatically generated from the titles and abstracts of the papers on this site.