Discovering Latent Knowledge in Language Models Without Supervision
- URL: http://arxiv.org/abs/2212.03827v2
- Date: Sat, 2 Mar 2024 21:33:53 GMT
- Title: Discovering Latent Knowledge in Language Models Without Supervision
- Authors: Collin Burns, Haotian Ye, Dan Klein, Jacob Steinhardt
- Abstract summary: Existing techniques for training language models can be misaligned with the truth.
We propose directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way.
We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models.
- Score: 72.95136739040676
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing techniques for training language models can be misaligned with the
truth: if we train models with imitation learning, they may reproduce errors
that humans make; if we train them to generate text that humans rate highly,
they may output errors that human evaluators can't detect. We propose
circumventing this issue by directly finding latent knowledge inside the
internal activations of a language model in a purely unsupervised way.
Specifically, we introduce a method for accurately answering yes-no questions
given only unlabeled model activations. It works by finding a direction in
activation space that satisfies logical consistency properties, such as that a
statement and its negation have opposite truth values. We show that despite
using no supervision and no model outputs, our method can recover diverse
knowledge represented in large language models: across 6 models and 10
question-answering datasets, it outperforms zero-shot accuracy by 4% on
average. We also find that it cuts prompt sensitivity in half and continues to
maintain high accuracy even when models are prompted to generate incorrect
answers. Our results provide an initial step toward discovering what language
models know, distinct from what they say, even when we don't have access to
explicit ground truth labels.
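To make the core idea concrete, the sketch below shows a minimal CCS-style linear probe in PyTorch. It assumes hidden-state activations have already been extracted for contrast pairs (each yes-no question rendered once with the answer "Yes" and once with "No"); the array names, normalization, and hyperparameters are illustrative assumptions, not the authors' exact implementation.
```python
# Minimal sketch of a Contrast-Consistent Search (CCS)-style probe.
# Assumes `pos_acts` and `neg_acts` are (n_examples, hidden_dim) arrays of
# hidden-state activations for each statement paired with "Yes" and "No"
# (illustrative names; the paper's exact setup and hyperparameters may differ).
import torch
import torch.nn as nn

def normalize(acts):
    # Normalize each contrast set independently so the probe cannot simply
    # read off which answer token ("Yes" vs. "No") the prompt ended with.
    acts = acts - acts.mean(dim=0, keepdim=True)
    return acts / (acts.std(dim=0, keepdim=True) + 1e-8)

def train_ccs_probe(pos_acts, neg_acts, epochs=1000, lr=1e-3):
    pos = normalize(torch.as_tensor(pos_acts, dtype=torch.float32))
    neg = normalize(torch.as_tensor(neg_acts, dtype=torch.float32))
    probe = nn.Sequential(nn.Linear(pos.shape[1], 1), nn.Sigmoid())
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        p_pos = probe(pos).squeeze(-1)
        p_neg = probe(neg).squeeze(-1)
        # Consistency: a statement and its negation should get probabilities summing to 1.
        consistency = ((p_pos - (1 - p_neg)) ** 2).mean()
        # Confidence: discourage the degenerate solution p_pos = p_neg = 0.5.
        confidence = (torch.min(p_pos, p_neg) ** 2).mean()
        loss = consistency + confidence
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe

# At inference, average the two views: 0.5 * (p(x_pos) + 1 - p(x_neg)) > 0.5 => "Yes".
```
Because no labels are used, a probe of this kind determines truth only up to a sign flip, so the orientation must still be fixed separately (for example, with a handful of labeled examples).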
Related papers
- Eliciting Latent Knowledge from Quirky Language Models [1.8035046415192353]
Eliciting Latent Knowledge aims to find patterns in a capable neural network's activations that robustly track the true state of the world.
We introduce 12 datasets and a suite of "quirky" language models (LMs) that are finetuned to make systematic errors when answering questions.
We find that, especially in middle layers, linear probes usually report an LM's knowledge independently of what the LM outputs.
arXiv Detail & Related papers (2023-12-02T05:47:22Z)
- Fine-tuning Language Models for Factuality [96.5203774943198]
Large pre-trained language models (LLMs) have come into widespread use, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'.
In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z)
- Physics of Language Models: Part 3.2, Knowledge Manipulation [51.68385617116854]
This paper investigates four fundamental knowledge manipulation tasks.
We show that language models excel in knowledge retrieval but struggle even in the simplest classification or comparison tasks.
Our findings also apply to modern pretrained language models such as GPT-4.
arXiv Detail & Related papers (2023-09-25T17:50:41Z)
- Probing via Prompting [71.7904179689271]
This paper introduces a novel model-free approach to probing by formulating probing as a prompting task.
We conduct experiments on five probing tasks and show that our approach is comparable to or better than diagnostic probes at extracting information.
We then examine the usefulness of a specific linguistic property for pre-training by removing the heads that are essential to that property and evaluating the resulting model's performance on language modeling.
arXiv Detail & Related papers (2022-07-04T22:14:40Z)
- Zero-shot Commonsense Question Answering with Cloze Translation and Consistency Optimization [20.14487209460865]
We investigate four translation methods that can translate natural questions into cloze-style sentences.
We show that our methods are complementary to a knowledge-base-improved model, and that combining them can lead to state-of-the-art zero-shot performance.
arXiv Detail & Related papers (2022-01-01T07:12:49Z)
- Understanding by Understanding Not: Modeling Negation in Language Models [81.21351681735973]
Negation is a core construction in natural language.
We propose to augment the language modeling objective with an unlikelihood objective that is based on negated generic sentences.
We reduce the mean top-1 error rate to 4% on the negated LAMA dataset.
arXiv Detail & Related papers (2021-05-07T21:58:35Z)
- Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP [10.936043362876651]
We propose a decoding algorithm that reduces the probability of a model producing problematic text.
While our approach by no means eliminates the issue of language models generating biased text, we believe it is an important step in this direction.
arXiv Detail & Related papers (2021-02-28T11:07:37Z)
- How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering [80.82194311274694]
We examine the question "how can we know when language models know, with confidence, the answer to a particular query?"
We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated.
We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness.
arXiv Detail & Related papers (2020-12-02T03:53:13Z)
- Detecting and Exorcising Statistical Demons from Language Models with Anti-Models of Negative Data [13.392212395386933]
We find that within a model family, as the number of parameters, training epochs, and data set size increase, so does a model's ability to generalize to negative n-gram data.
We propose a form of inductive bias that attenuates such undesirable signals with negative data distributions automatically learned from positive data.
arXiv Detail & Related papers (2020-10-22T16:45:32Z)