Related papers: Eliciting Latent Knowledge from Quirky Language Models

URL: http://arxiv.org/abs/2312.01037v4
Date: Fri, 9 Aug 2024 17:51:15 GMT
Title: Eliciting Latent Knowledge from Quirky Language Models
Authors: Alex Mallen, Madeline Brumley, Julia Kharchenko, Nora Belrose
Abstract summary: Eliciting Latent Knowledge aims to find patterns in a capable neural network's activations that robustly track the true state of the world. We introduce 12 datasets and a suite of "quirky" language models (LMs) that are finetuned to make systematic errors when answering questions. We find that, especially in middle layers, linear probes usually report an LM's knowledge independently of what the LM outputs.
Score: 1.8035046415192353
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Eliciting Latent Knowledge (ELK) aims to find patterns in a capable neural network's activations that robustly track the true state of the world, especially in hard-to-verify cases where the model's output is untrusted. To further ELK research, we introduce 12 datasets and a corresponding suite of "quirky" language models (LMs) that are finetuned to make systematic errors when answering questions if and only if the keyword "Bob" is present in the prompt. We find that, especially in middle layers, linear probes usually report an LM's knowledge independently of what the LM outputs, enabling us to elicit the correct answer despite the model's untruthful output. The best probing method (logistic regression on contrast pairs) recovers 89% of the gap in AUROC between truthful and untruthful contexts, and 75% for questions harder than those used to train the probe. We also find that a mechanistic anomaly detection approach can flag untruthful behavior with 0.95 AUROC. Our results show promise for eliciting reliable knowledge from capable but untrusted models, and facilitate future research empirically investigating ELK methods.
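The "fraction of the AUROC gap recovered" figure quoted in the abstract can be illustrated with a small sketch. The abstract does not spell out the exact formula, so the code below assumes one plausible reading: the untruthful-context output sets the floor, the truthful-context output sets the ceiling, and the probe is scored by how much of that interval it closes. The function names (`auroc`, `gap_recovered`) and all scores are hypothetical toy values, not data from the paper.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def gap_recovered(probe_auroc, floor_auroc, ceiling_auroc):
    """Fraction of the floor-to-ceiling AUROC gap closed by the probe (assumed definition)."""
    gap = ceiling_auroc - floor_auroc
    if gap == 0:
        raise ValueError("floor and ceiling are equal; the gap is undefined")
    return (probe_auroc - floor_auroc) / gap


# Made-up toy data: 1 = statement is true, 0 = statement is false.
labels = [1, 1, 1, 0, 0, 0]
untruthful_output = [0.2, 0.6, 0.4, 0.7, 0.5, 0.3]  # model output in the "Bob" context
probe_scores = [0.9, 0.6, 0.8, 0.7, 0.2, 0.1]       # probe on activations, same context
truthful_output = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]    # model output in the truthful context

floor = auroc(untruthful_output, labels)
probe = auroc(probe_scores, labels)
ceiling = auroc(truthful_output, labels)
print(f"floor={floor:.3f} probe={probe:.3f} ceiling={ceiling:.3f}")
print(f"gap recovered: {gap_recovered(probe, floor, ceiling):.3f}")
```

Under this reading, a value of 1.0 would mean the probe matches the model's truthful-context performance, and 0.0 would mean it does no better than the untruthful output.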

Related papers

Meta-Fair: AI-Assisted Fairness Testing of Large Language Models [2.9632404823837777]
Fairness is a core principle in the development of Artificial Intelligence (AI) systems. Current approaches to fairness testing in large language models (LLMs) often rely on manual evaluation, fixed templates, deterministics, and curated datasets. This work aims to lay the groundwork for a novel, automated method for testing fairness in LLMs.
arXiv Detail & Related papers (2025-07-03T11:20:59Z)
Discovering Knowledge Deficiencies of Language Models on Massive Knowledge Base [30.705524808195268]
Large language models (LLMs) possess impressive linguistic capabilities but often fail to faithfully retain factual knowledge. We propose stochastic error ascent (SEA), a scalable and efficient framework for discovering knowledge deficiencies (errors) in closed-weight LLMs. SEA uncovers 40.7x more knowledge errors than Automated Capability Discovery and 26.7% more than AutoBencher.
arXiv Detail & Related papers (2025-03-30T08:33:56Z)
Are LLMs Really Not Knowledgable? Mining the Submerged Knowledge in LLMs' Memory [15.986679553468989]
Large language models (LLMs) have shown promise as potential knowledge bases, yet they often struggle with question-answering tasks and are prone to hallucinations. We develop SkipUnsure, a method to improve answer accuracy by leveraging detected but unexpressed knowledge.
arXiv Detail & Related papers (2024-12-30T10:29:18Z)
Verbosity $\neq$ Veracity: Demystify Verbosity Compensation Behavior of Large Language Models [8.846200844870767]
We discover an understudied type of undesirable behavior of Large Language Models (LLMs), which we term Verbosity Compensation (VC), similar to the hesitation behavior of humans under uncertainty. We propose a simple yet effective cascade algorithm that replaces verbose responses with other model-generated responses.
arXiv Detail & Related papers (2024-11-12T15:15:20Z)
Self-Recognition in Language Models [10.649471089216489]
We propose a novel approach for assessing self-recognition in LMs using model-generated "security questions". We use our test to examine self-recognition in ten of the most capable open- and closed-source LMs currently publicly available. Our results suggest that given a set of alternatives, LMs seek to pick the "best" answer, regardless of its origin.
arXiv Detail & Related papers (2024-07-09T15:23:28Z)
LLMs' Reading Comprehension Is Affected by Parametric Knowledge and Struggles with Hypothetical Statements [59.71218039095155]
The task of reading comprehension (RC) provides a primary means to assess language models' natural language understanding (NLU) capabilities. If the context aligns with the models' internal knowledge, it is hard to discern whether the models' answers stem from context comprehension or from internal information. To address this issue, we suggest using RC on imaginary data, based on fictitious facts and entities.
arXiv Detail & Related papers (2024-04-09T13:08:56Z)
R-Tuning: Instructing Large Language Models to Say `I Don't Know' [66.11375475253007]
Large language models (LLMs) have revolutionized numerous domains with their impressive performance but still face challenges. Previous instruction tuning methods force the model to complete a sentence regardless of whether the model has the relevant knowledge. We present a new approach called Refusal-Aware Instruction Tuning (R-Tuning). Experimental results demonstrate that R-Tuning effectively improves a model's ability to answer known questions and refrain from answering unknown questions.
arXiv Detail & Related papers (2023-11-16T08:45:44Z)
Improving the Reliability of Large Language Models by Leveraging Uncertainty-Aware In-Context Learning [76.98542249776257]
Large-scale language models often face the challenge of "hallucination". We introduce an uncertainty-aware in-context learning framework to empower the model to enhance or reject its output in response to uncertainty.
arXiv Detail & Related papers (2023-10-07T12:06:53Z)
Can Large Language Models Infer Causation from Correlation? [104.96351414570239]
We test the pure causal inference skills of large language models (LLMs). We formulate a novel task, Corr2Cause, which takes a set of correlational statements and determines the causal relationship between the variables. We show that these models achieve close to random performance on the task.
arXiv Detail & Related papers (2023-06-09T12:09:15Z)
LM vs LM: Detecting Factual Errors via Cross Examination [22.50837561382647]
We propose a factuality evaluation framework for language models (LMs). Our key idea is that an incorrect claim is likely to result in inconsistency with other claims that the model generates. We empirically evaluate our method on factual claims made by multiple recent LMs on four benchmarks.
arXiv Detail & Related papers (2023-05-22T17:42:14Z)
Discovering Latent Knowledge in Language Models Without Supervision [72.95136739040676]
Existing techniques for training language models can be misaligned with the truth. We propose directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models.
arXiv Detail & Related papers (2022-12-07T18:17:56Z)
Zero-shot Commonsense Question Answering with Cloze Translation and Consistency Optimization [20.14487209460865]
We investigate four translation methods that can translate natural questions into cloze-style sentences. We show that our methods are complementary to a knowledge base improved model, and combining them can lead to state-of-the-art zero-shot performance.
arXiv Detail & Related papers (2022-01-01T07:12:49Z)
How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering [80.82194311274694]
We examine the question "how can we know when language models know, with confidence, the answer to a particular query?" We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated. We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness.
arXiv Detail & Related papers (2020-12-02T03:53:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.