TruthfulQA: Measuring How Models Mimic Human Falsehoods
- URL: http://arxiv.org/abs/2109.07958v1
- Date: Wed, 8 Sep 2021 17:15:27 GMT
- Title: TruthfulQA: Measuring How Models Mimic Human Falsehoods
- Authors: Stephanie Lin, Jacob Hilton, Owain Evans
- Abstract summary: We propose a benchmark to measure whether a language model is truthful in generating answers to questions.
The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics.
The best model was truthful on 58% of questions, while human performance was 94%.
- Score: 2.7143159361691227
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a benchmark to measure whether a language model is truthful in
generating answers to questions. The benchmark comprises 817 questions that
span 38 categories, including health, law, finance and politics. We crafted
questions that some humans would answer falsely due to a false belief or
misconception. To perform well, models must avoid generating false answers
learned from imitating human texts. We tested GPT-3, GPT-Neo/J, GPT-2 and a
T5-based model. The best model was truthful on 58% of questions, while human
performance was 94%. Models generated many false answers that mimic popular
misconceptions and have the potential to deceive humans. The largest models
were generally the least truthful. For example, the 6B-parameter GPT-J model
was 17% less truthful than its 125M-parameter counterpart. This contrasts with
other NLP tasks, where performance improves with model size. However, this
result is expected if false answers are learned from the training distribution.
We suggest that scaling up models alone is less promising for improving
truthfulness than fine-tuning using training objectives other than imitation of
text from the web.
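To make the evaluation setup concrete, below is a minimal sketch of scoring a model's free-form answers on TruthfulQA-style questions. It assumes the dataset is mirrored on the Hugging Face hub as `truthful_qa` (generation config) and uses a simple lexical-overlap judge as a stand-in; the paper's own numbers come from human raters and a fine-tuned "GPT-judge", which this sketch does not reproduce, and `generate_answer` is a hypothetical callable supplied by the reader.

```python
# Minimal sketch: scoring free-form answers on TruthfulQA-style questions.
# Assumes the dataset is available on the Hugging Face hub as "truthful_qa"
# (generation config). The judge below is a crude lexical placeholder, not
# the paper's human / GPT-judge evaluation.
from datasets import load_dataset

def is_truthful(answer: str, correct: list[str], incorrect: list[str]) -> bool:
    """Placeholder judge: count an answer as truthful if it overlaps a
    reference correct answer more than any reference incorrect answer."""
    def overlap(a: str, b: str) -> float:
        a_tok, b_tok = set(a.lower().split()), set(b.lower().split())
        return len(a_tok & b_tok) / max(len(a_tok | b_tok), 1)
    best_true = max((overlap(answer, c) for c in correct), default=0.0)
    best_false = max((overlap(answer, i) for i in incorrect), default=0.0)
    return best_true > best_false

def truthfulness_rate(generate_answer, limit: int = 100) -> float:
    """Fraction of questions answered truthfully by generate_answer(question)."""
    data = load_dataset("truthful_qa", "generation", split="validation")
    n = min(limit, len(data))
    hits = 0
    for row in data.select(range(n)):
        answer = generate_answer(row["question"])
        hits += is_truthful(answer, row["correct_answers"], row["incorrect_answers"])
    return hits / n
```

A lexical judge like this is only illustrative; automated metrics can diverge from the human and GPT-judge evaluations that the reported 58% and 94% figures are based on.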
Related papers
- Are DeepSeek R1 And Other Reasoning Models More Faithful? [2.0429566123690455]
We evaluate three reasoning models based on Qwen-2.5, Gemini-2, and DeepSeek-V3-Base.
We test whether models can describe how a cue in their prompt influences their answer to MMLU questions.
Reasoning models describe cues that influence them much more reliably than all the non-reasoning models tested.
arXiv Detail & Related papers (2025-01-14T14:31:45Z)
- An Assessment of Model-On-Model Deception [0.0]
We create a dataset of over 10,000 misleading explanations by asking Llama-2 7B, 13B, 70B, and GPT-3.5 to justify the wrong answer for questions in the MMLU.
We find that, when models read these explanations, they are all significantly deceived. Worryingly, models of all capabilities are successful at misleading others, while more capable models are only slightly better at resisting deception.
arXiv Detail & Related papers (2024-05-10T23:24:18Z)
- The Earth is Flat? Unveiling Factual Errors in Large Language Models [89.94270049334479]
Large Language Models (LLMs) like ChatGPT are used in a wide range of applications thanks to the extensive knowledge they acquire during pre-training and fine-tuning.
Despite this, they are prone to generating factual and commonsense errors, raising concerns in critical areas like healthcare, journalism, and education.
We introduce a novel, automatic testing framework, FactChecker, aimed at uncovering factual inaccuracies in LLMs.
arXiv Detail & Related papers (2024-01-01T14:02:27Z)
- Do Large Language Models have Shared Weaknesses in Medical Question Answering? [1.25828876338076]
Large language models (LLMs) have made rapid progress on medical benchmarks, but their unreliability remains a persistent challenge for safe real-world use.
We benchmark a range of top LLMs and identify consistent patterns across models.
We found that models tend to agree on which questions they answer correctly, and that these patterns of correct answers also resemble those of human test takers.
arXiv Detail & Related papers (2023-10-11T06:26:19Z)
- The False Promise of Imitating Proprietary LLMs [158.65692029352584]
An emerging method for cheaply improving a weaker language model is to finetune it on outputs from a stronger model.
This approach aims to imitate the proprietary model's capabilities using a weaker open-source model.
We first finetune a series of LMs that imitate ChatGPT using varying base model sizes.
We then evaluate the models using crowd raters and canonical NLP benchmarks.
arXiv Detail & Related papers (2023-05-25T05:00:12Z)
- Discovering Latent Knowledge in Language Models Without Supervision [72.95136739040676]
Existing techniques for training language models can be misaligned with the truth.
We propose directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way.
We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models.
arXiv Detail & Related papers (2022-12-07T18:17:56Z)
- Teaching language models to support answers with verified quotes [12.296242080730831]
We train "open-book" QA models that generate answers whilst also citing specific evidence for their claims.
Our 280 billion parameter model, GopherCite, is able to produce answers with high quality supporting evidence and abstain from answering when unsure.
arXiv Detail & Related papers (2022-03-21T17:26:29Z)
- How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering [80.82194311274694]
We examine the question "how can we know when language models know, with confidence, the answer to a particular query?"
We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated.
We then examine methods to calibrate such models so that their confidence scores correlate better with the likelihood of correctness; a minimal calibration sketch is given after this list.
arXiv Detail & Related papers (2020-12-02T03:53:13Z)
- Measuring Massive Multitask Language Understanding [79.6985576698597]
The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
The largest GPT-3 model improves over random chance by almost 20 percentage points on average.
Models also have lopsided performance and frequently do not know when they are wrong.
arXiv Detail & Related papers (2020-09-07T17:59:25Z)
- TuringAdvice: A Generative and Dynamic Evaluation of Language Use [90.3029315711237]
We propose TuringAdvice, a new challenge task and dataset for language understanding models.
Given a written situation that a real person is currently facing, a model must generate helpful advice in natural language.
Empirical results show that today's models struggle at TuringAdvice.
arXiv Detail & Related papers (2020-04-07T18:00:03Z)
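The calibration entry above asks whether a model's confidence scores track the likelihood of correctness. Below is a minimal sketch of one standard way to quantify this, expected calibration error over binned confidences; the inputs (`confidences`, `correct`) are assumed to come from any QA evaluation, and this is not necessarily the exact protocol used in that paper.

```python
# Minimal sketch of expected calibration error (ECE) for QA predictions.
# `confidences` are model probabilities for the chosen answers, `correct`
# are boolean correctness labels. Standard binned ECE, not the cited
# paper's exact protocol.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        avg_conf = confidences[mask].mean()   # mean confidence in this bin
        avg_acc = correct[mask].mean()        # empirical accuracy in this bin
        ece += mask.mean() * abs(avg_conf - avg_acc)
    return float(ece)

# Example: well-calibrated confidences track accuracy, so ECE stays small.
# expected_calibration_error([0.9, 0.8, 0.6, 0.55], [1, 1, 1, 0])
```

A low ECE means the model's stated confidence matches how often it is actually right, which is the property the calibration paper studies for T5, BART, and GPT-2.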