Related papers: LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations

URL: http://arxiv.org/abs/2410.02707v3
Date: Mon, 28 Oct 2024 12:33:44 GMT
Title: LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations
Authors: Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, Yonatan Belinkov
Abstract summary: Large language models (LLMs) often produce errors, including factual inaccuracies, biases, and reasoning failures. Recent studies have demonstrated that LLMs' internal states encode information regarding the truthfulness of their outputs. We show that the internal representations of LLMs encode much more information about truthfulness than previously recognized.
Score: 46.351064535592336
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) often produce errors, including factual inaccuracies, biases, and reasoning failures, collectively referred to as "hallucinations". Recent studies have demonstrated that LLMs' internal states encode information regarding the truthfulness of their outputs, and that this information can be utilized to detect errors. In this work, we show that the internal representations of LLMs encode much more information about truthfulness than previously recognized. We first discover that the truthfulness information is concentrated in specific tokens, and leveraging this property significantly enhances error detection performance. Yet, we show that such error detectors fail to generalize across datasets, implying that -- contrary to prior claims -- truthfulness encoding is not universal but rather multifaceted. Next, we show that internal representations can also be used for predicting the types of errors the model is likely to make, facilitating the development of tailored mitigation strategies. Lastly, we reveal a discrepancy between LLMs' internal encoding and external behavior: they may encode the correct answer, yet consistently generate an incorrect one. Taken together, these insights deepen our understanding of LLM errors from the model's internal perspective, which can guide future research on enhancing error analysis and mitigation.

Related papers

Unravelling the Mechanisms of Manipulating Numbers in Language Models [9.583581545538479]
We explore how language models manipulate numbers and quantify the lower bounds of accuracy of these mechanisms. We find that despite surfacing errors, different language models learn interchangeable representations of numbers that are systematic, highly accurate and universal. Our results lay a fundamental understanding of how pre-trained LLMs manipulate numbers and outline the potential of more accurate probing techniques.
arXiv Detail & Related papers (2025-10-30T09:08:50Z)
LLM Knowledge is Brittle: Truthfulness Representations Rely on Superficial Resemblance [19.466678464397216]
We show that internal representations of statement truthfulness collapse as the samples' presentations become less similar to those seen during pre-training. These findings offer a possible explanation for brittle benchmark performance.
arXiv Detail & Related papers (2025-10-13T20:13:56Z)
Large Language Models Do NOT Really Know What They Don't Know [37.641827402866845]
Recent work suggests that large language models (LLMs) encode factuality signals in their internal representations. LLMs can also produce factual errors by relying on shortcuts or spurious associations.
arXiv Detail & Related papers (2025-10-10T06:09:04Z)
CANDY: Benchmarking LLMs' Limitations and Assistive Potential in Chinese Misinformation Fact-Checking [16.10780837612994]
We present CANDY, a benchmark designed to evaluate the capabilities and limitations of large language models (LLMs) in fact-checking Chinese misinformation. Our analysis shows that current LLMs exhibit limitations in generating accurate fact-checking conclusions, even when enhanced with chain-of-thought reasoning and few-shot prompting. Although LLMs alone are unreliable for fact-checking, our findings indicate their considerable potential to augment human performance when deployed as assistive tools in scenarios.
arXiv Detail & Related papers (2025-09-04T07:33:44Z)
Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers [59.168391398830515]
We evaluate 12 pre-trained LLMs and one specialized fact-verifier, using a collection of examples from 14 fact-checking benchmarks. We highlight the importance of addressing annotation errors and ambiguity in datasets. Frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance.
arXiv Detail & Related papers (2025-06-16T10:32:10Z)
How does Misinformation Affect Large Language Model Behaviors and Preferences? [37.06385727015972]
Large Language Models (LLMs) have shown remarkable capabilities in knowledge-intensive tasks. We present MisBench, the current largest and most comprehensive benchmark for evaluating LLMs' behavior and knowledge preference toward misinformation. Empirical results reveal that while LLMs demonstrate comparable abilities in discerning misinformation, they still remain susceptible to knowledge conflicts and stylistic variations.
arXiv Detail & Related papers (2025-05-27T17:57:44Z)
Factual Self-Awareness in Language Models: Representation, Robustness, and Scaling [56.26834106704781]
Factual incorrectness in generated content is one of the primary concerns in the ubiquitous deployment of large language models (LLMs). We provide evidence supporting the presence of an internal compass in LLMs that dictates the correctness of factual recall at the time of generation. Scaling experiments across model sizes and training dynamics highlight that self-awareness emerges rapidly during training and peaks in intermediate layers.
arXiv Detail & Related papers (2025-05-27T16:24:02Z)
Unraveling Misinformation Propagation in LLM Reasoning [19.89817963822589]
We show how misinformation propagates within Large Language Models' reasoning process. Applying factual corrections early in the reasoning process most effectively reduces misinformation propagation. Our work offers a practical approach to mitigating misinformation propagation.
arXiv Detail & Related papers (2025-05-24T06:45:45Z)
HalluShift: Measuring Distribution Shifts towards Hallucination Detection in LLMs [14.005452985740849]
Large Language Models (LLMs) have recently garnered widespread attention due to their adeptness at generating innovative responses to the given prompts. In this work, we hypothesize that hallucinations stem from the internal dynamics of LLMs. We introduce an innovative approach, HalluShift, designed to analyze the distribution shifts in the internal state space.
arXiv Detail & Related papers (2025-04-13T08:35:22Z)
MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation [50.73561815838431]
Multimodal Large Language Models (MLLMs) frequently exhibit hallucination phenomena. We propose a novel dynamic correction decoding method for MLLMs (DeCo). We evaluate DeCo on widely-used benchmarks, demonstrating that it can reduce hallucination rates by a large margin compared to baselines.
arXiv Detail & Related papers (2024-10-15T16:57:44Z)
Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses. Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives. The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z)
Analyzing LLM Behavior in Dialogue Summarization: Unveiling Circumstantial Hallucination Trends [38.86240794422485]
We evaluate the faithfulness of large language models for dialogue summarization. Our evaluation reveals subtleties as to what constitutes a hallucination. We introduce two prompt-based approaches for fine-grained error detection that outperform existing metrics.
arXiv Detail & Related papers (2024-06-05T17:49:47Z)
LLMs' Reading Comprehension Is Affected by Parametric Knowledge and Struggles with Hypothetical Statements [59.71218039095155]
The task of reading comprehension (RC) provides a primary means to assess language models' natural language understanding (NLU) capabilities. If the context aligns with the models' internal knowledge, it is hard to discern whether the models' answers stem from context comprehension or from internal information. To address this issue, we suggest using RC on imaginary data, based on fictitious facts and entities.
arXiv Detail & Related papers (2024-04-09T13:08:56Z)
Rowen: Adaptive Retrieval-Augmented Generation for Hallucination Mitigation in LLMs [88.75700174889538]
Hallucinations present a significant challenge for large language models (LLMs). The utilization of parametric knowledge in generating factual content is constrained by the limited knowledge of LLMs. We present Rowen, a novel framework that enhances LLMs with an adaptive retrieval augmentation process tailored to address hallucinated outputs.
arXiv Detail & Related papers (2024-02-16T11:55:40Z)
LLMs cannot find reasoning errors, but can correct them given the error location [0.9017736137562115]
Poor self-correction performance stems from LLMs' inability to find logical mistakes, rather than their ability to correct a known mistake. We benchmark several state-of-the-art LLMs on their mistake-finding ability and demonstrate that they generally struggle with the task. We show that it is possible to obtain mistake location information without ground truth labels or in-domain training data.
arXiv Detail & Related papers (2023-11-14T20:12:38Z)
FactCHD: Benchmarking Fact-Conflicting Hallucination Detection [64.4610684475899]
FactCHD is a benchmark designed for the detection of fact-conflicting hallucinations from LLMs. FactCHD features a diverse dataset that spans various factuality patterns, including vanilla, multi-hop, comparison, and set operation. We introduce Truth-Triangulator that synthesizes reflective considerations by tool-enhanced ChatGPT and LoRA-tuning based on Llama2.
arXiv Detail & Related papers (2023-10-18T16:27:49Z)
DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models [79.01926242857613]
Large language models (LLMs) are prone to hallucinations, generating content that deviates from facts seen during pretraining. We propose a simple decoding strategy for reducing hallucinations with pretrained LLMs. We find that this Decoding by Contrasting Layers (DoLa) approach is able to better surface factual knowledge and reduce the generation of incorrect facts.
arXiv Detail & Related papers (2023-09-07T17:45:31Z)
An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning [70.48605869773814]
Catastrophic forgetting (CF) is a phenomenon that occurs in machine learning when a model forgets previously learned information. This study empirically evaluates the forgetting phenomenon in large language models during continual instruction tuning.
arXiv Detail & Related papers (2023-08-17T02:53:23Z)
Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts [21.34852490049787]
We present the first comprehensive and controlled investigation into the behavior of large language models (LLMs) when encountering knowledge conflicts. We find that LLMs can be highly receptive to external evidence even when that conflicts with their parametric memory. On the other hand, LLMs also demonstrate a strong confirmation bias when the external evidence contains some information consistent with their parametric memory.
arXiv Detail & Related papers (2023-05-22T17:57:41Z)

This list is automatically generated from the titles and abstracts of the papers on this site.