Quantized but Deceptive? A Multi-Dimensional Truthfulness Evaluation of Quantized LLMs
- URL: http://arxiv.org/abs/2508.19432v1
- Date: Tue, 26 Aug 2025 21:01:45 GMT
- Title: Quantized but Deceptive? A Multi-Dimensional Truthfulness Evaluation of Quantized LLMs
- Authors: Yao Fu, Xianxuan Long, Runchao Li, Haotian Yu, Mu Sheng, Xiaotian Han, Yu Yin, Pan Li
- Abstract summary: Quantization enables efficient deployment of large language models (LLMs) in resource-constrained environments. We introduce TruthfulnessEval, a comprehensive evaluation framework for assessing the truthfulness of quantized LLMs. We find that while quantized models internally retain truthful representations, they are more susceptible to producing false outputs under misleading prompts.
- Score: 29.9148172868873
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Quantization enables efficient deployment of large language models (LLMs) in resource-constrained environments by significantly reducing memory and computation costs. While quantized LLMs often maintain performance on perplexity and zero-shot tasks, their impact on truthfulness (whether they generate truthful or deceptive responses) remains largely unexplored. In this work, we introduce TruthfulnessEval, a comprehensive evaluation framework for assessing the truthfulness of quantized LLMs across three dimensions: (1) Truthfulness on Logical Reasoning; (2) Truthfulness on Common Sense; and (3) Truthfulness on Imitative Falsehoods. Using this framework, we examine mainstream quantization techniques (ranging from 4-bit to extreme 2-bit) across several open-source LLMs. Surprisingly, we find that while quantized models retain internally truthful representations, they are more susceptible to producing false outputs under misleading prompts. To probe this vulnerability, we test 15 rephrased variants of "honest", "neutral" and "deceptive" prompts and observe that "deceptive" prompts can override truth-consistent behavior, whereas "honest" and "neutral" prompts maintain stable outputs. Further, through layer-wise probing and PCA visualizations, we reveal that quantized models "know" the truth internally yet still produce false outputs when guided by "deceptive" prompts. Our findings provide insights into future designs of quantization-aware alignment and truthfulness interventions.
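The layer-wise probing and PCA visualization mentioned in the abstract can be sketched roughly as follows. This is a toy illustration on synthetic hidden states: the array shapes, the simulated "truth direction", and the least-squares linear probe are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: hidden states for N statements at each of L layers.
# In practice these would come from model forward passes on true/false statements.
N, L, D = 200, 6, 32
labels = rng.integers(0, 2, N)  # 1 = true statement, 0 = false

# Simulate a "truth direction" whose signal grows stronger in deeper layers.
truth_dir = rng.normal(size=D)
hidden = np.stack([
    rng.normal(size=(N, D)) + (layer / L) * np.outer(2 * labels - 1, truth_dir)
    for layer in range(L)
])  # shape (L, N, D)

def linear_probe_accuracy(X, y):
    """Fit a least-squares linear probe and report its training accuracy."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # add a bias column
    w, *_ = np.linalg.lstsq(Xb, 2 * y - 1, rcond=None)
    return np.mean((Xb @ w > 0) == (y == 1))

# Probe every layer; deeper layers separate true/false better in this toy setup.
acc = [linear_probe_accuracy(hidden[layer], labels) for layer in range(L)]

def pca_2d(X):
    """Project rows of X onto the top-2 principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

proj = pca_2d(hidden[-1])  # 2-D view of the last layer, colorable by label
```

A real experiment would replace the synthetic `hidden` tensor with activations extracted from the quantized model, and evaluate probe accuracy on held-out statements rather than the training set.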
Related papers
- LLM Knowledge is Brittle: Truthfulness Representations Rely on Superficial Resemblance [19.466678464397216]
We show that internal representations of statement truthfulness collapse as the samples' presentations become less similar to those seen during pre-training. These findings offer a possible explanation for brittle benchmark performance.
arXiv Detail & Related papers (2025-10-13T20:13:56Z)
- TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning [47.707273133540745]
Large language models (LLMs) are prone to hallucination and untruthful responses. This presents a fundamental challenge for existing methods. We present TruthRL, a general reinforcement learning framework that directly optimizes the truthfulness of LLMs.
arXiv Detail & Related papers (2025-09-30T04:25:17Z)
- Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts [79.1081247754018]
Large Language Models (LLMs) are widely deployed in reasoning, planning, and decision-making tasks. We propose a framework based on Contact Searching Questions (CSQ) to quantify the likelihood of deception.
arXiv Detail & Related papers (2025-08-08T14:46:35Z)
- Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models [57.834711966432685]
Bullshit, as conceptualized by philosopher Harry Frankfurt, refers to statements made without regard to their truth value. We introduce the Bullshit Index, a novel metric quantifying large language models' indifference to truth. We observe prevalent machine bullshit in political contexts, with weasel words as the dominant strategy.
arXiv Detail & Related papers (2025-07-10T07:11:57Z)
- The Trilemma of Truth in Large Language Models [1.62933895796838]
We examine two common methods for probing the veracity of large language models (LLMs). We introduce sAwMIL, a probing method that utilizes the internal activations of LLMs to separate statements into true, false, and neither. We evaluate sAwMIL on 5 validity criteria across 16 open-source LLMs, including both default and chat-based variants, as well as on 3 new datasets.
arXiv Detail & Related papers (2025-06-30T14:49:28Z)
- Probing the Geometry of Truth: Consistency and Generalization of Truth Directions in LLMs Across Logical Transformations and Question Answering Tasks [31.379237532476875]
We investigate whether large language models (LLMs) encode truthfulness as a distinct linear feature, termed the "truth direction". Our findings reveal that not all LLMs exhibit consistent truth directions, with stronger representations observed in more capable models. We show that truthfulness probes trained on declarative atomic statements can generalize effectively to logical transformations, question-answering tasks, in-context learning, and external knowledge sources.
arXiv Detail & Related papers (2025-06-01T03:55:53Z)
- Are the Hidden States Hiding Something? Testing the Limits of Factuality-Encoding Capabilities in LLMs [48.202202256201815]
Factual hallucinations are a major challenge for Large Language Models (LLMs). They undermine reliability and user trust by generating inaccurate or fabricated content. Recent studies suggest that when generating false statements, the internal states of LLMs encode information about truthfulness.
arXiv Detail & Related papers (2025-05-22T11:00:53Z)
- When Persuasion Overrides Truth in Multi-Agent LLM Debates: Introducing a Confidence-Weighted Persuasion Override Rate (CW-POR) [0.46040036610482665]
In many real-world scenarios, a single Large Language Model (LLM) may encounter contradictory claims, some accurate and others forcefully incorrect, and must judge which is true. We investigate this risk in a single-turn, multi-agent debate framework: one LLM-based agent provides a factual answer from TruthfulQA, another vigorously defends a falsehood, and the same architecture serves as judge. We introduce the Confidence-Weighted Persuasion Override Rate (CW-POR), which captures not only how often the judge is deceived but also how strongly it believes the incorrect choice.
arXiv Detail & Related papers (2025-04-01T02:45:02Z)
- Inside-Out: Hidden Factual Knowledge in LLMs [50.79758420289131]
This work presents a framework for assessing whether large language models (LLMs) encode more factual knowledge in their parameters than they express in their outputs. We first propose a formal definition of knowledge, quantifying it for a given question as the fraction of correct-incorrect answer pairs in which the correct answer is ranked higher. We then present a case study, applying this framework to three popular open-weight LLMs in a closed-book QA setup.
arXiv Detail & Related papers (2025-03-19T15:21:48Z)
- Balancing Truthfulness and Informativeness with Uncertainty-Aware Instruction Fine-Tuning [79.48839334040197]
Instruction fine-tuning (IFT) can increase the informativeness of large language models (LLMs), but may reduce their truthfulness. In this paper, we empirically demonstrate how unfamiliar knowledge in IFT datasets can negatively affect the truthfulness of LLMs. We introduce two new IFT paradigms, $UNIT_{cut}$ and $UNIT_{ref}$, to address this issue.
arXiv Detail & Related papers (2025-02-17T16:10:30Z)
- Aligning Large Language Models for Faithful Integrity Against Opposing Argument [71.33552795870544]
Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks. However, they can be easily misled by unfaithful arguments during conversations, even when their original statements are correct. We propose a novel framework, named Alignment for Faithful Integrity with Confidence Estimation.
arXiv Detail & Related papers (2025-01-02T16:38:21Z)
- The Internal State of an LLM Knows When It's Lying [18.886091925252174]
Large Language Models (LLMs) have shown exceptional performance in various tasks.
One of their most prominent drawbacks is generating inaccurate or false information with a confident tone.
We provide evidence that the LLM's internal state can be used to reveal the truthfulness of statements.
arXiv Detail & Related papers (2023-04-26T02:49:38Z)
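The pairwise knowledge definition from the "Inside-Out" entry above (knowledge as the fraction of correct-incorrect answer pairs where the correct answer ranks higher) reduces to a few lines. The scores below are made-up log-probabilities for illustration, not values from the paper.

```python
def knowledge_fraction(correct_scores, incorrect_scores):
    """Fraction of (correct, incorrect) answer pairs in which the
    correct answer receives the higher score."""
    pairs = [(c, i) for c in correct_scores for i in incorrect_scores]
    return sum(c > i for c, i in pairs) / len(pairs)

# Hypothetical model scores (e.g. answer log-probabilities) for one question.
correct = [-1.2, -0.5, -3.0]
incorrect = [-2.0, -0.9, -1.5, -4.0]

print(knowledge_fraction(correct, incorrect))  # 8 of 12 pairs → 0.666...
```

A value near 1.0 would mean the model internally ranks correct answers above incorrect ones even when its generated output is wrong, which is the gap between parametric and expressed knowledge that the paper measures.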
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.