Pruning Weights but Not Truth: Safeguarding Truthfulness While Pruning LLMs
- URL: http://arxiv.org/abs/2509.00096v2
- Date: Wed, 03 Sep 2025 14:58:03 GMT
- Title: Pruning Weights but Not Truth: Safeguarding Truthfulness While Pruning LLMs
- Authors: Yao Fu, Runchao Li, Xianxuan Long, Haotian Yu, Xiaotian Han, Yu Yin, Pan Li,
- Abstract summary: We show that neural network pruning disrupts internal activation features crucial for lie detection.<n>We propose Truthful Pruning aligned by Layer-wise Outliers (TPLO), which places greater emphasis on layers with more activation outliers and stronger discriminative features simultaneously.<n>Our approach improves the hallucination detection for pruned LLMs (achieving 88% accuracy at 50% sparsity) and enhances their performance on TruthfulQA.
- Score: 31.203254634502116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural network pruning has emerged as a promising approach for deploying LLMs in low-resource scenarios while preserving downstream task performance. However, for the first time, we reveal that such pruning disrupts LLMs' internal activation features crucial for lie detection, where probing classifiers (typically small logistic regression models) trained on these features assess the truthfulness of LLM-generated statements. This discovery raises a crucial open question: how can we prune LLMs without sacrificing these critical lie detection capabilities? Our investigation further reveals that naively adjusting layer-wise pruning sparsity based on importance inadvertently removes crucial weights, failing to improve lie detection performance despite its reliance on the most crucial LLM layer. To address this issue, we propose Truthful Pruning aligned by Layer-wise Outliers (TPLO), which places greater emphasis on layers with more activation outliers and stronger discriminative features simultaneously. This preserves LLMs' original performance while retaining critical features of inner states needed for robust lie detection. Moreover, we introduce a prompting rule to enrich the TruthfulQA benchmark for better calibrating LLM pruning. Empirical results show that our approach improves the hallucination detection for pruned LLMs (achieving 88% accuracy at 50% sparsity) and enhances their performance on TruthfulQA.
Related papers
- From LLMs to Agents: A Comparative Evaluation of LLMs and LLM-based Agents in Security Patch Detection [42.089851895083804]
Large language models (LLMs) and LLM-based agents have demonstrated remarkable capabilities in various software engineering tasks.<n>We conduct a comprehensive evaluation of the performance of LLMs and LLM-based agents for security patch detection.<n>Our findings reveal that the Data-Aug LLM achieves the best overall performance, whereas the ReAct Agent demonstrates the lowest false positive rate (FPR)
arXiv Detail & Related papers (2025-11-11T09:58:41Z) - Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads [104.9566359759396]
We propose a lightweight alternative for step-level reasoning verification based on data-driven uncertainty scores.<n>Our findings suggest that the internal states of LLMs encode their uncertainty and can serve as reliable signals for reasoning verification.
arXiv Detail & Related papers (2025-11-09T03:38:29Z) - Cleanse: Uncertainty Estimation Approach Using Clustering-based Semantic Consistency in LLMs [5.161416961439468]
This study proposes an effective uncertainty estimation approach, textbfClusttextbfering-based semtextbfantic contextbfsisttextbfency (textbfCleanse)<n>The effectiveness of Cleanse for detecting hallucination is validated using four off-the-shelf models, LLaMA-7B, LLaMA-13B, LLaMA2-7B and Mistral-7B.
arXiv Detail & Related papers (2025-07-19T14:48:24Z) - Lightweight Latent Verifiers for Efficient Meta-Generation Strategies [0.5892638927736115]
Verifiers are auxiliary models that assess the correctness of outputs generated by base large language models (LLMs)<n>In this work, we introduce a novel lightweight verification approach, LiLaVe, which reliably extracts correctness signals from the hidden states of the base LLM.<n>A key advantage of LiLaVe is its ability to operate with only a small fraction of the computational budget required by traditional LLM-based verifiers.
arXiv Detail & Related papers (2025-04-23T14:33:20Z) - "I know myself better, but not really greatly": How Well Can LLMs Detect and Explain LLM-Generated Texts? [10.454446545249096]
This paper investigates detection and explanation capabilities of current LLMs across two settings: binary (human vs. LLM-generated) and ternary classification (including an undecided'' class)<n>We evaluate 6 close- and open-source LLMs of varying sizes and find that self-detection (LLMs identifying their own outputs) consistently outperforms cross-detection (identifying outputs from other LLMs)<n>Our findings underscore the limitations of current LLMs in self-detection and self-explanation, highlighting the need for further research to address overfitting and enhance generalizability.
arXiv Detail & Related papers (2025-02-18T11:00:28Z) - Provenance: A Light-weight Fact-checker for Retrieval Augmented LLM Generation Output [49.893971654861424]
We present a light-weight approach for detecting nonfactual outputs from retrieval-augmented generation (RAG)
We compute a factuality score that can be thresholded to yield a binary decision.
Our experiments show high area under the ROC curve (AUC) across a wide range of relevant open source datasets.
arXiv Detail & Related papers (2024-11-01T20:44:59Z) - Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z) - Fact-and-Reflection (FaR) Improves Confidence Calibration of Large Language Models [84.94220787791389]
We propose Fact-and-Reflection (FaR) prompting, which improves the LLM calibration in two steps.
Experiments show that FaR achieves significantly better calibration; it lowers the Expected Error by 23.5%.
FaR even elicits the capability of verbally expressing concerns in less confident scenarios.
arXiv Detail & Related papers (2024-02-27T01:37:23Z) - When Do LLMs Need Retrieval Augmentation? Mitigating LLMs' Overconfidence Helps Retrieval Augmentation [66.01754585188739]
Large Language Models (LLMs) have been found to have difficulty knowing they do not possess certain knowledge.
Retrieval Augmentation (RA) has been extensively studied to mitigate LLMs' hallucinations.
We propose several methods to enhance LLMs' perception of knowledge boundaries and show that they are effective in reducing overconfidence.
arXiv Detail & Related papers (2024-02-18T04:57:19Z) - Benchmarking LLMs via Uncertainty Quantification [91.72588235407379]
The proliferation of open-source Large Language Models (LLMs) has highlighted the urgent need for comprehensive evaluation methods.
We introduce a new benchmarking approach for LLMs that integrates uncertainty quantification.
Our findings reveal that: I) LLMs with higher accuracy may exhibit lower certainty; II) Larger-scale LLMs may display greater uncertainty compared to their smaller counterparts; and III) Instruction-finetuning tends to increase the uncertainty of LLMs.
arXiv Detail & Related papers (2024-01-23T14:29:17Z) - Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z) - The Internal State of an LLM Knows When It's Lying [18.886091925252174]
Large Language Models (LLMs) have shown exceptional performance in various tasks.
One of their most prominent drawbacks is generating inaccurate or false information with a confident tone.
We provide evidence that the LLM's internal state can be used to reveal the truthfulness of statements.
arXiv Detail & Related papers (2023-04-26T02:49:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.