Related papers: Faithful-Patchscopes: Understanding and Mitigating Model Bias in Hidden Representations Explanation of Large Language Models

Faithful-Patchscopes: Understanding and Mitigating Model Bias in Hidden Representations Explanation of Large Language Models

URL: http://arxiv.org/abs/2602.00300v1
Date: Fri, 30 Jan 2026 20:50:18 GMT
Title: Faithful-Patchscopes: Understanding and Mitigating Model Bias in Hidden Representations Explanation of Large Language Models
Authors: Xilin Gong, Shu Yang, Zehua Cao, Lynne Billard, Di Wang,
Abstract summary: We show that large language models tend to rely on inherent linguistic patterns, which can override contextual information encoded in hidden representations.<n>This behavior reveals a systematic unfaithfulness in Patchscopes.<n>We propose Bias Alignment through Logit Recalibration (BALOR), which treats the output logits from an unpatched prompt as capturing model bias and contrasts them with logits obtained under patched contextual information.
Score: 6.630866776464356
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have demonstrated strong capabilities for hidden representation interpretation through Patchscopes, a framework that uses LLMs themselves to generate human-readable explanations by decoding from internal hidden representations. However, our work shows that LLMs tend to rely on inherent linguistic patterns, which can override contextual information encoded in the hidden representations during decoding. For example, even when a hidden representation encodes the contextual attribute "purple" for "broccoli", LLMs still generate "green" in their explanations, reflecting a strong prior association. This behavior reveals a systematic unfaithfulness in Patchscopes. To systematically study this issue, we first designed a dataset to evaluate the faithfulness of Patchscopes under biased cases, and our results show that there is an 18.84\% faithfulness decrease on average. We then propose Bias Alignment through Logit Recalibration (BALOR), which treats the output logits from an unpatched prompt as capturing model bias and contrasts them with logits obtained under patched contextual information. By recalibrating the logit distribution through this contrast, BALOR suppresses model bias and amplifies contextual information during generation. Experiments across multiple LLMs demonstrate that BALOR consistently outperforms existing baselines, achieving up to 33\% relative performance improvement.

Related papers

LLM Microscope: What Model Internals Reveal About Answer Correctness and Context Utilization [9.410181019585822]
We operationalize interpretability methods to ascertain whether we can predict the correctness of model outputs.<n>We consider correct, incorrect, and irrelevant context and introduce metrics to distinguish amongst them.<n>Our model-internals-based metric significantly outperforms prompting baselines at distinguishing between correct and incorrect context.
arXiv Detail & Related papers (2025-10-05T03:14:05Z)
RepreGuard: Detecting LLM-Generated Text by Revealing Hidden Representation Patterns [50.401907401444404]
Large language models (LLMs) are crucial for preventing misuse and building trustworthy AI systems.<n>We propose RepreGuard, an efficient statistics-based detection method.<n> Experimental results show that RepreGuard outperforms all baselines with average 94.92% AUROC on both in-distribution (ID) and OOD scenarios.
arXiv Detail & Related papers (2025-08-18T17:59:15Z)
Factual Self-Awareness in Language Models: Representation, Robustness, and Scaling [56.26834106704781]
Factual incorrectness in generated content is one of the primary concerns in ubiquitous deployment of large language models (LLMs)<n>We provide evidence supporting the presence of LLMs' internal compass that dictate the correctness of factual recall at the time of generation.<n>Scaling experiments across model sizes and training dynamics highlight that self-awareness emerges rapidly during training and peaks in intermediate layers.
arXiv Detail & Related papers (2025-05-27T16:24:02Z)
Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders [29.356200147371275]
Large language models (LLMs) excel at handling human queries, but they can occasionally generate flawed or unexpected responses.<n>We propose using a fixed vocabulary set for feature interpretations and designing a mutual information-based objective.<n>We propose two runtime steering strategies that adjust the learned feature activations based on their corresponding explanations.
arXiv Detail & Related papers (2025-02-21T16:36:42Z)
Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs)<n>We find that fine-tuning text embedding models on LLM-generated texts yields excellent classification accuracy.<n>We leverage LLM as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
arXiv Detail & Related papers (2025-02-17T18:59:02Z)
LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations [46.351064535592336]
Large language models (LLMs) often produce errors, including factual inaccuracies, biases, and reasoning failures.<n>Recent studies have demonstrated that LLMs' internal states encode information regarding the truthfulness of their outputs.<n>We show that the internal representations of LLMs encode much more information about truthfulness than previously recognized.
arXiv Detail & Related papers (2024-10-03T17:31:31Z)
From Distributional to Overton Pluralism: Investigating Large Language Model Alignment [82.99849359892112]
We re-examine previously reported reductions in response diversity post-alignment.<n>Our analysis suggests that an apparent drop in the diversity of responses is largely explained by quality control and information aggregation.<n>Findings indicate that current alignment techniques capture but do not extend the useful subset of assistant-like base LLM behavior.
arXiv Detail & Related papers (2024-06-25T16:32:33Z)
Debiasing Multimodal Large Language Models via Penalization of Language Priors [38.97645845493758]
Multimodal Large Language Models (MLLMs) have become indispensable tools in computer vision and natural language processing.<n>Despite their advancements, our investigation reveals a noteworthy bias: the generated content is often driven more by the inherent priors of the underlying Large Language Models (LLMs) than by the input image.<n>We propose two simple, training-free strategies to rectify these biases and redirect the model's focus toward visual information.
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models [55.60306377044225]
"SelfCheckGPT" is a simple sampling-based approach to fact-check the responses of black-box models. We investigate this approach by using GPT-3 to generate passages about individuals from the WikiBio dataset.
arXiv Detail & Related papers (2023-03-15T19:31:21Z)
Attention-likelihood relationship in transformers [2.8304391396200064]
We analyze how large language models (LLMs) represent out-of-context words, investigating their reliance on the given context to capture their semantics. Our likelihood-guided text perturbations reveal a correlation between token likelihood and attention values in transformer-based language models.
arXiv Detail & Related papers (2023-03-15T00:23:49Z)
What Are You Token About? Dense Retrieval as Distributions Over the Vocabulary [68.77983831618685]
We propose to interpret the vector representations produced by dual encoders by projecting them into the model's vocabulary space. We show that the resulting projections contain rich semantic information, and draw connection between them and sparse retrieval.
arXiv Detail & Related papers (2022-12-20T16:03:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.