How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence
- URL: http://arxiv.org/abs/2504.02904v1
- Date: Thu, 03 Apr 2025 06:30:55 GMT
- Title: How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence
- Authors: Hongzhe Du, Weikai Li, Min Cai, Karim Saraipour, Zimin Zhang, Himabindu Lakkaraju, Yizhou Sun, Shichang Zhang
- Abstract summary: Post-training is essential for the success of large language models (LLMs). We compare base and post-trained LLMs from four perspectives to better understand post-training effects.
- Score: 46.47170768927952
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Post-training is essential for the success of large language models (LLMs), transforming pre-trained base models into more useful and aligned post-trained models. While plenty of works have studied post-training algorithms and evaluated post-trained models by their outputs, it remains understudied how post-training reshapes LLMs internally. In this paper, we compare base and post-trained LLMs mechanistically from four perspectives to better understand post-training effects. Our findings across model families and datasets reveal that: (1) Post-training does not change the factual knowledge storage locations, and it adapts knowledge representations from the base model while developing new knowledge representations; (2) Both truthfulness and refusal can be represented by linear vectors in the hidden representation space. The truthfulness direction is highly similar between the base and post-trained model, and it is effectively transferable for interventions; (3) The refusal direction is different between the base and post-trained models, and it shows limited forward transferability; (4) Differences in confidence between the base and post-trained models cannot be attributed to entropy neurons. Our study provides insights into the fundamental mechanisms preserved and altered during post-training, facilitates downstream tasks like model steering, and could potentially benefit future research in interpretability and LLM post-training.
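To make finding (2) concrete, the following is a minimal sketch, not the authors' released code, of the standard difference-of-means recipe for estimating a linear concept direction (such as truthfulness) from cached hidden states and applying it as an additive steering intervention; the layer choice, `alpha` scale, and random stand-in activations are illustrative assumptions.

```python
import torch

def estimate_direction(pos_hiddens: torch.Tensor, neg_hiddens: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction from cached hidden states at one layer.
    pos_hiddens / neg_hiddens: (n_examples, d_model), e.g. truthful vs. untruthful statements."""
    direction = pos_hiddens.mean(dim=0) - neg_hiddens.mean(dim=0)
    return direction / direction.norm()  # unit vector in the hidden representation space

def steer(hidden: torch.Tensor, direction: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Additive intervention: push a hidden state along (+alpha) or against (-alpha) the direction."""
    return hidden + alpha * direction

# Toy usage with random stand-ins for cached activations (d_model = 64).
pos = torch.randn(128, 64) + 0.5   # hidden states for "truthful" prompts (hypothetical)
neg = torch.randn(128, 64) - 0.5   # hidden states for "untruthful" prompts (hypothetical)
truth_dir = estimate_direction(pos, neg)
steered = steer(torch.randn(1, 64), truth_dir)  # in practice applied inside a forward hook
```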
Related papers
- Analyzing and Mitigating Object Hallucination: A Training Bias Perspective [108.09666587800781]
We introduce a new benchmark, POPEv2, which consists of counterfactual images collected from the training data of LVLMs with certain objects masked.
We find that current LVLMs suffer from training bias: they fail to fully leverage their training data and hallucinate more frequently on images seen during training.
We propose Obliviate, an efficient and lightweight unlearning method designed to mitigate object hallucination via training bias unlearning.
arXiv Detail & Related papers (2025-08-06T15:51:02Z) - Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining [74.83412846804977]
Reinforcement learning (RL)-based fine-tuning has become a crucial step in post-training language models.
We present a systematic end-to-end study of RL fine-tuning for mathematical reasoning by training models entirely from scratch.
arXiv Detail & Related papers (2025-04-10T17:15:53Z) - LLM Post-Training: A Deep Dive into Reasoning Large Language Models [131.10969986056]
Large Language Models (LLMs) have transformed the natural language processing landscape and brought to life diverse applications.
Post-training methods enable LLMs to refine their knowledge, improve reasoning, enhance factual accuracy, and align more effectively with user intents and ethical considerations.
arXiv Detail & Related papers (2025-02-28T18:59:54Z) - Transferable Post-training via Inverse Value Learning [83.75002867411263]
We propose modeling changes at the logits level during post-training using a separate neural network (i.e., the value network).
After this value network is trained on a small base model using demonstrations, it can be seamlessly integrated with other pre-trained models during inference.
We demonstrate that the resulting value network has broad transferability across pre-trained models of different parameter sizes.
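As a rough illustration of the mechanism described above, here is a hedged PyTorch sketch of logit-level composition: the output of a separately trained value network is added to a frozen base model's logits at inference time. The toy modules, shapes, and vocabulary size are stand-ins, not the paper's implementation.

```python
import torch
import torch.nn as nn

vocab_size, hidden = 100, 32

# Toy stand-ins: a frozen "pre-trained model" and a small, separately trained "value network".
# Both map token ids to next-token logits; only the additive composition matters here.
base_model = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, vocab_size))
value_net = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, vocab_size))

@torch.no_grad()
def compose_logits(input_ids: torch.Tensor) -> torch.Tensor:
    base_logits = base_model(input_ids)   # (batch, seq, vocab): the base model's predictions
    delta_logits = value_net(input_ids)   # logit-level changes learned from demonstrations
    return base_logits + delta_logits     # applied at inference, without touching the base weights

input_ids = torch.randint(0, vocab_size, (1, 8))
next_token = compose_logits(input_ids)[:, -1].argmax(dim=-1)
```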
arXiv Detail & Related papers (2024-10-28T13:48:43Z) - INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Language Models [40.54353850357839]
We show how we can employ submodular optimization to select highly representative subsets of the training corpora.
We show that the resulting models achieve up to ~99% of the performance of the fully-trained models.
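For context on the selection step, below is a hedged sketch of greedy facility-location maximization, a standard instance of submodular subset selection; the embedding-based similarity matrix and budget are illustrative assumptions, and the sketch is not the INGENIOUS pipeline itself.

```python
import numpy as np

def greedy_facility_location(similarity: np.ndarray, budget: int) -> list:
    """Greedily pick `budget` indices maximizing sum_i max_{j in S} similarity[i, j]."""
    n = similarity.shape[0]
    selected = []
    coverage = np.zeros(n)  # best similarity of each example to the selected subset so far
    for _ in range(budget):
        # Marginal gain of adding each candidate column j to the selected set.
        gains = np.maximum(similarity, coverage[:, None]).sum(axis=0) - coverage.sum()
        best = int(np.argmax(gains))
        selected.append(best)
        coverage = np.maximum(coverage, similarity[:, best])
    return selected

# Toy usage: cosine similarities between random "sentence embeddings" of 200 training examples.
emb = np.random.randn(200, 16)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
subset = greedy_facility_location(emb @ emb.T, budget=20)
```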
arXiv Detail & Related papers (2023-05-11T09:24:41Z) - Meet in the Middle: A New Pre-training Paradigm [41.52858444519968]
Most language models (LMs) are trained and applied in an autoregressive left-to-right fashion.
We propose a new pre-training paradigm with techniques that jointly improve the training data efficiency.
We show the effectiveness of our pre-training paradigm with extensive experiments on both programming and natural language models.
arXiv Detail & Related papers (2023-03-13T17:17:11Z) - Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z) - Can Offline Reinforcement Learning Help Natural Language Understanding? [31.788133426611587]
We investigate the potential connection between offline reinforcement learning (RL) and language modeling (LM).
RL and LM are similar in predicting the next state from the current and previous states, relying on both local and long-range dependencies across states.
Experimental results show that our RL pre-trained models can give close performance compared with the models using the LM training objective.
arXiv Detail & Related papers (2022-09-15T02:55:10Z) - Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)