Language Models Resist Alignment: Evidence From Data Compression
- URL: http://arxiv.org/abs/2406.06144v4
- Date: Wed, 11 Jun 2025 17:23:47 GMT
- Title: Language Models Resist Alignment: Evidence From Data Compression
- Authors: Jiaming Ji, Kaile Wang, Tianyi Qiu, Boyuan Chen, Jiayi Zhou, Changye Li, Hantao Lou, Juntao Dai, Yunhuai Liu, Yaodong Yang
- Abstract summary: Large language models (LLMs) may exhibit unintended or undesirable behaviors. We show that fine-tuning disproportionately undermines alignment relative to pre-training, potentially by orders of magnitude. Our findings underscore the need to address the inherent elasticity of LLMs to mitigate their resistance to alignment.
- Score: 11.208226196119895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) may exhibit unintended or undesirable behaviors. Recent works have concentrated on aligning LLMs to mitigate harmful outputs. Despite these efforts, some anomalies indicate that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally. Does alignment fine-tuning have robust effects on models, or are its impacts merely superficial? In this work, we make the first exploration of this phenomenon from both theoretical and empirical perspectives. Empirically, we demonstrate the $\mathbf{elasticity}$ of post-alignment models, i.e., their tendency to revert to the behavior distribution formed during the pre-training phase upon further fine-tuning. Leveraging compression theory, we formally deduce that fine-tuning disproportionately undermines alignment relative to pre-training, potentially by orders of magnitude. We validate the presence of elasticity through experiments on models of varying types and scales. Specifically, we find that model performance declines rapidly before reverting to the pre-training distribution, after which the rate of decline drops significantly. Furthermore, we reveal that elasticity positively correlates with model size and with the amount of pre-training data. Our findings underscore the need to address the inherent elasticity of LLMs to mitigate their resistance to alignment. The model weights and code are available at pku-lm-resist-alignment.github.io.
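The compression-theoretic argument above works with a model's code length on data, i.e., its negative log-likelihood expressed as a compression rate. Below is a minimal sketch of how such a rate (in bits per byte) can be estimated for a Hugging Face causal LM; the checkpoint name, the sample texts, and the bits-per-byte convention are illustrative assumptions, not the authors' released evaluation code.

```python
# Minimal sketch (illustrative, not the paper's code): estimate a causal LM's
# compression rate on a text sample as bits per byte, i.e., total cross-entropy
# in bits divided by the number of UTF-8 bytes encoded.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; compare a base checkpoint against its aligned counterpart
texts = ["A sample document drawn from the evaluation corpus."]  # placeholder data

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

total_nats, total_bytes = 0.0, 0
with torch.no_grad():
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids
        # .loss is the mean cross-entropy in nats over the predicted tokens
        loss = model(ids, labels=ids).loss
        total_nats += loss.item() * (ids.shape[1] - 1)
        total_bytes += len(text.encode("utf-8"))

bits_per_byte = total_nats / math.log(2) / total_bytes
print(f"compression rate: {bits_per_byte:.3f} bits/byte")
```

Tracking how this rate moves on pre-training-like data versus alignment data, before and after further fine-tuning of an aligned checkpoint, is one concrete way to probe the reversion behavior the abstract calls elasticity.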
Related papers
- How Alignment Shrinks the Generative Horizon [20.243063721305116]
Branching Factor (BF) is a token-invariant measure of the effective number of plausible next steps during generation. Alignment tuning substantially sharpens the model's output distribution from the outset. Building on this insight, we find this stability has surprising implications for complex reasoning.
arXiv Detail & Related papers (2025-06-22T02:00:37Z) - Mitigating Spurious Correlations in LLMs via Causality-Aware Post-Training [57.03005244917803]
Large language models (LLMs) often fail on out-of-distribution (OOD) samples due to spurious correlations acquired during pre-training. Here, we aim to mitigate such spurious correlations through causality-aware post-training (CAPT). Experiments on the formal causal inference benchmark CLadder and the logical reasoning dataset PrOntoQA show that 3B-scale language models fine-tuned with CAPT can outperform both traditional SFT and larger LLMs on in-distribution (ID) and OOD tasks.
arXiv Detail & Related papers (2025-06-11T06:30:28Z) - When Models Don't Collapse: On the Consistency of Iterative MLE [34.99810116340191]
We study model collapse for maximum likelihood estimation (MLE). We establish non-asymptotic bounds showing that collapse can be avoided even as the fraction of real data vanishes. We prove that some assumptions (beyond MLE consistency) are indeed necessary.
arXiv Detail & Related papers (2025-05-25T08:50:46Z) - Causal Lifting of Neural Representations: Zero-Shot Generalization for Causal Inferences [56.23412698865433]
We focus on causal inferences on a target experiment with unlabeled factual outcomes, retrieved by a predictive model fine-tuned on a labeled similar experiment.
First, we show that factual outcome estimation via Empirical Risk Minimization (ERM) may fail to yield valid causal inferences on the target population.
We propose Deconfounded Empirical Risk Minimization (DERM), a new simple learning procedure minimizing the risk over a fictitious target population.
arXiv Detail & Related papers (2025-02-10T10:52:17Z) - Strong Model Collapse [16.071600606637908]
We consider a supervised regression setting and establish the existence of a strong form of the model collapse phenomenon.
Our results show that even the smallest fraction of synthetic data can lead to model collapse.
We investigate whether increasing model size, an approach aligned with current trends in training large language models, exacerbates or mitigates model collapse.
arXiv Detail & Related papers (2024-10-07T08:54:23Z) - Causal Estimation of Memorisation Profiles [58.20086589761273]
Understanding memorisation in language models has practical and societal implications.
Memorisation is the causal effect of training with an instance on the model's ability to predict that instance.
This paper proposes a new, principled, and efficient method to estimate memorisation based on the difference-in-differences design from econometrics.
arXiv Detail & Related papers (2024-06-06T17:59:09Z) - Low-rank finetuning for LLMs: A fairness perspective [54.13240282850982]
Low-rank approximation techniques have become the de facto standard for fine-tuning Large Language Models.
This paper investigates the effectiveness of these methods in capturing the shift of fine-tuning datasets from the initial pre-trained data distribution.
We show that low-rank fine-tuning inadvertently preserves undesirable biases and toxic behaviors.
arXiv Detail & Related papers (2024-05-28T20:43:53Z) - Post-Hoc Reversal: Are We Selecting Models Prematurely? [13.910702424593797]
We show a phenomenon that we call post-hoc reversal, where performance trends are reversed after applying post-hoc transforms.
Preliminary analyses suggest that these transforms induce reversal by suppressing the influence of mislabeled examples.
We propose post-hoc selection, a simple technique whereby post-hoc metrics inform model development decisions.
arXiv Detail & Related papers (2024-04-11T14:58:19Z) - Ask Your Distribution Shift if Pre-Training is Right for You [67.90850628695563]
In practice, fine-tuning a pre-trained model improves robustness significantly in some cases but not at all in others. We focus on two possible failure modes of models under distribution shift: poor extrapolation and biases in the training data. Our study suggests that, as a rule of thumb, pre-training can help mitigate poor extrapolation but not dataset biases.
arXiv Detail & Related papers (2024-02-29T23:46:28Z) - Selective Learning: Towards Robust Calibration with Dynamic Regularization [79.92633587914659]
Miscalibration in deep learning refers to a discrepancy between a model's predicted confidence and its actual performance.
We introduce Dynamic Regularization (DReg), which aims to learn what should be learned during training, thereby circumventing the confidence-adjustment trade-off.
arXiv Detail & Related papers (2024-02-13T11:25:20Z) - Model Collapse Demystified: The Case of Regression [12.115359951879462]
We study the phenomenon of "model collapse" in the era of proliferation of large language and image generation models.
We obtain analytic formulae which quantitatively outline this phenomenon in a broad range of regimes.
We propose a simple strategy based on adaptive regularization to mitigate model collapse.
arXiv Detail & Related papers (2024-02-12T15:26:01Z) - Selecting Large Language Model to Fine-tune via Rectified Scaling Law [74.84096546112215]
Given constrained resources, fine-tuning all models and making selections afterward is unrealistic.
We find that the fine-tuning scaling curve includes not just the well-known "power phase" but also the previously unobserved "pre-power phase".
By leveraging our law, we propose a novel LLM selection algorithm that selects the near-optimal model with hundreds of times less resource consumption.
arXiv Detail & Related papers (2024-02-04T01:55:00Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - A PAC-Bayesian Perspective on the Interpolating Information Criterion [54.548058449535155]
We show how a PAC-Bayes bound is obtained for a general class of models, characterizing factors which influence performance in the interpolating regime.
We quantify how the test error of overparameterized models that achieve effectively zero training error depends on the quality of the implicit regularization imposed by, e.g., the combination of model and parameter-initialization scheme.
arXiv Detail & Related papers (2023-11-13T01:48:08Z) - Bias in Pruned Vision Models: In-Depth Analysis and Countermeasures [93.17009514112702]
Pruning, setting a significant subset of the parameters of a neural network to zero, is one of the most popular methods of model compression.
Despite existing evidence for this phenomenon, the relationship between neural network pruning and induced bias is not well-understood.
arXiv Detail & Related papers (2023-04-25T07:42:06Z) - Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z) - Relating Regularization and Generalization through the Intrinsic Dimension of Activations [11.00580615194563]
We show that common regularization techniques uniformly decrease the last-layer ID (LLID) of validation set activations for image classification models.
We also examine the LLID over the course of training of models that exhibit grokking.
arXiv Detail & Related papers (2022-11-23T19:00:00Z) - Mitigating Catastrophic Forgetting in Scheduled Sampling with Elastic Weight Consolidation in Neural Machine Translation [15.581515781839656]
Autoregressive models trained with maximum likelihood estimation suffer from exposure bias.
We propose using Elastic Weight Consolidation (EWC) as a trade-off between mitigating exposure bias and retaining output quality (a minimal sketch of the EWC penalty appears after this list).
Experiments on two IWSLT'14 translation tasks demonstrate that our approach alleviates catastrophic forgetting and significantly improves BLEU.
arXiv Detail & Related papers (2021-09-13T20:37:58Z) - Extreme Memorization via Scale of Initialization [72.78162454173803]
We construct an experimental setup in which changing the scale of initialization strongly impacts the implicit regularization induced by SGD.
We find that the extent and manner in which generalization ability is affected depends on the activation and loss function used.
In the case of the homogeneous ReLU activation, we show that this behavior can be attributed to the loss function.
arXiv Detail & Related papers (2020-08-31T04:53:11Z) - Elastic weight consolidation for better bias inoculation [24.12790037712358]
Elastic weight consolidation (EWC) allows fine-tuning of models to mitigate biases.
EWC dominates standard fine-tuning, yielding models with lower levels of forgetting on the original (biased) dataset.
arXiv Detail & Related papers (2020-04-29T17:45:12Z)
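Two of the entries above lean on Elastic Weight Consolidation (EWC), which keeps fine-tuned parameters close to their previous values in proportion to an importance estimate. As referenced from the scheduled-sampling entry, here is a minimal sketch of that penalty in PyTorch; the function names and the diagonal-Fisher approximation via mean squared gradients are assumptions of this illustration, not either paper's released code.

```python
# Illustrative sketch of the EWC penalty: a quadratic term that anchors
# parameters deemed important for the original task, with importance
# approximated by a diagonal Fisher estimate (mean squared gradients).
import torch


def diagonal_fisher(model, data_loader, loss_fn):
    """Approximate per-parameter importance as the mean squared gradient."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    model.eval()
    for inputs, targets in data_loader:
        model.zero_grad()
        loss_fn(model(inputs), targets).backward()
        for n, p in model.named_parameters():
            if n in fisher and p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(len(data_loader), 1) for n, f in fisher.items()}


def ewc_penalty(model, fisher, anchor_params, lam=1.0):
    """Compute lam/2 * sum_i F_i * (theta_i - theta*_i)^2 over all anchored parameters."""
    penalty = 0.0
    for n, p in model.named_parameters():
        if n in fisher:
            penalty = penalty + (fisher[n] * (p - anchor_params[n]) ** 2).sum()
    return 0.5 * lam * penalty

# During fine-tuning on the new objective, the total loss would be
#   total_loss = task_loss + ewc_penalty(model, fisher, anchor_params, lam)
# where anchor_params is a detached copy of the parameters taken before fine-tuning.
```

The penalty is the classic form $\frac{\lambda}{2}\sum_i F_i(\theta_i - \theta_i^{*})^2$, added to the task loss during fine-tuning; a larger $\lambda$ trades plasticity on the new objective for retention of the old behavior.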