Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs
- URL: http://arxiv.org/abs/2508.00161v1
- Date: Thu, 31 Jul 2025 21:04:12 GMT
- Title: Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs
- Authors: Ziqian Zhong, Aditi Raghunathan,
- Abstract summary: We introduce a new method for understanding, monitoring and controlling fine-tuned large language models (LLMs)<n>We demonstrate that the top singular of the weight difference between a fine-tuned model and its base model correspond to newly acquired behaviors.<n>For backdoored models that bypasses safety mechanisms when a secret trigger is present, our method stops up to 100% of attacks with a false positive rate below 1.2%.
- Score: 14.779177849006963
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The releases of powerful open-weight large language models (LLMs) are often not accompanied by access to their full training data. Existing interpretability methods, particularly those based on activations, often require or assume distributionally similar data. This is a significant limitation when detecting and defending against novel potential threats like backdoors, which are by definition out-of-distribution. In this work, we introduce a new method for understanding, monitoring and controlling fine-tuned LLMs that interprets weights, rather than activations, thereby side stepping the need for data that is distributionally similar to the unknown training data. We demonstrate that the top singular vectors of the weight difference between a fine-tuned model and its base model correspond to newly acquired behaviors. By monitoring the cosine similarity of activations along these directions, we can detect salient behaviors introduced during fine-tuning with high precision. For backdoored models that bypasses safety mechanisms when a secret trigger is present, our method stops up to 100% of attacks with a false positive rate below 1.2%. For models that have undergone unlearning, we detect inference on erased topics with accuracy up to 95.42% and can even steer the model to recover "unlearned" information. Besides monitoring, our method also shows potential for pre-deployment model auditing: by analyzing commercial instruction-tuned models (OLMo, Llama, Qwen), we are able to uncover model-specific fine-tuning focus including marketing strategies and Midjourney prompt generation. Our implementation can be found at https://github.com/fjzzq2002/WeightWatch.
Related papers
- Lie Detector: Unified Backdoor Detection via Cross-Examination Framework [68.45399098884364]
We propose a unified backdoor detection framework in the semi-honest setting.<n>Our method achieves superior detection performance, improving accuracy by 5.4%, 1.6%, and 11.9% over SoTA baselines.<n> Notably, it is the first to effectively detect backdoors in multimodal large language models.
arXiv Detail & Related papers (2025-03-21T06:12:06Z) - Steering Without Side Effects: Improving Post-Deployment Control of Language Models [61.99293520621248]
Language models (LMs) have been shown to behave unexpectedly post-deployment.
We present KL-then-steer (KTS), a technique that decreases the side effects of steering while retaining its benefits.
Our best method prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model.
arXiv Detail & Related papers (2024-06-21T01:37:39Z) - Lazy Layers to Make Fine-Tuned Diffusion Models More Traceable [70.77600345240867]
A novel arbitrary-in-arbitrary-out (AIAO) strategy makes watermarks resilient to fine-tuning-based removal.
Unlike the existing methods of designing a backdoor for the input/output space of diffusion models, in our method, we propose to embed the backdoor into the feature space of sampled subpaths.
Our empirical studies on the MS-COCO, AFHQ, LSUN, CUB-200, and DreamBooth datasets confirm the robustness of AIAO.
arXiv Detail & Related papers (2024-05-01T12:03:39Z) - Selective Learning: Towards Robust Calibration with Dynamic Regularization [79.92633587914659]
Miscalibration in deep learning refers to there is a discrepancy between the predicted confidence and performance.
We introduce Dynamic Regularization (DReg) which aims to learn what should be learned during training thereby circumventing the confidence adjusting trade-off.
arXiv Detail & Related papers (2024-02-13T11:25:20Z) - Constraining Representations Yields Models That Know What They Don't
Know [2.729898906885749]
A well-known failure mode of neural networks is that they may confidently return erroneous predictions.
This work presents a novel direction to address these issues in a broad, general manner.
We assign to each class a unique, fixed, randomly-generated binary vector - hereafter called class code.
We train the model so that its cross-depths activation patterns predict the appropriate class code according to the input sample's class.
arXiv Detail & Related papers (2022-08-30T18:28:00Z) - MOVE: Effective and Harmless Ownership Verification via Embedded External Features [104.97541464349581]
We propose an effective and harmless model ownership verification (MOVE) to defend against different types of model stealing simultaneously.<n>We conduct the ownership verification by verifying whether a suspicious model contains the knowledge of defender-specified external features.<n>We then train a meta-classifier to determine whether a model is stolen from the victim.
arXiv Detail & Related papers (2022-08-04T02:22:29Z) - Identifying Backdoor Attacks in Federated Learning via Anomaly Detection [31.197488921578984]
Federated learning is vulnerable to backdoor attacks.
This paper proposes an effective defense against the attack by examining shared model updates.
We demonstrate through extensive analyses that our proposed methods effectively mitigate state-of-the-art backdoor attacks.
arXiv Detail & Related papers (2022-02-09T07:07:42Z) - Monitoring Model Deterioration with Explainable Uncertainty Estimation
via Non-parametric Bootstrap [0.0]
Monitoring machine learning models once they are deployed is challenging.
It is even more challenging to decide when to retrain models in real-case scenarios when labeled data is beyond reach.
In this work, we use non-parametric bootstrapped uncertainty estimates and SHAP values to provide explainable uncertainty estimation.
arXiv Detail & Related papers (2022-01-27T17:23:04Z) - Probing Model Signal-Awareness via Prediction-Preserving Input
Minimization [67.62847721118142]
We evaluate models' ability to capture the correct vulnerability signals to produce their predictions.
We measure the signal awareness of models using a new metric we propose- Signal-aware Recall (SAR)
The results show a sharp drop in the model's Recall from the high 90s to sub-60s with the new metric.
arXiv Detail & Related papers (2020-11-25T20:05:23Z) - Model Assertions for Monitoring and Improving ML Models [26.90089824436192]
We propose a new abstraction, model assertions, that adapts the classical use of program assertions as a way to monitor and improve ML models.
Model assertions are arbitrary functions over a model's input and output that indicate when errors may be occurring.
We propose methods of using model assertions at all stages of ML system deployment, including runtime monitoring, validating labels, and continuously improving ML models.
arXiv Detail & Related papers (2020-03-03T17:49:49Z) - Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods [24.190587751595455]
Weak supervision is a popular method for building machine learning models without relying on ground truth annotations.
Existing approaches use latent variable estimation to model the noisy sources.
We show that for a class of latent variable models highly applicable to weak supervision, we can find a closed-form solution to model parameters.
We use this insight to build FlyingSquid, a weak supervision framework that runs orders of magnitude faster than previous weak supervision approaches.
arXiv Detail & Related papers (2020-02-27T07:51:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.