Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing
- URL: http://arxiv.org/abs/2407.08770v1
- Date: Thu, 11 Jul 2024 17:52:03 GMT
- Title: Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing
- Authors: Huanqian Wang, Yang Yue, Rui Lu, Jingxin Shi, Andrew Zhao, Shenzhi Wang, Shiji Song, Gao Huang,
- Abstract summary: Large Language Models (LLMs) have demonstrated great potential as generalist assistants.
It is crucial that these models exhibit desirable behavioral traits, such as non-toxicity and resilience against jailbreak attempts.
In this paper, we observe that directly editing a small subset of parameters can effectively modulate specific behaviors of LLMs.
- Score: 63.20133320524577
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated great potential as generalist assistants, showcasing powerful task understanding and problem-solving capabilities. To deploy LLMs as AI assistants, it is crucial that these models exhibit desirable behavioral traits, such as non-toxicity and resilience against jailbreak attempts. Current methods for detoxification or preventing jailbreaking usually involve Supervised Fine-Tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF), which requires finetuning billions of parameters through gradient descent with substantial computation cost. Furthermore, models modified through SFT and RLHF may deviate from the pretrained models, potentially leading to a degradation in foundational LLM capabilities. In this paper, we observe that surprisingly, directly editing a small subset of parameters can effectively modulate specific behaviors of LLMs, such as detoxification and resistance to jailbreaking. Specifically, for a behavior that we aim to avoid, we employ a linear classifier, which we term the behavior probe, to classify binary behavior labels within the hidden state space of the LLM. Using this probe, we introduce an algorithm to identify a critical subset of LLM parameters that significantly influence this targeted behavior. Then we directly edit these selected parameters by shifting them towards the behavior probe. Such a direct parameter editing method necessitates only inference-level computational resources. Experiments demonstrate that in the representative detoxification task, our approach achieves reductions of up to 90.0\% in toxicity on the RealToxicityPrompts dataset and 49.2\% on ToxiGen, while maintaining the LLM's general capabilities in areas such as common sense, question answering, and mathematics. Our code is available at https://github.com/lucywang720/model-surgery.
Related papers
- Gradient-Mask Tuning Elevates the Upper Limits of LLM Performance [51.36243421001282]
Gradient-Mask Tuning (GMT) is a method that selectively updates parameters during training based on their gradient information.
Our empirical results across various tasks demonstrate that GMT not only outperforms traditional fine-tuning methods but also elevates the upper limits of LLM performance.
arXiv Detail & Related papers (2024-06-21T17:42:52Z) - Are you still on track!? Catching LLM Task Drift with Activations [55.75645403965326]
Large Language Models (LLMs) are routinely used in retrieval-augmented applications to orchestrate tasks and process inputs from users and other sources.
This opens the door to prompt injection attacks, where the LLM receives and acts upon instructions from supposedly data-only sources, thus deviating from the user's original instructions.
We define this as task drift, and we propose to catch it by scanning and analyzing the LLM's activations.
We show that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions, without being trained on any of these attacks.
arXiv Detail & Related papers (2024-06-02T16:53:21Z) - One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models [67.49462724595445]
Retrieval-augmented generation (RAG) is a promising way to improve large language models (LLMs)
We propose a novel method that involves learning scalable and pluggable virtual tokens for RAG.
arXiv Detail & Related papers (2024-05-30T03:44:54Z) - Efficient Reinforcement Learning via Large Language Model-based Search [27.307583105810895]
Large Language Models (LLMs) have rapidly gained prominence across a magnitude of natural language tasks.
We propose MEDIC: a framework that augments LLMs with a Model-based feEDback critIC to generate a possibly sub-optimal but valid plan for an abstract problem.
Our experiments show 1) the effectiveness of augmenting LLMs with MEDIC, 2) a significant improvement in the sample complexity of PPO and A2C-based RL agents when guided by our LLM-generated plan, and 3) pave the direction for further explorations of how these models can be used.
arXiv Detail & Related papers (2024-05-24T03:53:57Z) - Do LLM Agents Have Regret? A Case Study in Online Learning and Games [30.377709765198592]
Large language models (LLMs) have been increasingly employed for (interactive) decision-making.
We study their interactions in benchmark decision-making settings in online learning and game theory.
We propose a novel emphun training loss of emphregret-loss, which, in contrast to the supervised pre-training loss, does not require the labels of (supervised) actions.
arXiv Detail & Related papers (2024-03-25T15:04:11Z) - Detoxifying Large Language Models via Knowledge Editing [57.0669577257301]
This paper investigates using knowledge editing techniques to detoxify Large Language Models (LLMs)
We construct a benchmark, SafeEdit, which covers nine unsafe categories with various powerful attack prompts.
We conduct experiments with several knowledge editing approaches, indicating that knowledge editing has the potential to detoxify LLMs with a limited impact on general performance efficiently.
arXiv Detail & Related papers (2024-03-21T15:18:30Z) - Human-Readable Fingerprint for Large Language Models [47.952699246648045]
We introduce a human-readable fingerprint for large language models (LLMs)
Our method generates a dog image as an identity fingerprint for an LLM, where the dog's appearance strongly indicates the LLM's base model.
arXiv Detail & Related papers (2023-12-08T05:01:47Z) - ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.