Guardrail Baselines for Unlearning in LLMs
- URL: http://arxiv.org/abs/2403.03329v3
- Date: Tue, 11 Jun 2024 15:47:39 GMT
- Title: Guardrail Baselines for Unlearning in LLMs
- Authors: Pratiksha Thaker, Yash Maurya, Shengyuan Hu, Zhiwei Steven Wu, Virginia Smith
- Abstract summary: Finetuning is a promising approach to 'unlearn' concepts from large language models.
We show that guardrail-based approaches such as prompting and filtering can achieve unlearning results comparable to finetuning.
- Score: 33.86316928349476
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work has demonstrated that finetuning is a promising approach to 'unlearn' concepts from large language models. However, finetuning can be expensive, as it requires both generating a set of examples and running iterations of finetuning to update the model. In this work, we show that simple guardrail-based approaches such as prompting and filtering can achieve unlearning results comparable to finetuning. We recommend that researchers investigate these lightweight baselines when evaluating the performance of more computationally intensive finetuning methods. While we do not claim that methods such as prompting or filtering are universal solutions to the problem of unlearning, our work suggests the need for evaluation metrics that can better separate the power of guardrails vs. finetuning, and highlights scenarios where guardrails expose possible unintended behavior in existing metrics and benchmarks.
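To make the two baselines concrete, here is a minimal sketch of what a prompting guardrail and a filtering guardrail could look like in practice. This is an illustration only, not the authors' implementation: the `generate` stub, the forget topics, and the refusal string are all assumptions, and a filter could equally be applied to the incoming query rather than to the response.

```python
import re

def generate(prompt: str) -> str:
    """Hypothetical stand-in for the model being evaluated; swap in any real
    model or API call. Returns a canned string so the sketch runs as-is."""
    return "stub response"

# Illustrative forget set; the paper's benchmarks define the real targets.
FORGET_TOPICS = ["Topic A", "Topic B"]

# Baseline 1: a prompting guardrail. The weights are untouched; an instruction
# prepended to every query tells the model to act as if it never saw the topics.
GUARDRAIL_PREFIX = (
    "Answer as if you know nothing about: " + ", ".join(FORGET_TOPICS)
    + ". If asked about them, reply exactly 'I don't know.'\n\n"
)

def prompt_guardrail(user_query: str) -> str:
    return generate(GUARDRAIL_PREFIX + user_query)

# Baseline 2: a filtering guardrail. The model answers normally, and a
# post-hoc check redacts any response that still mentions a forget-set topic.
def filter_guardrail(user_query: str) -> str:
    response = generate(user_query)
    if any(re.search(re.escape(t), response, re.IGNORECASE) for t in FORGET_TOPICS):
        return "I don't know."
    return response
```

Neither function updates the model, which is the point of the paper's comparison: if such wrappers score well on an unlearning metric, the metric may not be separating guardrails from genuine weight-level forgetting.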
Related papers
- Context-aware Prompt Tuning: Advancing In-Context Learning with Adversarial Methods [69.36397993451742]
This work introduces Context-aware Prompt Tuning (CPT), a method inspired by in-context learning (ICL), prompt tuning (PT), and adversarial attacks.
We modify specific context tokens, considering the unique structure of input and output formats.
Inspired by adversarial attacks, we adjust the input based on the labels present in the context, focusing on minimizing, rather than maximizing, the loss.
arXiv Detail & Related papers (2024-10-22T17:45:47Z)
- Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning [50.26965628047682]
Adapting pre-trained models to open classes is a challenging problem in machine learning.
In this paper, we combine the advantages of both and propose a test-time prompt tuning approach.
Our proposed method outperforms all comparison methods on average considering both base and new classes.
arXiv Detail & Related papers (2024-08-29T12:34:01Z)
- Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models [63.36637269634553]
We present a novel method of further improving performance by requiring models to compare multiple reasoning chains.
We find that instruction tuning on DCoT datasets boosts the performance of even smaller, and therefore more accessible, language models.
arXiv Detail & Related papers (2024-07-03T15:01:18Z)
- A Semantic-based Layer Freezing Approach to Efficient Fine-Tuning of Language Models [32.178931149612644]
Finetuning language models (LMs) is crucial for adapting the models to downstream data and tasks.
Existing work, such as parameter-efficient finetuning (PEFT), often focuses on *how* to finetune but neglects the issue of *where* to finetune.
arXiv Detail & Related papers (2024-06-17T17:13:08Z)
- Step-On-Feet Tuning: Scaling Self-Alignment of LLMs via Bootstrapping [53.454408491386886]
Bootstrapping self-alignment markedly surpasses the single-round approach.
We propose Step-On-Feet Tuning (SOFT), which leverages the model's continuously enhanced few-shot ability to boost zero- and one-shot performance.
Building on an easy-to-hard training recipe, we propose SOFT+, which further boosts self-alignment performance.
arXiv Detail & Related papers (2024-02-12T12:30:42Z)
- Steering Large Language Models for Machine Translation with Finetuning and In-Context Learning [19.290966101497844]
Large language models (LLMs) are a promising avenue for machine translation (MT).
Their effectiveness depends heavily on the choice of few-shot examples, and they often require extra post-processing due to overgeneration.
We show that adapter-based finetuning with LoRA matches the performance of traditional finetuning while reducing the number of training parameters by a factor of 50.
arXiv Detail & Related papers (2023-10-20T12:29:51Z)
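A minimal sketch of where LoRA's parameter savings in the entry above come from, under assumptions not taken from the paper (a single 4096x4096 projection without bias, rank r=8, PyTorch): the pretrained weight is frozen and only a small low-rank update is trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pretrained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # start as a no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Rough accounting for one layer: full finetuning trains 4096*4096 ~ 16.8M
# weights, while LoRA trains 2*8*4096 = 65,536. The ~50x whole-model figure
# in the entry above depends on which layers and ranks the authors chose.
layer = LoRALinear(nn.Linear(4096, 4096, bias=False), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65536
```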
- Context-Aware Meta-Learning [52.09326317432577]
We propose a meta-learning algorithm that emulates Large Language Models by learning new visual concepts during inference without fine-tuning.
Our approach exceeds or matches the state-of-the-art algorithm, P>M>F, on 8 out of 11 meta-learning benchmarks.
arXiv Detail & Related papers (2023-10-17T03:35:27Z)
- Selecting Informative Contexts Improves Language Model Finetuning [66.26521454263343]
We present a general fine-tuning method that we call information gain filtration.
During fine-tuning, a secondary learner selects informative examples and skips uninformative ones.
We show that our method has consistent improvement across datasets, fine-tuning tasks, and language model architectures.
arXiv Detail & Related papers (2020-05-01T02:01:18Z)
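The selection loop described in the entry above might look like the following PyTorch-style sketch. The `scorer` callable is a hypothetical stand-in for the paper's secondary learner; this simple thresholding does not implement the actual information-gain estimate.

```python
def filtered_finetune(model, optimizer, loss_fn, dataset, scorer, threshold=0.0):
    """Fine-tune `model`, skipping examples the secondary scorer marks uninformative.

    `scorer(inputs) -> float` stands in for the paper's secondary learner; a real
    implementation would estimate each example's expected information gain.
    """
    for inputs, targets in dataset:
        if scorer(inputs) < threshold:
            continue  # skip the update entirely for low-value examples
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()  # PyTorch-style backward pass
        optimizer.step()
```

The design point is that the scorer is much cheaper than the main model, so skipping a gradient step on uninformative examples saves more compute than the scoring costs.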