Refusal Tokens: A Simple Way to Calibrate Refusals in Large Language Models
- URL: http://arxiv.org/abs/2412.06748v1
- Date: Mon, 09 Dec 2024 18:40:44 GMT
- Title: Refusal Tokens: A Simple Way to Calibrate Refusals in Large Language Models
- Authors: Neel Jain, Aditya Shrivastava, Chenyang Zhu, Daben Liu, Alfy Samuel, Ashwinee Panda, Anoop Kumar, Micah Goldblum, Tom Goldstein
- Abstract summary: A key component of building safe and reliable language models is enabling the models to appropriately refuse to answer certain questions.
We propose refusal tokens, either one token per refusal category or a single refusal token, which are prepended to the model's responses during training.
- Score: 67.6909704128702
- License:
- Abstract: A key component of building safe and reliable language models is enabling the models to appropriately refuse to follow certain instructions or answer certain questions. We may want models to output refusal messages for various categories of user queries, for example, ill-posed questions, instructions for committing illegal acts, or queries which require information past the model's knowledge horizon. Engineering models that refuse to answer such questions is complicated by the fact that an individual may want their model to exhibit varying levels of sensitivity for refusing queries of various categories, and different users may want different refusal rates. The current default approach involves training multiple models with varying proportions of refusal messages from each category to achieve the desired refusal rates, which is computationally expensive and may require training a new model to accommodate each user's desired preference over refusal rates. To address these challenges, we propose refusal tokens, either one token per refusal category or a single refusal token, which are prepended to the model's responses during training. We then show how to increase or decrease the probability of generating the refusal token for each category during inference to steer the model's refusal behavior. Refusal tokens enable controlling a single model's refusal rates without any further fine-tuning, simply by selectively intervening during generation.
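To make the mechanism concrete, here is a minimal sketch of inference-time steering with a single refusal token, assuming a HuggingFace-style causal LM whose vocabulary was extended with a hypothetical `[REFUSE]` special token during fine-tuning (the paper's exact token names, decoding setup, and refusal message may differ):
```python
import torch

def generate_with_refusal_bias(model, tokenizer, prompt, refusal_bias=0.0):
    """Steer refusal behavior by biasing the refusal token's logit at the
    first decoding step: refusal_bias > 0 raises the refusal rate,
    refusal_bias < 0 lowers it. No further fine-tuning is needed."""
    refuse_id = tokenizer.convert_tokens_to_ids("[REFUSE]")  # hypothetical token
    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    logits[refuse_id] += refusal_bias  # the only intervention

    if int(torch.argmax(logits)) == refuse_id:
        return "I'm sorry, but I can't help with that."
    out = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```
With one token per refusal category, the same bias can be applied separately per category, so one model can serve different user-chosen refusal rates.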
Related papers
- Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation [29.605302471407537]
Training a language model to be both helpful and harmless requires careful calibration of refusal behaviours.
We propose a simple and surgical method for mitigating false refusal in language models via single vector ablation.
Our approach is training-free and model-agnostic, making it useful for mitigating the problem of false refusal in current and future language models.
arXiv Detail & Related papers (2024-10-04T13:25:32Z)
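As a rough illustration of the single-vector idea above (not the authors' exact procedure), one can project a "false refusal" direction out of a layer's activations; the direction v would be estimated offline, e.g. from mean activation differences between refused and complied-with prompts:
```python
import torch

def ablate_direction(hidden: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Remove the component of each hidden state along direction v,
    leaving all orthogonal components untouched (training-free)."""
    v = v / v.norm()
    return hidden - (hidden @ v).unsqueeze(-1) * v

# Hypothetical usage: apply as a forward hook on a chosen transformer layer,
# so every hidden state has the refusal direction projected out.
```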
- The Art of Saying No: Contextual Noncompliance in Language Models [123.383993700586]
We introduce a comprehensive taxonomy of contextual noncompliance describing when and how models should not comply with user requests.
Our taxonomy spans a wide range of categories including incomplete, unsupported, indeterminate, and humanizing requests.
To test noncompliance capabilities of language models, we use this taxonomy to develop a new evaluation suite of 1000 noncompliance prompts.
arXiv Detail & Related papers (2024-07-02T07:12:51Z)
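A toy sketch of how such a taxonomy might be wired into a request router; the category names follow the summary above, but the exact labels and granularity are the paper's, not this code's:
```python
from enum import Enum
from typing import Optional

class Noncompliance(Enum):
    """Illustrative subset of the taxonomy's top-level categories."""
    INCOMPLETE = "incomplete"        # underspecified or ill-formed requests
    UNSUPPORTED = "unsupported"      # beyond the model's capabilities
    INDETERMINATE = "indeterminate"  # no single verifiable answer exists
    HUMANIZING = "humanizing"        # treats the model as a person

def should_comply(category: Optional[Noncompliance]) -> bool:
    """Comply only when no noncompliance category applies."""
    return category is None
```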
- Earning Extra Performance from Restrictive Feedbacks [41.05874087063763]
We set up a challenge named Earning eXtra PerformancE from restriCTive feEDbacks (EXPECTED) to describe this form of model tuning problem.
The goal of the model provider is to eventually deliver a satisfactory model to the local user(s) by utilizing the feedbacks.
We propose to characterize the geometry of model performance with respect to the model parameters by exploring the parameters' distribution.
arXiv Detail & Related papers (2023-04-28T13:16:54Z)
- Quark: Controllable Text Generation with Reinforced Unlearning [68.07749519374089]
Large-scale language models often learn behaviors that are misaligned with user expectations.
We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property.
For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods.
arXiv Detail & Related papers (2022-05-26T21:11:51Z)
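A minimal sketch of reward-quantile conditioning in the spirit of Quark, assuming a hypothetical `<rk_k>` control-token format; Quark itself alternates sampling, quantization, and conditional training, which this sketch does not reproduce:
```python
import numpy as np

def reward_token(reward: float, edges: np.ndarray) -> str:
    """Map a scalar reward to one of K quantile bins, each bin
    represented by a control token prepended during training."""
    return f"<rk_{int(np.digitize(reward, edges))}>"

rewards = np.array([0.1, 0.7, 0.3, 0.9])          # e.g. 1 - toxicity score
edges = np.quantile(rewards, [0.25, 0.5, 0.75])   # K = 4 bins
texts = ["t0", "t1", "t2", "t3"]
train_examples = [f"{reward_token(r, edges)} {t}"
                  for r, t in zip(rewards, texts)]
# At inference, prepend the highest bin's token (<rk_3>) to steer the
# model toward high-reward (e.g. non-toxic) generations.
```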
- Interpreting Language Models with Contrastive Explanations [99.7035899290924]
Language models must consider various features to predict a token, such as its part of speech, number, tense, or semantics.
Existing explanation methods conflate evidence for all these features into a single explanation, which is less interpretable for human understanding.
We show that contrastive explanations are quantifiably better than non-contrastive explanations in verifying major grammatical phenomena.
arXiv Detail & Related papers (2022-02-21T18:32:24Z)
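A sketch of one way to compute a contrastive explanation, assuming a HuggingFace-style causal LM that accepts `inputs_embeds`; the attribution asks why the model predicted the target token rather than a foil token, instead of attributing the target logit alone:
```python
import torch

def contrastive_saliency(model, embeds, target_id, foil_id):
    """Gradient-based attribution of the contrast
    logit(target) - logit(foil) onto the input embeddings.
    embeds: (1, seq_len, hidden_dim) input embeddings."""
    embeds = embeds.detach().clone().requires_grad_(True)
    logits = model(inputs_embeds=embeds).logits[0, -1]
    (logits[target_id] - logits[foil_id]).backward()
    return embeds.grad.norm(dim=-1)  # one score per input position
```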
- Explaining Reject Options of Learning Vector Quantization Classifiers [6.125017875330933]
We propose to use counterfactual explanations for explaining rejects in machine learning models.
We investigate how to efficiently compute counterfactual explanations of different reject options for an important class of models.
arXiv Detail & Related papers (2022-02-15T08:16:10Z)
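As a simplified illustration (a plain nearest-prototype classifier with a distance-based reject option, standing in for the paper's LVQ models), a counterfactual explanation of a reject is the smallest shift that makes the input accepted:
```python
import numpy as np

def counterfactual_for_reject(x, prototypes, tau):
    """If x is rejected (distance to every prototype > tau), return the
    closest accepted point: x moved toward its nearest prototype until
    the distance equals tau."""
    dists = np.linalg.norm(prototypes - x, axis=1)
    j = int(np.argmin(dists))
    if dists[j] <= tau:
        return x                        # accepted; no reject to explain
    alpha = 1.0 - tau / dists[j]        # fraction of the way to prototype j
    return x + alpha * (prototypes[j] - x)
```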
- Noisy Channel Language Model Prompting for Few-Shot Text Classification [87.23056864536613]
We introduce a noisy channel approach for language model prompting in few-shot text classification.
Instead of computing the likelihood of the label given the input, channel models compute the conditional probability of the input given the label.
We use channel models for recently proposed few-shot learning methods with no or very limited updates to the language model parameters.
arXiv Detail & Related papers (2021-08-09T15:06:26Z)
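A minimal sketch of channel scoring with a HuggingFace-style causal LM: each verbalized label is scored by the log-probability of the input text given the label, and the prediction is the argmax over labels (the prompt format here is illustrative, not the paper's):
```python
import torch

def channel_score(model, tokenizer, text, label):
    """log P(text | label) under the LM, summed over the text's tokens."""
    prompt = tokenizer(label + ": ", return_tensors="pt").input_ids
    cont = tokenizer(text, return_tensors="pt").input_ids
    ids = torch.cat([prompt, cont], dim=1)
    with torch.no_grad():
        # Logits at position i predict the token at position i + 1.
        logp = torch.log_softmax(model(ids).logits[0, :-1], dim=-1)
    rows = torch.arange(prompt.shape[1] - 1, ids.shape[1] - 1)
    return logp[rows, ids[0, prompt.shape[1]:]].sum().item()

# Prediction: max(labels, key=lambda y: channel_score(model, tok, x, y))
```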
- Selecting Diverse Models for Scientific Insight [0.12891210250935145]
We show how different penalty settings can promote either shrinkage or sparsity of coefficients in separate models.
A choice of penalty form that enforces variable selection is applied to predict stacking fault energy from steel alloy composition.
arXiv Detail & Related papers (2020-06-16T14:06:55Z)
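A small sketch of the contrast on synthetic data (not the paper's steel-alloy dataset): an L1 penalty performs variable selection by zeroing coefficients, while an L2 penalty only shrinks them toward zero:
```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                 # stand-in composition features
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)             # L1 penalty: sparsity
ridge = Ridge(alpha=1.0).fit(X, y)             # L2 penalty: shrinkage

print("lasso nonzero coefs:", np.count_nonzero(lasso.coef_))   # a few
print("ridge nonzero coefs:", np.count_nonzero(ridge.coef_))   # all ten
```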
- ManyModalQA: Modality Disambiguation and QA over Diverse Inputs [73.93607719921945]
We present a new multimodal question answering challenge, ManyModalQA, in which an agent must answer a question by considering three distinct modalities.
We collect our data by scraping Wikipedia and then utilize crowdsourcing to collect question-answer pairs.
arXiv Detail & Related papers (2020-01-22T14:39:28Z)