Refusal Tokens: A Simple Way to Calibrate Refusals in Large Language Models
- URL: http://arxiv.org/abs/2412.06748v1
- Date: Mon, 09 Dec 2024 18:40:44 GMT
- Title: Refusal Tokens: A Simple Way to Calibrate Refusals in Large Language Models
- Authors: Neel Jain, Aditya Shrivastava, Chenyang Zhu, Daben Liu, Alfy Samuel, Ashwinee Panda, Anoop Kumar, Micah Goldblum, Tom Goldstein
- Abstract summary: A key component of building safe and reliable language models is enabling the models to appropriately refuse to answer certain questions.
We propose refusal tokens, either one token per refusal category or a single refusal token, which are prepended to the model's responses during training.
- Score: 67.6909704128702
- License:
- Abstract: A key component of building safe and reliable language models is enabling the models to appropriately refuse to follow certain instructions or answer certain questions. We may want models to output refusal messages for various categories of user queries, for example, ill-posed questions, instructions for committing illegal acts, or queries which require information past the model's knowledge horizon. Engineering models that refuse to answer such questions is complicated by the fact that an individual may want their model to exhibit varying levels of sensitivity for refusing queries of various categories, and different users may want different refusal rates. The current default approach involves training multiple models with varying proportions of refusal messages from each category to achieve the desired refusal rates, which is computationally expensive and may require training a new model to accommodate each user's desired preference over refusal rates. To address these challenges, we propose refusal tokens, either one token per refusal category or a single refusal token, which are prepended to the model's responses during training. We then show how to increase or decrease the probability of generating the refusal token for each category during inference to steer the model's refusal behavior. Refusal tokens enable controlling a single model's refusal rates without any further fine-tuning, simply by selectively intervening during generation.
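To make the mechanism concrete, here is a minimal sketch of inference-time steering with a single refusal token, assuming a HuggingFace-style causal LM whose vocabulary was extended with a hypothetical `[REFUSE]` special token during fine-tuning (the paper's exact token names, decoding setup, and refusal message may differ):
```python
import torch

def generate_with_refusal_bias(model, tokenizer, prompt, refusal_bias=0.0):
    """Steer refusal behavior by biasing the refusal token's logit at the
    first decoding step: refusal_bias > 0 raises the refusal rate,
    refusal_bias < 0 lowers it. No further fine-tuning is needed."""
    refuse_id = tokenizer.convert_tokens_to_ids("[REFUSE]")  # hypothetical token
    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    logits[refuse_id] += refusal_bias  # the only intervention

    if int(torch.argmax(logits)) == refuse_id:
        return "I'm sorry, but I can't help with that."
    out = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```
With one token per refusal category, the same bias can be applied separately per category, so one model can serve different user-chosen refusal rates.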
Related papers
- Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation [29.605302471407537]
Training a language model to be both helpful and harmless requires careful calibration of refusal behaviours.
We propose a simple and surgical method for mitigating false refusal in language models via single vector ablation.
Our approach is training-free and model-agnostic, making it useful for mitigating the problem of false refusal in current and future language models.
arXiv Detail & Related papers (2024-10-04T13:25:32Z)
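As a rough illustration of the single-vector idea above (not the authors' exact procedure), one can project a "false refusal" direction out of a layer's activations; the direction v would be estimated offline, e.g. from mean activation differences between refused and complied-with prompts:
```python
import torch

def ablate_direction(hidden: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Remove the component of each hidden state along direction v,
    leaving all orthogonal components untouched (training-free)."""
    v = v / v.norm()
    return hidden - (hidden @ v).unsqueeze(-1) * v

# Hypothetical usage: apply as a forward hook on a chosen transformer layer,
# so every hidden state has the refusal direction projected out.
```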
- The Art of Saying No: Contextual Noncompliance in Language Models [123.383993700586]
We introduce a comprehensive taxonomy of contextual noncompliance describing when and how models should not comply with user requests.
Our taxonomy spans a wide range of categories including incomplete, unsupported, indeterminate, and humanizing requests.
To test noncompliance capabilities of language models, we use this taxonomy to develop a new evaluation suite of 1000 noncompliance prompts.
arXiv Detail & Related papers (2024-07-02T07:12:51Z)
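A toy sketch of how such a taxonomy might be wired into a request router; the category names follow the summary above, but the exact labels and granularity are the paper's, not this code's:
```python
from enum import Enum
from typing import Optional

class Noncompliance(Enum):
    """Illustrative subset of the taxonomy's top-level categories."""
    INCOMPLETE = "incomplete"        # underspecified or ill-formed requests
    UNSUPPORTED = "unsupported"      # beyond the model's capabilities
    INDETERMINATE = "indeterminate"  # no single verifiable answer exists
    HUMANIZING = "humanizing"        # treats the model as a person

def should_comply(category: Optional[Noncompliance]) -> bool:
    """Comply only when no noncompliance category applies."""
    return category is None
```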
- Earning Extra Performance from Restrictive Feedbacks [41.05874087063763]
We set up a challenge named Earning eXtra PerformancE from restriCTive feEDbacks (EXPECTED) to describe this form of model tuning problem.
The goal of the model provider is to eventually deliver a satisfactory model to the local user(s) by utilizing the feedbacks.
We propose to characterize the geometry of model performance with respect to the model parameters by exploring the parameters' distribution.
arXiv Detail & Related papers (2023-04-28T13:16:54Z)
- Quark: Controllable Text Generation with Reinforced Unlearning [68.07749519374089]
Large-scale language models often learn behaviors that are misaligned with user expectations.
We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property.
For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods.
arXiv Detail & Related papers (2022-05-26T21:11:51Z)
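A minimal sketch of reward-quantile conditioning in the spirit of Quark, assuming a hypothetical `<rk_k>` control-token format; Quark itself alternates sampling, quantization, and conditional training, which this sketch does not reproduce:
```python
import numpy as np

def reward_token(reward: float, edges: np.ndarray) -> str:
    """Map a scalar reward to one of K quantile bins, each bin
    represented by a control token prepended during training."""
    return f"<rk_{int(np.digitize(reward, edges))}>"

rewards = np.array([0.1, 0.7, 0.3, 0.9])          # e.g. 1 - toxicity score
edges = np.quantile(rewards, [0.25, 0.5, 0.75])   # K = 4 bins
texts = ["t0", "t1", "t2", "t3"]
train_examples = [f"{reward_token(r, edges)} {t}"
                  for r, t in zip(rewards, texts)]
# At inference, prepend the highest bin's token (<rk_3>) to steer the
# model toward high-reward (e.g. non-toxic) generations.
```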
- Interpreting Language Models with Contrastive Explanations [99.7035899290924]
Language models must consider various features to predict a token, such as its part of speech, number, tense, or semantics.
Existing explanation methods conflate evidence for all these features into a single explanation, which is less interpretable for human understanding.
We show that contrastive explanations are quantifiably better than non-contrastive explanations in verifying major grammatical phenomena.
arXiv Detail & Related papers (2022-02-21T18:32:24Z)
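A sketch of one way to compute a contrastive explanation, assuming a HuggingFace-style causal LM that accepts `inputs_embeds`; the attribution asks why the model predicted the target token rather than a foil token, instead of attributing the target logit alone:
```python
import torch

def contrastive_saliency(model, embeds, target_id, foil_id):
    """Gradient-based attribution of the contrast
    logit(target) - logit(foil) onto the input embeddings.
    embeds: (1, seq_len, hidden_dim) input embeddings."""
    embeds = embeds.detach().clone().requires_grad_(True)
    logits = model(inputs_embeds=embeds).logits[0, -1]
    (logits[target_id] - logits[foil_id]).backward()
    return embeds.grad.norm(dim=-1)  # one score per input position
```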
- Explaining Reject Options of Learning Vector Quantization Classifiers [6.125017875330933]
We propose to use counterfactual explanations for explaining rejects in machine learning models.
We investigate how to efficiently compute counterfactual explanations of different reject options for an important class of models.
arXiv Detail & Related papers (2022-02-15T08:16:10Z)
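As a simplified illustration (a plain nearest-prototype classifier with a distance-based reject option, standing in for the paper's LVQ models), a counterfactual explanation of a reject is the smallest shift that makes the input accepted:
```python
import numpy as np

def counterfactual_for_reject(x, prototypes, tau):
    """If x is rejected (distance to every prototype > tau), return the
    closest accepted point: x moved toward its nearest prototype until
    the distance equals tau."""
    dists = np.linalg.norm(prototypes - x, axis=1)
    j = int(np.argmin(dists))
    if dists[j] <= tau:
        return x                        # accepted; no reject to explain
    alpha = 1.0 - tau / dists[j]        # fraction of the way to prototype j
    return x + alpha * (prototypes[j] - x)
```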
- Noisy Channel Language Model Prompting for Few-Shot Text Classification [87.23056864536613]
We introduce a noisy channel approach for language model prompting in few-shot text classification.
Instead of computing the likelihood of the label given the input, channel models compute the conditional probability of the input given the label.
We use channel models for recently proposed few-shot learning methods with no or very limited updates to the language model parameters.
arXiv Detail & Related papers (2021-08-09T15:06:26Z)
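A minimal sketch of channel scoring with a HuggingFace-style causal LM: each verbalized label is scored by the log-probability of the input text given the label, and the prediction is the argmax over labels (the prompt format here is illustrative, not the paper's):
```python
import torch

def channel_score(model, tokenizer, text, label):
    """log P(text | label) under the LM, summed over the text's tokens."""
    prompt = tokenizer(label + ": ", return_tensors="pt").input_ids
    cont = tokenizer(text, return_tensors="pt").input_ids
    ids = torch.cat([prompt, cont], dim=1)
    with torch.no_grad():
        # Logits at position i predict the token at position i + 1.
        logp = torch.log_softmax(model(ids).logits[0, :-1], dim=-1)
    rows = torch.arange(prompt.shape[1] - 1, ids.shape[1] - 1)
    return logp[rows, ids[0, prompt.shape[1]:]].sum().item()

# Prediction: max(labels, key=lambda y: channel_score(model, tok, x, y))
```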
- Selecting Diverse Models for Scientific Insight [0.12891210250935145]
We show how different penalty settings can promote either shrinkage or sparsity of coefficients in separate models.
A choice of penalty form that enforces variable selection is applied to predict stacking fault energy from steel alloy composition.
arXiv Detail & Related papers (2020-06-16T14:06:55Z)
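A small sketch of the contrast on synthetic data (not the paper's steel-alloy dataset): an L1 penalty performs variable selection by zeroing coefficients, while an L2 penalty only shrinks them toward zero:
```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                 # stand-in composition features
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)             # L1 penalty: sparsity
ridge = Ridge(alpha=1.0).fit(X, y)             # L2 penalty: shrinkage

print("lasso nonzero coefs:", np.count_nonzero(lasso.coef_))   # a few
print("ridge nonzero coefs:", np.count_nonzero(ridge.coef_))   # all ten
```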
- ManyModalQA: Modality Disambiguation and QA over Diverse Inputs [73.93607719921945]
We present a new multimodal question answering challenge, ManyModalQA, in which an agent must answer a question by considering three distinct modalities.
We collect our data by scraping Wikipedia and then utilize crowdsourcing to collect question-answer pairs.
arXiv Detail & Related papers (2020-01-22T14:39:28Z)