Rethinking harmless refusals when fine-tuning foundation models
- URL: http://arxiv.org/abs/2406.19552v1
- Date: Thu, 27 Jun 2024 22:08:22 GMT
- Title: Rethinking harmless refusals when fine-tuning foundation models
- Authors: Florin Pop, Judd Rosenblatt, Diogo Schwerz de Lucena, Michael Vaiana,
- Abstract summary: We investigate the degree to which fine-tuning in Large Language Models (LLMs) effectively mitigates versus merely conceals undesirable behavior.
We identify a pervasive phenomenon we term emphreason-based deception, where models either stop producing reasoning traces or produce seemingly ethical reasoning traces that belie the unethical nature of their final outputs.
- Score: 0.8571111167616167
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we investigate the degree to which fine-tuning in Large Language Models (LLMs) effectively mitigates versus merely conceals undesirable behavior. Through the lens of semi-realistic role-playing exercises designed to elicit such behaviors, we explore the response dynamics of LLMs post fine-tuning interventions. Our methodology involves prompting models for Chain-of-Thought (CoT) reasoning and analyzing the coherence between the reasoning traces and the resultant outputs. Notably, we identify a pervasive phenomenon we term \emph{reason-based deception}, where models either stop producing reasoning traces or produce seemingly ethical reasoning traces that belie the unethical nature of their final outputs. We further examine the efficacy of response strategies (polite refusal versus explicit rebuttal) in curbing the occurrence of undesired behavior in subsequent outputs of multi-turn interactions. Our findings reveal that explicit rebuttals significantly outperform polite refusals in preventing the continuation of undesired outputs and nearly eliminate reason-based deception, challenging current practices in model fine-tuning. Accordingly, the two key contributions of this paper are (1) defining and studying reason-based deception, a new type of hidden behavior, and (2) demonstrating that rebuttals provide a more robust response model to harmful requests than refusals, thereby highlighting the need to reconsider the response strategies in fine-tuning approaches.
Related papers
- Preemptive Answer "Attacks" on Chain-of-Thought Reasoning [7.233752893356647]
Large language models (LLMs) showcase impressive reasoning capabilities when coupled with Chain-of-Thought prompting.
In this paper, we introduce a novel scenario termed preemptive answers, where the LLM obtains an answer before engaging in reasoning.
Experiments reveal that preemptive answers significantly impair the model's reasoning capability across various CoT methods and a broad spectrum of datasets.
arXiv Detail & Related papers (2024-05-31T15:15:04Z) - Estimating the Causal Effects of Natural Logic Features in Transformer-Based NLI Models [16.328341121232484]
We apply causal effect estimation strategies to measure the effect of context interventions.
We investigate robustness to irrelevant changes and sensitivity to impactful changes of Transformers.
arXiv Detail & Related papers (2024-04-03T10:22:35Z) - Limitations of Agents Simulated by Predictive Models [1.6649383443094403]
We outline two structural reasons for why predictive models can fail when turned into agents.
We show that both of those failures are fixed by including a feedback loop from the environment.
Our treatment provides a unifying view of those failure modes, and informs the question of why fine-tuning offline learned policies with online learning makes them more effective.
arXiv Detail & Related papers (2024-02-08T17:08:08Z) - Interpretable Imitation Learning with Dynamic Causal Relations [65.18456572421702]
We propose to expose captured knowledge in the form of a directed acyclic causal graph.
We also design this causal discovery process to be state-dependent, enabling it to model the dynamics in latent causal graphs.
The proposed framework is composed of three parts: a dynamic causal discovery module, a causality encoding module, and a prediction module, and is trained in an end-to-end manner.
arXiv Detail & Related papers (2023-09-30T20:59:42Z) - Advancing Counterfactual Inference through Nonlinear Quantile Regression [77.28323341329461]
We propose a framework for efficient and effective counterfactual inference implemented with neural networks.
The proposed approach enhances the capacity to generalize estimated counterfactual outcomes to unseen data.
Empirical results conducted on multiple datasets offer compelling support for our theoretical assertions.
arXiv Detail & Related papers (2023-06-09T08:30:51Z) - Analyzing Semantic Faithfulness of Language Models via Input
Intervention on Question Answering [4.799822253865053]
We formalize a notion of semantic faithfulness, in which the semantic content of a text should causally figure in a model's inferences in question answering.
We show that transformer models fail to be semantically faithful once we perform two semantic interventions: deletion intervention and negation intervention.
We propose an intervention-based training regime that can mitigate the undesirable effects for deletion intervention by a significant margin.
arXiv Detail & Related papers (2022-12-21T00:00:01Z) - Fairness Increases Adversarial Vulnerability [50.90773979394264]
This paper shows the existence of a dichotomy between fairness and robustness, and analyzes when achieving fairness decreases the model robustness to adversarial samples.
Experiments on non-linear models and different architectures validate the theoretical findings in multiple vision domains.
The paper proposes a simple, yet effective, solution to construct models achieving good tradeoffs between fairness and robustness.
arXiv Detail & Related papers (2022-11-21T19:55:35Z) - Logical Satisfiability of Counterfactuals for Faithful Explanations in
NLI [60.142926537264714]
We introduce the methodology of Faithfulness-through-Counterfactuals.
It generates a counterfactual hypothesis based on the logical predicates expressed in the explanation.
It then evaluates if the model's prediction on the counterfactual is consistent with that expressed logic.
arXiv Detail & Related papers (2022-05-25T03:40:59Z) - Towards Robust and Adaptive Motion Forecasting: A Causal Representation
Perspective [72.55093886515824]
We introduce a causal formalism of motion forecasting, which casts the problem as a dynamic process with three groups of latent variables.
We devise a modular architecture that factorizes the representations of invariant mechanisms and style confounders to approximate a causal graph.
Experiment results on synthetic and real datasets show that our three proposed components significantly improve the robustness and reusability of the learned motion representations.
arXiv Detail & Related papers (2021-11-29T18:59:09Z) - Towards Interpretable Reasoning over Paragraph Effects in Situation [126.65672196760345]
We focus on the task of reasoning over paragraph effects in situation, which requires a model to understand the cause and effect.
We propose a sequential approach for this task which explicitly models each step of the reasoning process with neural network modules.
In particular, five reasoning modules are designed and learned in an end-to-end manner, which leads to a more interpretable model.
arXiv Detail & Related papers (2020-10-03T04:03:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.