Related papers: Stress-Testing Capability Elicitation With Password-Locked Models

Stress-Testing Capability Elicitation With Password-Locked Models

URL: http://arxiv.org/abs/2405.19550v1
Date: Wed, 29 May 2024 22:26:26 GMT
Title: Stress-Testing Capability Elicitation With Password-Locked Models
Authors: Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov, David Krueger,
Abstract summary: We investigate the conditions under which fine-tuning-based elicitation suffices to elicit capabilities. We find that a few high-quality demonstrations are often sufficient to fully elicit password-locked capabilities. When only evaluations, and not demonstrations, are available, approaches like reinforcement learning are still often able to elicit capabilities.
Score: 6.6380867311877605
License: http://creativecommons.org/licenses/by/4.0/
Abstract: To determine the safety of large language models (LLMs), AI developers must be able to assess their dangerous capabilities. But simple prompting strategies often fail to elicit an LLM's full capabilities. One way to elicit capabilities more robustly is to fine-tune the LLM to complete the task. In this paper, we investigate the conditions under which fine-tuning-based elicitation suffices to elicit capabilities. To do this, we introduce password-locked models, LLMs fine-tuned such that some of their capabilities are deliberately hidden. Specifically, these LLMs are trained to exhibit these capabilities only when a password is present in the prompt, and to imitate a much weaker LLM otherwise. Password-locked models enable a novel method of evaluating capabilities elicitation methods, by testing whether these password-locked capabilities can be elicited without using the password. We find that a few high-quality demonstrations are often sufficient to fully elicit password-locked capabilities. More surprisingly, fine-tuning can elicit other capabilities that have been locked using the same password, or even different passwords. Furthermore, when only evaluations, and not demonstrations, are available, approaches like reinforcement learning are still often able to elicit capabilities. Overall, our findings suggest that fine-tuning is an effective method of eliciting hidden capabilities of current models, but may be unreliable when high-quality demonstrations are not available, e.g. as may be the case when models' (hidden) capabilities exceed those of human demonstrators.

Related papers

Maybe I Should Not Answer That, but... Do LLMs Understand The Safety of Their Inputs? [0.836362570897926]
We investigate existing methods for such generalization and find them insufficient. To avoid performance degradation and preserve safe performance, we advocate for a two-step framework. We find that the final hidden state for the last token is enough to provide robust performance.
arXiv Detail & Related papers (2025-02-22T10:31:50Z)
The Elicitation Game: Evaluating Capability Elicitation Techniques [1.064108398661507]
We evaluate the effectiveness of capability elicitation techniques by intentionally training model organisms. We introduce a novel method for training model organisms, based on circuit breaking. For a code-generation task, only fine-tuning can elicit the hidden capabilities of our novel model organism.
arXiv Detail & Related papers (2025-02-04T09:54:24Z)
What You See Is Not Always What You Get: An Empirical Study of Code Comprehension by Large Language Models [0.5735035463793009]
We investigate the vulnerability of large language models (LLMs) to imperceptible attacks, where hidden character manipulation in source code misleads LLMs' behaviour while remaining undetectable to human reviewers. These attacks include coding reordering, invisible coding characters, code deletions, and code homoglyphs. Our findings confirm the susceptibility of LLMs to imperceptible coding character attacks, while different LLMs present different negative correlations between perturbation magnitude and performance.
arXiv Detail & Related papers (2024-12-11T04:52:41Z)
Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models [8.024771725860127]
Large Language Models (LLMs) remain vulnerable to jailbreak attacks that bypass their safety mechanisms. We introduce a novel scalable jailbreak attack that preempts the activation of an LLM's safety policies by occupying its computational resources.
arXiv Detail & Related papers (2024-10-05T15:10:01Z)
Zero-to-Strong Generalization: Eliciting Strong Capabilities of Large Language Models Iteratively without Gold Labels [75.77877889764073]
Large Language Models (LLMs) have demonstrated remarkable performance through supervised fine-tuning or in-context learning using gold labels. This study explores whether solely utilizing unlabeled data can elicit strong model capabilities. We propose a new paradigm termed zero-to-strong generalization.
arXiv Detail & Related papers (2024-09-19T02:59:44Z)
Can Reinforcement Learning Unlock the Hidden Dangers in Aligned Large Language Models? [3.258629327038072]
Large Language Models (LLMs) have demonstrated impressive capabilities in natural language tasks. Yet, the potential for generating harmful content through these models seems to persist. This paper explores the concept of jailbreaking LLMs-reversing their alignment through adversarial triggers.
arXiv Detail & Related papers (2024-08-05T17:27:29Z)
Enhancing the Capability and Robustness of Large Language Models through Reinforcement Learning-Driven Query Refinement [32.888016435098045]
The capacity of large language models (LLMs) to generate honest, harmless, and helpful responses heavily relies on the quality of user prompts. This study proposes a transferable and pluggable framework that refines user prompts before they are input into LLMs. This strategy improves the quality of the queries, empowering LLMs to generate more truthful, benign and useful responses.
arXiv Detail & Related papers (2024-07-01T16:55:28Z)
AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models [95.09157454599605]
Large Language Models (LLMs) are becoming increasingly powerful, but they still exhibit significant but subtle weaknesses. Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies. We introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks.
arXiv Detail & Related papers (2024-06-24T15:16:45Z)
Rethinking Jailbreaking through the Lens of Representation Engineering [45.70565305714579]
The recent surge in jailbreaking methods has revealed the vulnerability of Large Language Models (LLMs) to malicious inputs. This study investigates the vulnerability of safety-aligned LLMs by uncovering specific activity patterns.
arXiv Detail & Related papers (2024-01-12T00:50:04Z)
A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly [21.536079040559517]
Large Language Models (LLMs) have revolutionized natural language understanding and generation. This paper explores the intersection of LLMs with security and privacy.
arXiv Detail & Related papers (2023-12-04T16:25:18Z)
Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools. Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions. Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
PassGPT: Password Modeling and (Guided) Generation with Large Language Models [59.11160990637616]
We present PassGPT, a large language model trained on password leaks for password generation. We also introduce the concept of guided password generation, where we leverage PassGPT sampling procedure to generate passwords matching arbitrary constraints.
arXiv Detail & Related papers (2023-06-02T13:49:53Z)
Red Teaming Language Model Detectors with Language Models [114.36392560711022]
Large language models (LLMs) present significant safety and ethical risks if exploited by malicious users. Recent works have proposed algorithms to detect LLM-generated text and protect LLMs. We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation.
arXiv Detail & Related papers (2023-05-31T10:08:37Z)
Self-Prompting Large Language Models for Zero-Shot Open-Domain QA [67.08732962244301]
Open-Domain Question Answering (ODQA) aims to answer questions without explicitly providing background documents. This task becomes notably challenging in a zero-shot setting where no data is available to train tailored retrieval-reader models. We propose a Self-Prompting framework to explicitly utilize the massive knowledge encoded in the parameters of Large Language Models.
arXiv Detail & Related papers (2022-12-16T18:23:43Z)
PromptAttack: Prompt-based Attack for Language Models via Gradient Search [24.42194796252163]
We observe that the prompt learning methods are vulnerable and can easily be attacked by some illegally constructed prompts. In this paper, we propose a malicious prompt template construction method (textbfPromptAttack) to probe the security performance of PLMs.
arXiv Detail & Related papers (2022-09-05T10:28:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.