Nevermind: Instruction Override and Moderation in Large Language Models
- URL: http://arxiv.org/abs/2402.03303v1
- Date: Mon, 5 Feb 2024 18:58:19 GMT
- Title: Nevermind: Instruction Override and Moderation in Large Language Models
- Authors: Edward Kim
- Abstract summary: We investigate and benchmark the most popular proprietary models and open-source models of various sizes on the task of explicit instruction following in conflicting situations.
We observe that improving instruction following, and consequently instruction overrides/jailbreaks, is fundamentally at odds with a language model's ability to follow given safety filters or guidelines.
- Score: 2.0935496890864207
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Given the impressive capabilities of recent Large Language Models (LLMs), we investigate and benchmark the most popular proprietary models and open-source models of various sizes on the task of explicit instruction following in conflicting situations, e.g. overrides. These include the ability of the model to override the knowledge within the weights of the model, the ability to override (or moderate) extracted knowledge in the prompt, and lastly the ability to perform a full jailbreak. Our experiments suggest several key findings for improving instruction following: larger models perform best at following instructions that override internal and contextual instructions, and are obedient, even to a fault. When scaling to longer contexts via RoPE scaling, a significant buffer needs to be maintained from the edge of the perplexity cliff in order to preserve instruction-following capabilities. Finally, we observe that improving instruction following, and consequently instruction overrides/jailbreaks, is fundamentally at odds with the ability of a language model to follow given safety filters or guidelines. Thus, we postulate that the most effective approach for safe, trustworthy AI should be handled external to the LLM itself.
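Two findings from the abstract lend themselves to a brief illustration: keeping a buffer below the perplexity cliff when extending context via RoPE scaling, and handling safety moderation outside the LLM. The following is a minimal sketch of both ideas; the function names, the 25% buffer fraction, and the keyword filter are illustrative assumptions rather than values or components taken from the paper.

```python
# Illustrative sketch only; names, the buffer fraction, and the keyword filter
# are assumptions for demonstration, not details taken from the paper.
from typing import Callable

def usable_context_length(base_ctx: int, rope_factor: float, buffer_frac: float = 0.25) -> int:
    """Conservative prompt budget when extending context via linear RoPE scaling.

    Reserves a fraction of the scaled window so generation stays clear of the
    perplexity cliff near the extended limit.
    """
    scaled_ctx = int(base_ctx * rope_factor)
    return int(scaled_ctx * (1.0 - buffer_frac))

def moderate(text: str, blocked_terms=("synthesize a nerve agent",)) -> bool:
    """Toy external safety check; a real system would use a dedicated classifier."""
    lowered = text.lower()
    return not any(term in lowered for term in blocked_terms)

def safe_generate(prompt: str, call_llm: Callable[[str], str]) -> str:
    """Moderate the request and the response outside the LLM itself."""
    if not moderate(prompt):
        return "[request blocked by external moderation]"
    response = call_llm(prompt)  # the model is left free to follow instructions
    return response if moderate(response) else "[response withheld by external moderation]"

if __name__ == "__main__":
    print(usable_context_length(4096, 2.0))  # e.g. cap prompts near 6k rather than the full 8k
    echo_model = lambda p: f"echo: {p}"      # stand-in for an actual LLM call
    print(safe_generate("Summarize the findings in one sentence.", echo_model))
```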
Related papers
- Trustworthy Alignment of Retrieval-Augmented Large Language Models via Reinforcement Learning [84.94709351266557]
We focus on the trustworthiness of language models with respect to retrieval augmentation.
We deem that retrieval-augmented language models have the inherent capability of supplying responses according to both contextual and parametric knowledge.
Inspired by aligning language models with human preferences, we take the first step towards aligning retrieval-augmented language models to a state where they respond relying solely on the external evidence.
arXiv Detail & Related papers (2024-10-22T09:25:21Z)
- Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy [53.54777131440989]
Large Language Models (LLMs) are susceptible to security and safety threats.
One major cause of these vulnerabilities is the lack of an instruction hierarchy.
We introduce the Instructional Segment Embedding (ISE) technique, inspired by BERT, to modern large language models.
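As a rough, generic illustration of segment embeddings in the spirit of ISE (and of BERT's token-type embeddings), the sketch below adds a learned per-level embedding to each token embedding so the model can distinguish instruction sources. The dimensions, level names, and random initialization are assumptions for demonstration, not the paper's actual architecture.

```python
import numpy as np

# Generic segment-embedding sketch: a learned embedding per hierarchy level
# (e.g. system, user, retrieved data) is added to each token embedding.
# Sizes, level names, and random init are illustrative assumptions.
rng = np.random.default_rng(0)
vocab_size, d_model, num_levels = 32000, 64, 3   # levels: 0=system, 1=user, 2=data

token_emb = rng.normal(size=(vocab_size, d_model))
segment_emb = rng.normal(size=(num_levels, d_model))

def embed(token_ids: np.ndarray, segment_ids: np.ndarray) -> np.ndarray:
    """Sum token and segment embeddings, as BERT does with token-type embeddings."""
    return token_emb[token_ids] + segment_emb[segment_ids]

# usage: a short system instruction followed by user text, tagged by hierarchy level
tokens = np.array([12, 7, 7, 301, 55])
segments = np.array([0, 0, 0, 1, 1])
print(embed(tokens, segments).shape)  # (5, 64)
```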
arXiv Detail & Related papers (2024-10-09T12:52:41Z)
- Can Reinforcement Learning Unlock the Hidden Dangers in Aligned Large Language Models? [3.258629327038072]
Large Language Models (LLMs) have demonstrated impressive capabilities in natural language tasks.
Yet, the potential for generating harmful content through these models seems to persist.
This paper explores the concept of jailbreaking LLMs: reversing their alignment through adversarial triggers.
arXiv Detail & Related papers (2024-08-05T17:27:29Z)
- Adversarial Contrastive Decoding: Boosting Safety Alignment of Large Language Models via Opposite Prompt Optimization [34.29833630422768]
Adversarial Contrastive Decoding (ACD) is an optimization-based framework to generate two opposite system prompts for prompt-based contrastive decoding.
ACD achieves much better safety performance than previous decoding methods that require no model training, without sacrificing the original generation ability.
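As a rough illustration of prompt-based contrastive decoding in the spirit of ACD (not its exact procedure), the sketch below contrasts next-token logits obtained under a safety-promoting prompt with logits obtained under an opposite, safety-degrading prompt; the weighting `alpha` and the random logits are placeholder assumptions.

```python
import numpy as np

# Generic prompt-based contrastive decoding sketch: boost tokens favored under
# the safe prompt and penalize those favored under the unsafe one. The alpha
# weighting and random logits are illustrative assumptions, not ACD's method.

def contrastive_logits(logits_safe: np.ndarray, logits_unsafe: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Contrast the two next-token distributions at the logit level."""
    return logits_safe - alpha * logits_unsafe

vocab_size = 32000
rng = np.random.default_rng(0)
logits_safe = rng.normal(size=vocab_size)    # stand-in for model(safe_prompt + context)
logits_unsafe = rng.normal(size=vocab_size)  # stand-in for model(unsafe_prompt + context)
next_token = int(np.argmax(contrastive_logits(logits_safe, logits_unsafe)))
print(next_token)
```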
arXiv Detail & Related papers (2024-06-24T15:51:30Z)
- Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations better reveal how comprehensively language models understand the questions they are asked.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
- Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection [70.28425745910711]
Large Language Models (LLMs) have demonstrated exceptional proficiency in instruction-following.
This capability brings with it the risk of prompt injection attacks.
We evaluate the robustness of instruction-following LLMs against such attacks.
arXiv Detail & Related papers (2023-08-17T06:21:50Z)
- Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models [28.37026309925163]
Large language models (LLMs) are designed to align with human values and generate safe text.
Previous benchmarks for jailbreaking LLMs have primarily focused on evaluating the safety of the models.
This paper assesses both the safety and robustness of LLMs, emphasizing the need for a balanced approach.
arXiv Detail & Related papers (2023-07-17T13:49:52Z)
- Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amount of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z)
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.