On the Safety of Open-Sourced Large Language Models: Does Alignment
Really Prevent Them From Being Misused?
- URL: http://arxiv.org/abs/2310.01581v1
- Date: Mon, 2 Oct 2023 19:22:01 GMT
- Title: On the Safety of Open-Sourced Large Language Models: Does Alignment
Really Prevent Them From Being Misused?
- Authors: Hangfan Zhang, Zhimeng Guo, Huaisheng Zhu, Bochuan Cao, Lu Lin,
Jinyuan Jia, Jinghui Chen, Dinghao Wu
- Abstract summary: We show that open-sourced, aligned large language models could be easily misguided to generate undesired content.
Our key idea is to directly manipulate the generation process of open-sourced LLMs to misguide it to generate undesired content.
- Score: 49.99955642001019
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have achieved unprecedented performance in
Natural Language Generation (NLG) tasks. However, many existing studies have
shown that they could be misused to generate undesired content. In response,
before releasing LLMs for public access, model developers usually align those
language models through Supervised Fine-Tuning (SFT) or Reinforcement Learning
with Human Feedback (RLHF). Consequently, those aligned large language models
refuse to generate undesired content when facing potentially harmful/unethical
requests. A natural question is "could alignment really prevent those
open-sourced large language models from being misused to generate undesired
content?''. In this work, we provide a negative answer to this question. In
particular, we show those open-sourced, aligned large language models could be
easily misguided to generate undesired content without heavy computations or
careful prompt designs. Our key idea is to directly manipulate the generation
process of open-sourced LLMs to misguide it to generate undesired content
including harmful or biased information and even private data. We evaluate our
method on 4 open-sourced LLMs accessible publicly and our finding highlights
the need for more advanced mitigation strategies for open-sourced LLMs.
Related papers
- Perplexed: Understanding When Large Language Models are Confused [3.4208414448496027]
This paper introduces perplexed, a library for exploring where a language model is perplexed.
We conducted a case study focused on Large Language Models (LLMs) for code generation using an additional tool we built to help with the analysis of code models called codetokenizer.
We found that our studied code LLMs had their worst performance on coding structures where the code was not syntactically correct.
arXiv Detail & Related papers (2024-04-09T22:03:39Z) - Safe and Responsible Large Language Model : Can We Balance Bias Reduction and Language Understanding in Large Language Models? [2.089112028396727]
Current approaches to produce unbiased outputs from Large Language Models can reduce biases but at the expense of knowledge retention.
We develop the Safety and Responsible Large Language Model (textbfSR$_textLLM$) to diminish biases in generated text.
The results confirm that textbfSR$textLLM$ outperforms traditional fine-tuning and prompting methods in both reducing biases and preserving the integrity of language knowledge.
arXiv Detail & Related papers (2024-04-01T18:10:05Z) - Unmemorization in Large Language Models via Self-Distillation and
Deliberate Imagination [58.36408867180233]
Large Language Models (LLMs) struggle with crucial issues of privacy violation and unwanted exposure of sensitive data.
We introduce a novel approach termed deliberate imagination in the context of LLM unlearning.
Our results demonstrate the usefulness of this approach across different models and sizes, and also with parameter-efficient fine-tuning.
arXiv Detail & Related papers (2024-02-15T16:21:14Z) - Soft Prompt Threats: Attacking Safety Alignment and Unlearning in
Open-Source LLMs through the Embedding Space [19.426618259383126]
We propose the embedding space attack, which directly attacks the continuous embedding representation of input tokens.
We show that embedding space attacks circumvent model alignments and trigger harmful behaviors more efficiently than discrete attacks or model fine-tuning.
Our findings highlight embedding space attacks as an important threat model in open-source LLMs.
arXiv Detail & Related papers (2024-02-14T10:20:03Z) - Supervised Knowledge Makes Large Language Models Better In-context Learners [94.89301696512776]
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering.
The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored.
We propose a framework that enhances the reliability of LLMs as it: 1) generalizes out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks.
arXiv Detail & Related papers (2023-12-26T07:24:46Z) - Make Them Spill the Beans! Coercive Knowledge Extraction from
(Production) LLMs [31.80386572346993]
We exploit the fact that even when an LLM rejects a toxic request, a harmful response often hides deep in the output logits.
This approach differs from and outperforms jail-breaking methods, achieving 92% effectiveness compared to 62%, and is 10 to 20 times faster.
Our findings indicate that interrogation can extract toxic knowledge even from models specifically designed for coding tasks.
arXiv Detail & Related papers (2023-12-08T01:41:36Z) - Adapting Large Language Models for Content Moderation: Pitfalls in Data
Engineering and Supervised Fine-tuning [79.53130089003986]
Large Language Models (LLMs) have become a feasible solution for handling tasks in various domains.
In this paper, we introduce how to fine-tune a LLM model that can be privately deployed for content moderation.
arXiv Detail & Related papers (2023-10-05T09:09:44Z) - Universal and Transferable Adversarial Attacks on Aligned Language
Models [118.41733208825278]
We propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors.
Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable.
arXiv Detail & Related papers (2023-07-27T17:49:12Z) - Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.