Related papers: On the Safety of Open-Sourced Large Language Models: Does Alignment Really Prevent Them From Being Misused?

On the Safety of Open-Sourced Large Language Models: Does Alignment Really Prevent Them From Being Misused?

URL: http://arxiv.org/abs/2310.01581v1
Date: Mon, 2 Oct 2023 19:22:01 GMT
Title: On the Safety of Open-Sourced Large Language Models: Does Alignment Really Prevent Them From Being Misused?
Authors: Hangfan Zhang, Zhimeng Guo, Huaisheng Zhu, Bochuan Cao, Lu Lin, Jinyuan Jia, Jinghui Chen, Dinghao Wu
Abstract summary: We show that open-sourced, aligned large language models could be easily misguided to generate undesired content. Our key idea is to directly manipulate the generation process of open-sourced LLMs to misguide it to generate undesired content.
Score: 49.99955642001019
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have achieved unprecedented performance in Natural Language Generation (NLG) tasks. However, many existing studies have shown that they could be misused to generate undesired content. In response, before releasing LLMs for public access, model developers usually align those language models through Supervised Fine-Tuning (SFT) or Reinforcement Learning with Human Feedback (RLHF). Consequently, those aligned large language models refuse to generate undesired content when facing potentially harmful/unethical requests. A natural question is "could alignment really prevent those open-sourced large language models from being misused to generate undesired content?''. In this work, we provide a negative answer to this question. In particular, we show those open-sourced, aligned large language models could be easily misguided to generate undesired content without heavy computations or careful prompt designs. Our key idea is to directly manipulate the generation process of open-sourced LLMs to misguide it to generate undesired content including harmful or biased information and even private data. We evaluate our method on 4 open-sourced LLMs accessible publicly and our finding highlights the need for more advanced mitigation strategies for open-sourced LLMs.

Related papers

Enhancing Code Generation for Low-Resource Languages: No Silver Bullet [55.39571645315926]
Large Language Models (LLMs) rely on large and diverse datasets to learn syntax, semantics, and usage patterns of programming languages. For low-resource languages, the limited availability of such data hampers the models' ability to generalize effectively. We present an empirical study investigating the effectiveness of several approaches for boosting LLMs' performance on low-resource languages.
arXiv Detail & Related papers (2025-01-31T12:23:28Z)
Can adversarial attacks by large language models be attributed? [1.3812010983144802]
Attributing outputs from Large Language Models in adversarial settings presents significant challenges that are likely to grow in importance. We investigate this attribution problem using formal language theory, specifically language identification in the limit as introduced by Gold and extended by Angluin. Our results show that due to the non-identifiability of certain language classes it is theoretically impossible to attribute outputs to specific LLMs with certainty.
arXiv Detail & Related papers (2024-11-12T18:28:57Z)
Leveraging Open-Source Large Language Models for Native Language Identification [1.6267479602370543]
Native Language Identification (NLI) has applications in forensics, marketing, and second language acquisition. This study explores the potential of using open-source generative large language models (LLMs) for NLI.
arXiv Detail & Related papers (2024-09-15T08:14:18Z)
Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models [59.970391602080205]
This study investigates whether such constraints on generation space impact LLMs abilities, including reasoning and domain knowledge comprehension. We evaluate LLMs performance when restricted to adhere to structured formats versus generating free-form responses across various common tasks. We find that stricter format constraints generally lead to greater performance degradation in reasoning tasks.
arXiv Detail & Related papers (2024-08-05T13:08:24Z)
Developing Safe and Responsible Large Language Model : Can We Balance Bias Reduction and Language Understanding in Large Language Models? [2.089112028396727]
This study explores whether Large Language Models can produce safe, unbiased outputs without sacrificing knowledge or comprehension. We introduce the Safe and Responsible Large Language Model (textbfSR$_textLLM$), which has been instruction fine-tuned atop an inherently safe fine-tuned LLM. Experiments reveal that textbfSR$_textLLM$ effectively reduces biases while preserving knowledge integrity.
arXiv Detail & Related papers (2024-04-01T18:10:05Z)
Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space [19.426618259383126]
We propose the embedding space attack, which directly attacks the continuous embedding representation of input tokens. We show that embedding space attacks circumvent model alignments and trigger harmful behaviors more efficiently than discrete attacks or model fine-tuning. Our findings highlight embedding space attacks as an important threat model in open-source LLMs.
arXiv Detail & Related papers (2024-02-14T10:20:03Z)
Supervised Knowledge Makes Large Language Models Better In-context Learners [94.89301696512776]
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering. The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored. We propose a framework that enhances the reliability of LLMs as it: 1) generalizes out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks.
arXiv Detail & Related papers (2023-12-26T07:24:46Z)
Adapting Large Language Models for Content Moderation: Pitfalls in Data Engineering and Supervised Fine-tuning [79.53130089003986]
Large Language Models (LLMs) have become a feasible solution for handling tasks in various domains. In this paper, we introduce how to fine-tune a LLM model that can be privately deployed for content moderation.
arXiv Detail & Related papers (2023-10-05T09:09:44Z)
Universal and Transferable Adversarial Attacks on Aligned Language Models [118.41733208825278]
We propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable.
arXiv Detail & Related papers (2023-07-27T17:49:12Z)
Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing few exemplars. We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English. Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.