Can Editing LLMs Inject Harm?
- URL: http://arxiv.org/abs/2407.20224v3
- Date: Fri, 16 Aug 2024 19:43:03 GMT
- Title: Can Editing LLMs Inject Harm?
- Authors: Canyu Chen, Baixiang Huang, Zekun Li, Zhaorun Chen, Shiyang Lai, Xiongxiao Xu, Jia-Chen Gu, Jindong Gu, Huaxiu Yao, Chaowei Xiao, Xifeng Yan, William Yang Wang, Philip Torr, Dawn Song, Kai Shu
- Abstract summary: We propose to reformulate knowledge editing as a new type of safety threat for Large Language Models.
For the risk of misinformation injection, we first categorize it into commonsense misinformation injection and long-tail misinformation injection.
For the risk of bias injection, we discover that not only can biased sentences be injected into LLMs with high effectiveness, but also one single biased sentence injection can cause a bias increase in the general outputs of LLMs.
- Score: 122.83469484328465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge editing has been increasingly adopted to correct false or outdated knowledge in Large Language Models (LLMs). Meanwhile, one critical but under-explored question is: can knowledge editing be used to inject harm into LLMs? In this paper, we propose to reformulate knowledge editing as a new type of safety threat for LLMs, namely Editing Attack, and conduct a systematic investigation with a newly constructed dataset EditAttack. Specifically, we focus on two typical safety risks of Editing Attack: Misinformation Injection and Bias Injection. For the risk of misinformation injection, we first categorize it into commonsense misinformation injection and long-tail misinformation injection. Then, we find that editing attacks can inject both types of misinformation into LLMs, and the effectiveness is particularly high for commonsense misinformation injection. For the risk of bias injection, we discover not only that biased sentences can be injected into LLMs with high effectiveness, but also that one single biased sentence injection can increase bias in the general outputs of LLMs, even in outputs highly irrelevant to the injected sentence, indicating a catastrophic impact on the overall fairness of LLMs. We further illustrate the high stealthiness of editing attacks, measured by their impact on the general knowledge and reasoning capacities of LLMs, and show the difficulty of defending against editing attacks with empirical evidence. Our discoveries demonstrate the emerging misuse risks of knowledge editing techniques for compromising the safety alignment of LLMs and the feasibility of disseminating misinformation or bias with LLMs as new channels.
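For intuition, below is a minimal, self-contained sketch of the kind of mechanism an editing attack exploits: a closed-form rank-one weight update, in the spirit of ROME-style locate-and-edit methods. The toy layer, vocabulary ids, and margins are all invented for illustration; this is not the paper's EditAttack pipeline or any specific editing library.

```python
# Toy illustration of a rank-one "editing attack" on a single linear layer.
# Assumption: a ROME-style closed-form update; names and sizes are invented.
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, vocab = 16, 8
layer = nn.Linear(d_model, vocab, bias=False)  # stand-in for an LLM sublayer

subject = torch.randn(d_model)   # hidden state for a targeted fact's subject
true_obj, fake_obj = 3, 5        # vocab ids: correct answer vs. injected one

with torch.no_grad():
    target = layer(subject).clone()
    target[true_obj] -= 10.0     # suppress the correct answer
    target[fake_obj] += 10.0     # promote the misinformation
    # Closed-form rank-one update so that layer(subject) == target exactly.
    delta = torch.outer(target - layer(subject), subject) / subject.dot(subject)
    layer.weight += delta        # one cheap, targeted parameter edit

print("post-edit answer id:", layer(subject).argmax().item())  # -> 5 (fake_obj)
```

The takeaway mirrors the stealthiness finding above: a single low-rank update flips the model's answer for the targeted input while leaving the rest of the weights, and hence most unrelated behavior, nearly untouched.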
Related papers
- HiddenGuard: Fine-Grained Safe Generation with Specialized Representation Router [42.222681564769076]
We introduce HiddenGuard, a novel framework for fine-grained, safe generation in Large Language Models.
HiddenGuard incorporates Prism, which operates alongside the LLM to enable real-time, token-level detection and redaction of harmful content.
Our experiments demonstrate that HiddenGuard achieves over 90% in F1 score for detecting and redacting harmful content.
arXiv Detail & Related papers (2024-10-03T17:10:41Z)
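As a rough sketch of the token-level detect-and-redact idea in the HiddenGuard entry above (not the actual Prism router), the snippet below filters a token stream with a hypothetical per-token harm classifier; `harm_score` and the blocklist are invented stand-ins for a learned detector.

```python
# Conceptual sketch of token-level redaction; the classifier is a toy stub.
from typing import List

BLOCKLIST = {"poison", "explosive"}  # invented stand-in for a learned model

def harm_score(token: str) -> float:
    """Hypothetical per-token harmfulness score in [0, 1]."""
    return 1.0 if token.lower() in BLOCKLIST else 0.0

def redact(tokens: List[str], threshold: float = 0.5) -> str:
    """Replace flagged tokens in-stream while keeping benign context."""
    return " ".join("[REDACTED]" if harm_score(t) >= threshold else t
                    for t in tokens)

print(redact("mix the poison into the drink".split()))
# -> mix the [REDACTED] into the drink
```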
- Understanding Privacy Risks of Embeddings Induced by Large Language Models [75.96257812857554]
Large language models show early signs of artificial general intelligence but struggle with hallucinations.
One promising solution is to store external knowledge as embeddings, aiding LLMs in retrieval-augmented generation.
Recent studies experimentally showed that the original text can be partially reconstructed from text embeddings by pre-trained language models.
arXiv Detail & Related papers (2024-04-25T13:10:48Z)
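To make the reconstruction risk above concrete, a toy sketch: an attacker holding only a stored embedding can rank candidate texts by similarity and recover the original sentence. The bag-of-words embedder is an invented stand-in for a real embedding model, not the paper's attack.

```python
# Toy embedding-inversion: nearest-neighbor search over candidate guesses.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Invented stand-in embedder: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

secret = "the patient was diagnosed with diabetes"
stored = embed(secret)  # what a vector store would hold

candidates = [
    "the patient was diagnosed with diabetes",
    "the weather is sunny today",
    "llms can hallucinate facts",
]
print("reconstructed:", max(candidates, key=lambda c: cosine(stored, embed(c))))
```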
- See the Unseen: Better Context-Consistent Knowledge-Editing by Noises [73.54237379082795]
Knowledge-editing updates the knowledge of large language models (LLMs).
Existing works ignore the influence of context on knowledge recall, so their edits lack generalization.
We empirically find that the effects of different contexts upon LLMs in recalling the same knowledge follow a Gaussian-like distribution.
arXiv Detail & Related papers (2024-01-15T09:09:14Z)
- Disinformation Capabilities of Large Language Models [0.564232659769944]
This paper presents a study of the disinformation capabilities of the current generation of large language models (LLMs).
We evaluated the capabilities of 10 LLMs using 20 disinformation narratives.
We conclude that LLMs are able to generate convincing news articles that agree with dangerous disinformation narratives.
arXiv Detail & Related papers (2023-11-15T10:25:30Z)
- Combating Misinformation in the Age of LLMs: Opportunities and Challenges [21.712051537924136]
The emergence of Large Language Models (LLMs) has great potential to reshape the landscape of combating misinformation.
On the one hand, LLMs bring promising opportunities for combating misinformation due to their profound world knowledge and strong reasoning abilities.
On the other hand, the critical challenge is that LLMs can be easily leveraged to generate deceptive misinformation at scale.
arXiv Detail & Related papers (2023-11-09T00:05:27Z)
- Unveiling the Pitfalls of Knowledge Editing for Large Language Models [41.83423510576848]
It remains unclear whether knowledge editing introduces side effects that pose potential risks.
This paper pioneers the investigation into the potential pitfalls associated with knowledge editing for Large Language Models.
Experimental results demonstrate that knowledge editing can inadvertently introduce unintended side effects.
arXiv Detail & Related papers (2023-10-03T15:10:46Z)
- Can LLM-Generated Misinformation Be Detected? [18.378744138365537]
Large Language Models (LLMs) can be exploited to generate misinformation.
A fundamental research question is: will LLM-generated misinformation cause more harm than human-written misinformation?
arXiv Detail & Related papers (2023-09-25T00:45:07Z)
- Eva-KELLM: A New Benchmark for Evaluating Knowledge Editing of LLMs [54.22416829200613]
Eva-KELLM is a new benchmark for evaluating knowledge editing of large language models.
Experimental results indicate that current methods for knowledge editing with raw documents fail to yield satisfactory results.
arXiv Detail & Related papers (2023-08-19T09:17:19Z)
- On the Risk of Misinformation Pollution with Large Language Models [127.1107824751703]
We investigate the potential misuse of modern Large Language Models (LLMs) for generating credible-sounding misinformation.
Our study reveals that LLMs can act as effective misinformation generators, leading to a significant degradation in the performance of Open-Domain Question Answering (ODQA) systems.
arXiv Detail & Related papers (2023-05-23T04:10:26Z)
- Can LMs Learn New Entities from Descriptions? Challenges in Propagating Injected Knowledge [72.63368052592004]
We study LMs' abilities to make inferences based on injected facts (i.e., to propagate those facts).
We find that existing methods for updating knowledge show little propagation of injected knowledge.
Yet, prepending entity definitions in an LM's context improves performance across all settings.
arXiv Detail & Related papers (2023-05-02T17:59:46Z)
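The positive finding in the last entry, prepending an entity definition to the context rather than editing weights, is easy to picture; the entity and definition below are invented placeholders, and any LM inference call would consume the resulting prompt.

```python
# Sketch of in-context knowledge injection by prompt prepending.
def build_prompt(question: str, definition: str | None = None) -> str:
    """Prepend an injected fact so the LM conditions on it directly,
    instead of writing it into the weights via a knowledge edit."""
    return f"{definition}\n\n{question}" if definition else question

definition = "Zorblax is a hypothetical element with atomic number 119."
question = "What is the atomic number of Zorblax?"
print(build_prompt(question, definition))  # pass to any LM for inference
```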
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.