Position: Editing Large Language Models Poses Serious Safety Risks
- URL: http://arxiv.org/abs/2502.02958v1
- Date: Wed, 05 Feb 2025 07:51:32 GMT
- Title: Position: Editing Large Language Models Poses Serious Safety Risks
- Authors: Paul Youssef, Zhixue Zhao, Daniel Braun, Jörg Schlötterer, Christin Seifert
- Abstract summary: We argue that editing Large Language Models poses serious safety risks that have been largely overlooked. We highlight vulnerabilities in the AI ecosystem that allow unrestricted uploading and downloading of updated models without verification. We call on the community to (i) research tamper-resistant models and countermeasures against malicious model editing, and (ii) actively engage in securing the AI ecosystem.
- Score: 5.6897620867951435
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large Language Models (LLMs) contain large amounts of facts about the world. These facts can become outdated over time, which has led to the development of knowledge editing methods (KEs) that can change specific facts in LLMs with limited side effects. This position paper argues that editing LLMs poses serious safety risks that have been largely overlooked. First, we note that the fact that KEs are widely available, computationally inexpensive, highly performant, and stealthy makes them an attractive tool for malicious actors. Second, we discuss malicious use cases of KEs, showing how KEs can be easily adapted for a variety of malicious purposes. Third, we highlight vulnerabilities in the AI ecosystem that allow unrestricted uploading and downloading of updated models without verification. Fourth, we argue that a lack of social and institutional awareness exacerbates this risk, and discuss the implications for different stakeholders. We call on the community to (i) research tamper-resistant models and countermeasures against malicious model editing, and (ii) actively engage in securing the AI ecosystem.
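A rough sense of why such edits are cheap and stealthy can be conveyed with a toy sketch. The snippet below applies a rank-one weight update in the spirit of locate-and-edit methods such as ROME; the linear layer and the key/value vectors are synthetic stand-ins invented for illustration, not code or data from the paper.

```python
# Toy, self-contained sketch of a rank-one "knowledge edit" in the spirit
# of locate-and-edit methods such as ROME. The nn.Linear below stands in
# for a transformer MLP projection; the key/value vectors are synthetic
# placeholders, not data from the paper.
import torch

torch.manual_seed(0)
d = 64
mlp_proj = torch.nn.Linear(d, d, bias=False)  # stand-in for an MLP weight

k = torch.randn(d)       # hidden-state "key" that encodes the target fact
v_new = torch.randn(d)   # output "value" the edit should associate with k

with torch.no_grad():
    v_old = mlp_proj(k)
    # Rank-one update: W <- W + (v_new - v_old) k^T / ||k||^2,
    # which guarantees W k == v_new after the edit.
    delta = torch.outer(v_new - v_old, k) / k.dot(k)
    mlp_proj.weight.add_(delta)

print(torch.allclose(mlp_proj(k), v_new, atol=1e-5))  # True: fact rewritten

# Any input orthogonal to k is left essentially unchanged, so the edit is
# both cheap (one outer product) and behaviorally stealthy.
k_orth = torch.randn(d)
k_orth -= k_orth.dot(k) / k.dot(k) * k
print(torch.allclose(delta @ k_orth, torch.zeros(d), atol=1e-5))  # True
```

Because inputs orthogonal to the edited key are left essentially unchanged, the tampered checkpoint behaves normally on almost all prompts, which is why the abstract stresses the lack of verification around uploading and downloading updated models. A minimal integrity check, assuming the provider publishes a trusted digest out-of-band (the file name and digest below are hypothetical placeholders), could look like this:

```python
# Minimal sketch of verifying a downloaded checkpoint against a digest
# published out-of-band by the model provider. The file name and expected
# digest are hypothetical placeholders.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

EXPECTED = "0000...placeholder...0000"  # trusted digest from the provider
if sha256_of(Path("model.safetensors")) != EXPECTED:
    raise RuntimeError("checkpoint does not match published digest")
```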
Related papers
- Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models [50.89022445197919]
Large Audio Language Models (LALMs) have extended the capabilities of Large Language Models (LLMs). Recent research has revealed that LALMs remain vulnerable to harmful queries due to insufficient safety alignment.
arXiv Detail & Related papers (2025-05-26T08:25:25Z)
- Security Concerns for Large Language Models: A Survey [3.175227858236288]
Large Language Models (LLMs) have caused a revolution in natural language processing, but their capabilities also introduce new security vulnerabilities. We provide a comprehensive overview of the emerging security concerns around LLMs, categorizing threats into prompt injection and jailbreaking, adversarial attacks such as input perturbations and data poisoning, and the worrisome risks inherent in autonomous LLM agents. We conclude by emphasizing the importance of advancing robust, multi-layered security strategies to ensure LLMs are safe and beneficial.
arXiv Detail & Related papers (2025-05-24T22:22:43Z)
- Dark LLMs: The Growing Threat of Unaligned AI Models [8.183446952097528]
Large Language Models (LLMs) rapidly reshape modern life, advancing fields from healthcare to education and beyond. The vulnerability of LLMs to jailbreak attacks stems from the very data they learn from. Our research identifies the growing threat posed by dark LLMs: models deliberately designed without ethical guardrails or modified through jailbreak techniques.
arXiv Detail & Related papers (2025-05-15T08:07:04Z)
- Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs [83.11815479874447]
We propose a novel jailbreak attack framework, inspired by cognitive decomposition and biases in human cognition. We employ cognitive decomposition to reduce the complexity of malicious prompts and relevance bias to reorganize prompts. We also introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm.
arXiv Detail & Related papers (2025-05-03T05:28:11Z)
- SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models [50.34706204154244]
Acquiring reasoning capabilities catastrophically degrades inherited safety alignment.
Certain scenarios suffer attack success rates up to 25 times higher.
Despite tight reasoning-answer safety coupling, MLRMs demonstrate nascent self-correction.
arXiv Detail & Related papers (2025-04-09T06:53:23Z)
- Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation is a novel multi-turn jailbreak framework.
It reformulates harmful queries into benign reasoning tasks.
We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z)
- Compromising Honesty and Harmlessness in Language Models via Deception Attacks [0.04499833362998487]
"Deception attacks" customize models to mislead users when prompted on chosen topics while remaining accurate on others.
We find that deceptive models also exhibit toxicity, generating hate speech, stereotypes, and other harmful content.
arXiv Detail & Related papers (2025-02-12T11:02:59Z)
- Jailbreaking Large Language Models in Infinitely Many Ways [3.5674816606221182]
We show how one can bypass the safeguards of the most powerful open- and closed-source LLMs and generate content that explicitly violates their safety policies. For two categories of attacks that are straightforward to implement, we discuss two defensive strategies, one in token space and the other in embedding space.
arXiv Detail & Related papers (2025-01-18T15:39:53Z)
- Emerging Security Challenges of Large Language Models [6.151633954305939]
Large language models (LLMs) have achieved record adoption in a short period of time across many different sectors. They are open-ended models trained on diverse data without being tailored for specific downstream tasks. Traditional Machine Learning (ML) models are vulnerable to adversarial attacks.
arXiv Detail & Related papers (2024-12-23T14:36:37Z)
- LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts [88.96201324719205]
Safety concerns in large language models (LLMs) have gained significant attention due to their exposure to potentially harmful data during pre-training. We identify a new safety vulnerability in LLMs, where seemingly benign prompts, semantically related to harmful content, can bypass safety mechanisms. We introduce a novel attack method, ActorBreaker, which identifies actors related to toxic prompts within the pre-training distribution.
arXiv Detail & Related papers (2024-10-14T16:41:49Z)
- Endless Jailbreaks with Bijection Learning [3.5963161678592828]
We introduce a powerful attack algorithm which fuzzes LLMs for safety vulnerabilities using randomly-generated encodings.
Our attack is extremely effective on a wide range of frontier language models. (A toy sketch of the random-encoding idea appears after this list.)
arXiv Detail & Related papers (2024-10-02T07:40:56Z)
- Can LLMs be Fooled? Investigating Vulnerabilities in LLMs [4.927763944523323]
Large Language Models (LLMs) have garnered significant popularity and wield immense power across various domains within Natural Language Processing (NLP).
This paper will synthesize the findings from each vulnerability section and propose new directions of research and development.
By understanding the focal points of current vulnerabilities, we can better anticipate and mitigate future risks.
arXiv Detail & Related papers (2024-07-30T04:08:00Z)
- Can Editing LLMs Inject Harm? [122.83469484328465]
We propose to reformulate knowledge editing as a new type of safety threat for Large Language Models.
For the risk of misinformation injection, we first categorize it into commonsense misinformation injection and long-tail misinformation injection.
For the risk of bias injection, we discover that not only can biased sentences be injected into LLMs with high effectiveness, but one single biased sentence injection can also cause a bias increase in the LLM's general outputs.
arXiv Detail & Related papers (2024-07-29T17:58:06Z)
- Fine-Tuning, Quantization, and LLMs: Navigating Unintended Outcomes [0.0]
Large Language Models (LLMs) have gained widespread adoption across various domains, including chatbots and auto-task completion agents.
These models are susceptible to safety vulnerabilities such as jailbreaking, prompt injection, and privacy leakage attacks.
This study investigates the impact of modifications such as fine-tuning and quantization on LLM safety, a critical consideration for building reliable and secure AI systems.
arXiv Detail & Related papers (2024-04-05T20:31:45Z)
- A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly [21.536079040559517]
Large Language Models (LLMs) have revolutionized natural language understanding and generation.
This paper explores the intersection of LLMs with security and privacy.
arXiv Detail & Related papers (2023-12-04T16:25:18Z)
- Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking [60.78524314357671]
We investigate a novel category of jailbreak attacks specifically designed to target the cognitive structure and processes of large language models (LLMs).
Our proposed cognitive overload is a black-box attack that requires no knowledge of model architecture or access to model weights.
Experiments conducted on AdvBench and MasterKey reveal that various LLMs, including both the popular open-source model Llama 2 and the proprietary model ChatGPT, can be compromised through cognitive overload.
arXiv Detail & Related papers (2023-11-16T11:52:22Z)
- Privacy in Large Language Models: Attacks, Defenses and Future Directions [84.73301039987128]
We analyze the current privacy attacks targeting large language models (LLMs) and categorize them according to the adversary's assumed capabilities.
We present a detailed overview of prominent defense strategies that have been developed to counter these privacy attacks.
arXiv Detail & Related papers (2023-10-16T13:23:54Z)
- Factuality Challenges in the Era of Large Language Models [113.3282633305118]
Large Language Models (LLMs) can generate false, erroneous, or misleading content.
LLMs can be exploited for malicious applications.
This poses a significant challenge to society in terms of the potential deception of users.
arXiv Detail & Related papers (2023-10-08T14:55:02Z)
- Visual Adversarial Examples Jailbreak Aligned Large Language Models [66.53468356460365]
We show that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks.
We exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision.
Our study underscores the escalating adversarial risks associated with the pursuit of multimodality.
arXiv Detail & Related papers (2023-06-22T22:13:03Z)
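As a concrete illustration of the bijection-learning entry above, the following toy sketch samples a random character-level bijection and builds a prompt that teaches it in-context. It shows only the general family of randomly generated encodings; the mapping scheme, prompt template, and placeholder query are invented for illustration and are not the authors' implementation.

```python
# Toy sketch of a randomly sampled bijective encoding taught in-context.
# Illustrative only: the mapping scheme and prompt template are invented,
# and the query is a harmless placeholder.
import random
import string

def sample_bijection(seed: int) -> dict[str, str]:
    """Random one-to-one map of the lowercase alphabet onto itself."""
    rng = random.Random(seed)
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    rng.shuffle(shuffled)
    return dict(zip(letters, shuffled))

def encode(text: str, mapping: dict[str, str]) -> str:
    """Apply the character-level bijection; non-letters pass through."""
    return "".join(mapping.get(ch, ch) for ch in text.lower())

mapping = sample_bijection(seed=42)
query = "an innocuous placeholder query"
prompt = (
    "We will communicate in an encoded language. The mapping is: "
    + ", ".join(f"{a}->{b}" for a, b in mapping.items())
    + ". Decode my message, answer it, and reply in the same encoding. "
    + "Message: " + encode(query, mapping)
)
print(prompt)
```

Because a fresh bijection can be sampled for every attempt, the space of candidate encodings is effectively unbounded, which is the sense of "endless" in the paper's title.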
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.