Intrinsic Model Weaknesses: How Priming Attacks Unveil Vulnerabilities in Large Language Models
- URL: http://arxiv.org/abs/2502.16491v1
- Date: Sun, 23 Feb 2025 08:09:23 GMT
- Title: Intrinsic Model Weaknesses: How Priming Attacks Unveil Vulnerabilities in Large Language Models
- Authors: Yuyi Huang, Runzhe Zhan, Derek F. Wong, Lidia S. Chao, Ailin Tao
- Abstract summary: Large language models (LLMs) have significantly influenced various industries but suffer from a critical flaw: their susceptibility to generating harmful content. We developed and tested novel attack strategies on popular LLMs to expose their vulnerabilities in generating inappropriate content.
- Score: 40.180771969531456
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have significantly influenced various industries but suffer from a critical flaw: their susceptibility to generating harmful content, which poses severe societal risks. We developed and tested novel attack strategies on popular LLMs to expose their vulnerabilities in generating inappropriate content. These strategies, inspired by psychological phenomena such as the "Priming Effect", "Safe Attention Shift", and "Cognitive Dissonance", effectively bypass the models' safety mechanisms. Our experiments achieved an attack success rate (ASR) of 100% on various open-source models, including Meta's Llama-3.2, Google's Gemma-2, Mistral's Mistral-NeMo, Falcon's Falcon-mamba, Apple's DCLM, Microsoft's Phi3, and Qwen's Qwen2.5, among others. Similarly, for closed-source models such as OpenAI's GPT-4o, Google's Gemini-1.5, and Claude-3.5, we observed an ASR of at least 95% on the AdvBench dataset, which represents the current state-of-the-art. This study underscores the urgent need to reassess the use of generative models in critical applications to mitigate potential adverse societal impacts.
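For context on the headline metric, the sketch below shows a minimal attack-success-rate (ASR) computation over a set of adversarial prompts such as AdvBench. The refusal-phrase judge and the function names are illustrative assumptions, not the paper's evaluation pipeline; stronger evaluations typically replace `is_refusal` with a safety classifier or human review.

```python
# Minimal sketch: attack success rate (ASR) over a benchmark of adversarial
# prompts. The refusal-phrase heuristic is a common proxy judge, NOT the
# paper's actual evaluation pipeline.
from typing import Callable, Iterable

REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai", "i must decline")

def is_refusal(response: str) -> bool:
    """Heuristic judge: treat a response containing a refusal phrase as blocked."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def attack_success_rate(prompts: Iterable[str],
                        generate: Callable[[str], str]) -> float:
    """ASR = fraction of adversarial prompts that elicit a non-refusal response."""
    prompts = list(prompts)
    successes = sum(1 for p in prompts if not is_refusal(generate(p)))
    return successes / max(len(prompts), 1)
```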
Related papers
- SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models [50.34706204154244]
Acquiring reasoning capabilities catastrophically degrades inherited safety alignment.
Certain scenarios suffer 25 times higher attack rates.
Despite tight reasoning-answer safety coupling, MLRMs demonstrate nascent self-correction.
arXiv Detail & Related papers (2025-04-09T06:53:23Z)
- Defending Deep Neural Networks against Backdoor Attacks via Module Switching [15.979018992591032]
An exponential increase in the parameters of Deep Neural Networks (DNNs) has significantly raised the cost of independent training.
Open-source models are more vulnerable to malicious threats, such as backdoor attacks.
We propose a novel module-switching strategy to break such spurious correlations within the model's propagation path.
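As a rough illustration of the module-switching idea, the sketch below builds a hybrid network by swapping named submodules between two independently trained copies of the same architecture; the paper's actual selection and switching strategy is not reproduced here, and `module_names` is a hypothetical input.

```python
import copy
import torch.nn as nn

def switch_modules(model_a: nn.Module, model_b: nn.Module,
                   module_names: list) -> nn.Module:
    """Build a hybrid that takes the named submodules from model_b and the
    rest from model_a, disrupting any trigger-to-target correlation that
    relies on a consistent propagation path through a single model."""
    hybrid = copy.deepcopy(model_a)
    for name in module_names:                      # e.g. "encoder.layers.3"
        donor = copy.deepcopy(model_b.get_submodule(name))
        parent_name, _, child_name = name.rpartition(".")
        parent = hybrid.get_submodule(parent_name) if parent_name else hybrid
        setattr(parent, child_name, donor)
    return hybrid
```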
arXiv Detail & Related papers (2025-04-08T11:01:07Z)
- Swallowing the Poison Pills: Insights from Vulnerability Disparity Among LLMs [3.7913442178940318]
Modern large language models (LLMs) exhibit critical vulnerabilities to poison pill attacks.
We demonstrate these attacks exploit inherent architectural properties of LLMs.
Our work establishes poison pills as both a security threat and diagnostic tool.
arXiv Detail & Related papers (2025-02-23T06:34:55Z)
- Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities [49.09703018511403]
Large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks.
Currently, most risk evaluations are conducted by designing inputs that elicit harmful behaviors from the system.
We propose evaluating LLMs with model tampering attacks, which allow for modifications to latent activations or weights.
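A minimal sketch of one form of tampering, assuming a PyTorch model: a forward hook adds a fixed steering vector to a chosen layer's activations. The layer name, vector, and scale are placeholders rather than the paper's procedure; weight-space tampering would instead perturb `layer.weight` directly.

```python
import torch

def add_activation_perturbation(model, layer_name: str, vector: torch.Tensor,
                                scale: float = 1.0):
    """Register a forward hook that adds `scale * vector` to one layer's
    output -- a simple instance of latent-activation tampering."""
    layer = model.get_submodule(layer_name)

    def hook(module, inputs, output):
        # Some transformer blocks return tuples; perturb the hidden states only.
        if isinstance(output, tuple):
            return (output[0] + scale * vector,) + output[1:]
        return output + scale * vector

    return layer.register_forward_hook(hook)  # keep the handle; .remove() undoes it
```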
arXiv Detail & Related papers (2025-02-03T18:59:16Z)
- HarmLevelBench: Evaluating Harm-Level Compliance and the Impact of Quantization on Model Alignment [1.8843687952462742]
This paper aims to address gaps in the current literature on jailbreaking techniques and the evaluation of LLM vulnerabilities.
Our contributions include the creation of a novel dataset designed to assess the harmfulness of model outputs across multiple harm levels.
We provide a comprehensive benchmark of state-of-the-art jailbreaking attacks, specifically targeting the Vicuna 13B v1.5 model.
arXiv Detail & Related papers (2024-11-11T10:02:49Z) - Transferable Adversarial Attacks on SAM and Its Downstream Models [87.23908485521439]
This paper explores the feasibility of adversarial attacks on various downstream models fine-tuned from the Segment Anything Model (SAM).
To enhance the effectiveness of the adversarial attack towards models fine-tuned on unknown datasets, we propose a universal meta-initialization (UMI) algorithm.
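The sketch below is a generic projected-gradient attack warm-started from a given universal perturbation, meant only to convey the role of the initialization; the UMI algorithm itself meta-learns `universal_delta`, which is not shown, and the feature-deviation loss is an assumption.

```python
import torch
import torch.nn.functional as F

def pgd_from_universal_init(encoder, images, universal_delta,
                            eps=8 / 255, alpha=2 / 255, steps=10):
    """PGD warm-started from a universal perturbation: push adversarial
    features of a surrogate encoder away from the clean features."""
    clean_feat = encoder(images).detach().flatten(1)
    delta = universal_delta.clone().detach().clamp(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        adv_feat = encoder((images + delta).clamp(0, 1)).flatten(1)
        # Minimizing cosine similarity drives features away from the clean ones.
        loss = F.cosine_similarity(adv_feat, clean_feat).mean()
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta -= alpha * grad.sign()
            delta.clamp_(-eps, eps)
    return (images + delta).clamp(0, 1).detach()
```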
arXiv Detail & Related papers (2024-10-26T15:04:04Z) - Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models [13.225041704917905]
This study unveils an attack mechanism that capitalizes on human conversation strategies to extract harmful information from large language models.
Unlike conventional methods that target explicit malicious responses, our approach delves deeper into the nature of the information provided in responses.
arXiv Detail & Related papers (2024-07-22T06:04:29Z) - A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models [20.40158210837289]
We investigate nine attack techniques and seven defense techniques applied across three distinct language models: Vicuna, LLama, and GPT-3.5 Turbo.
Our findings reveal that existing white-box attacks underperform compared to universal techniques and that including special tokens in the input significantly affects the likelihood of successful attacks.
arXiv Detail & Related papers (2024-02-21T01:26:39Z) - Evaluating Efficacy of Model Stealing Attacks and Defenses on Quantum
Neural Networks [2.348041867134616]
Cloud hosting of quantum machine learning (QML) models exposes them to a range of vulnerabilities.
Model stealing attacks can produce clone models achieving up to $0.9\times$ and $0.99\times$ clone test accuracy.
To defend against these attacks, we leverage the unique properties of current noisy hardware and perturb the victim model outputs.
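As a simplified picture of output perturbation (the paper leverages the inherent noise of current quantum hardware rather than explicitly injected noise), the sketch below adds Gaussian noise to the victim's returned class probabilities so that a stolen clone is trained on degraded labels. The function name and noise scale are illustrative assumptions.

```python
import numpy as np

def perturbed_predict(probabilities: np.ndarray, noise_scale: float = 0.05,
                      rng=None) -> np.ndarray:
    """Return noisy output probabilities to querying clients so that a
    model-stealing adversary fits its clone to degraded labels."""
    rng = rng if rng is not None else np.random.default_rng()
    noisy = probabilities + rng.normal(0.0, noise_scale, size=probabilities.shape)
    noisy = np.clip(noisy, 1e-9, None)        # keep probabilities valid
    return noisy / noisy.sum(axis=-1, keepdims=True)
```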
arXiv Detail & Related papers (2024-02-18T19:35:30Z)
- RelaxLoss: Defending Membership Inference Attacks without Losing Utility [68.48117818874155]
We propose a novel training framework based on a relaxed loss with a more achievable learning target.
RelaxLoss is applicable to any classification model with added benefits of easy implementation and negligible overhead.
Our approach consistently outperforms state-of-the-art defense mechanisms in terms of resilience against MIAs.
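A heavily simplified sketch of the underlying idea, assuming a standard PyTorch classifier: keep the training loss near a target level alpha rather than minimizing it toward zero, which narrows the train/test loss gap that membership inference attacks (MIAs) exploit. The published RelaxLoss method includes additional steps (e.g., posterior flattening) not shown here.

```python
import torch.nn.functional as F

def relaxed_loss_step(model, optimizer, inputs, targets, alpha: float = 0.5):
    """One training step that holds the loss near `alpha` instead of
    driving it to zero (simplified version of the relaxed-loss idea)."""
    optimizer.zero_grad()
    loss = F.cross_entropy(model(inputs), targets)
    if loss.item() > alpha:
        loss.backward()        # normal descent while above the target
    else:
        (-loss).backward()     # gentle ascent once below the target
    optimizer.step()
    return loss.item()
```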
arXiv Detail & Related papers (2022-07-12T19:34:47Z)
- ML-Doctor: Holistic Risk Assessment of Inference Attacks Against Machine Learning Models [64.03398193325572]
Inference attacks against Machine Learning (ML) models allow adversaries to learn about training data, model parameters, etc.
We concentrate on four attacks - namely, membership inference, model inversion, attribute inference, and model stealing.
Our analysis relies on a modular re-usable software, ML-Doctor, which enables ML model owners to assess the risks of deploying their models.
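For intuition about the simplest of these four attacks, the sketch below shows a confidence-thresholding membership inference baseline; ML-Doctor's actual implementations are more sophisticated (e.g., shadow models and calibrated thresholds), and the threshold here is an assumption.

```python
import numpy as np

def confidence_threshold_mia(softmax_outputs: np.ndarray,
                             threshold: float = 0.9) -> np.ndarray:
    """Toy membership inference: flag samples on which the target model is
    highly confident as likely training members, exploiting the
    overconfidence models tend to show on data they were trained on."""
    return softmax_outputs.max(axis=-1) > threshold

# Example: softmax outputs for three queried samples (rows sum to 1).
preds = np.array([[0.98, 0.01, 0.01],    # very confident -> flagged as member
                  [0.40, 0.35, 0.25],    # uncertain      -> non-member
                  [0.92, 0.05, 0.03]])
print(confidence_threshold_mia(preds))   # [ True False  True]
```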
arXiv Detail & Related papers (2021-02-04T11:35:13Z)