Safeguarding Efficacy in Large Language Models: Evaluating Resistance to Human-Written and Algorithmic Adversarial Prompts
- URL: http://arxiv.org/abs/2510.15973v1
- Date: Sun, 12 Oct 2025 21:48:34 GMT
- Title: Safeguarding Efficacy in Large Language Models: Evaluating Resistance to Human-Written and Algorithmic Adversarial Prompts
- Authors: Tiarnaigh Downey-Webb, Olamide Jogunola, Oluwaseun Ajao
- Abstract summary: This paper presents a systematic security assessment of four prominent Large Language Models (LLMs) against adversarial attack vectors. We evaluate Phi-2, Llama-2-7B-Chat, GPT-3.5-Turbo, and GPT-4 across four distinct attack categories: human-written prompts, AutoDAN, Greedy Coordinate Gradient (GCG), and Tree-of-Attacks-with-pruning (TAP).
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a systematic security assessment of four prominent Large Language Models (LLMs) against diverse adversarial attack vectors. We evaluate Phi-2, Llama-2-7B-Chat, GPT-3.5-Turbo, and GPT-4 across four distinct attack categories: human-written prompts, AutoDAN, Greedy Coordinate Gradient (GCG), and Tree-of-Attacks-with-pruning (TAP). Our comprehensive evaluation employs 1,200 carefully stratified prompts from the SALAD-Bench dataset, spanning six harm categories. Results demonstrate significant variations in model robustness, with Llama-2 achieving the highest overall security (3.4% average attack success rate) while Phi-2 exhibits the greatest vulnerability (7.0% average attack success rate). We identify critical transferability patterns where GCG and TAP attacks, though ineffective against their target model (Llama-2), achieve substantially higher success rates when transferred to other models (up to 17% for GPT-4). Statistical analysis using Friedman tests reveals significant differences in vulnerability across harm categories ($p < 0.001$), with malicious use prompts showing the highest attack success rates (10.71% average). Our findings contribute to understanding cross-model security vulnerabilities and provide actionable insights for developing targeted defense mechanisms.
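The abstract's headline statistics lend themselves to a small reproducible analysis. The sketch below shows how per-attack attack success rates (ASRs) and a Friedman test across harm categories could be computed with scipy; the table of ASR values, its layout, and all variable names are invented for illustration and are not the paper's actual data or code.

```python
# Minimal sketch (not the authors' code): per-attack ASRs and a Friedman
# test across harm categories. All numbers below are invented.
import numpy as np
from scipy.stats import friedmanchisquare

# Hypothetical results: rows = 6 SALAD-Bench harm categories,
# columns = 4 attack types (human-written, AutoDAN, GCG, TAP);
# each cell is the attack success rate (ASR) observed for one model.
asr = np.array([
    [0.04, 0.09, 0.14, 0.12],  # e.g. malicious use (highest, per the paper)
    [0.01, 0.03, 0.06, 0.04],
    [0.00, 0.02, 0.05, 0.03],
    [0.03, 0.04, 0.09, 0.07],
    [0.01, 0.02, 0.04, 0.02],
    [0.02, 0.06, 0.10, 0.05],
])

# Mean ASR per attack type, averaged over harm categories.
print("mean ASR per attack:", asr.mean(axis=0).round(3))

# Friedman test: treat each harm category as one group measured under
# the same attack conditions; a small p-value indicates that vulnerability
# differs systematically across categories (the paper reports p < 0.001).
stat, p = friedmanchisquare(*asr)
print(f"Friedman chi-square = {stat:.3f}, p = {p:.4f}")
```

With real data, each row would hold one harm category's ASRs over the same set of attack/model conditions; the Friedman test is the appropriate non-parametric choice here because the same prompts are measured repeatedly under each attack.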
Related papers
- Penetration Testing of Agentic AI: A Comparative Security Analysis Across Models and Frameworks [0.0]
Agentic AI introduces security vulnerabilities that traditional LLM safeguards fail to address. We conduct the first systematic testing and comparative evaluation of agentic AI systems. We identify six distinct defensive behavior patterns, including a novel "hallucinated compliance" strategy.
arXiv Detail & Related papers (2025-12-16T19:22:50Z)
- Replicating TEMPEST at Scale: Multi-Turn Adversarial Attacks Against Trillion-Parameter Frontier Models [0.0]
This study employed the TEMPEST multi-turn attack framework to evaluate ten frontier models from eight vendors across 1,000 harmful behaviors. Six models showed attack success rates (ASR) of 96% to 100%, while four showed meaningful resistance, with ASR ranging from 42% to 78%.
arXiv Detail & Related papers (2025-12-08T00:30:40Z)
- Evaluating the Robustness of Large Language Model Safety Guardrails Against Adversarial Attacks [0.0]
Large Language Model (LLM) safety guardrail models have emerged as a primary defense mechanism against harmful content generation. This study evaluated ten publicly available guardrail models from Meta, Google, IBM, NVIDIA, Alibaba, and Allen AI across 1,445 test prompts spanning 21 attack categories.
arXiv Detail & Related papers (2025-11-27T03:01:09Z)
- DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of diffusion LLM (dLLM) vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. We propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z)
- An Automated Attack Investigation Approach Leveraging Threat-Knowledge-Augmented Large Language Models [17.220143037047627]
Advanced Persistent Threats (APTs) compromise high-value systems to steal data or disrupt operations. Existing methods suffer from poor platform generality, limited generalization to evolving tactics, and an inability to produce analyst-ready reports. We present an LLM-empowered attack investigation framework augmented with a dynamically adaptable Kill-Chain-aligned threat knowledge base.
arXiv Detail & Related papers (2025-09-01T08:57:01Z)
- The Resurgence of GCG Adversarial Attacks on Large Language Models [4.157278627741554]
We present a systematic appraisal of GCG and its variant, TGCG, across open-source landscapes. Attack success rates decrease with model size, reflecting increasing complexity. Coding prompts are more vulnerable than adversarial safety prompts, suggesting that reasoning itself can be exploited as an attack vector. (A simplified sketch of the GCG inner loop appears after this list.)
arXiv Detail & Related papers (2025-08-30T07:04:29Z)
- When Developer Aid Becomes Security Debt: A Systematic Analysis of Insecure Behaviors in LLM Coding Agents [1.7587442088965226]
LLM-based coding agents are rapidly being deployed in software development, yet their safety implications remain poorly understood. We conducted the first systematic safety evaluation of autonomous coding agents, analyzing over 12,000 actions across five state-of-the-art models. We developed a high-precision detection system that identified four major vulnerability categories, with information exposure the most prevalent.
arXiv Detail & Related papers (2025-07-12T16:11:07Z)
- Evaluating the Robustness of Adversarial Defenses in Malware Detection Systems [2.209921757303168]
First, we introduce a technique to convert continuous perturbations into binary feature spaces while preserving high attack success and low perturbation size. Second, we present a novel adversarial method for binary domains, designed to achieve attack goals with minimal feature changes. Experiments on the Malscan dataset show that sigma-binary outperforms existing attacks and exposes key vulnerabilities in state-of-the-art defenses.
arXiv Detail & Related papers (2025-05-14T12:38:43Z)
- T2V-OptJail: Discrete Prompt Optimization for Text-to-Video Jailbreak Attacks [67.91652526657599]
We formalize the T2V jailbreak attack as a discrete optimization problem and propose a joint objective-based optimization framework, called T2V-OptJail. We conduct large-scale experiments on several T2V models, covering both open-source models and real commercial closed-source models. The proposed method improves on the existing state-of-the-art method by 11.4% and 10.0% in attack success rate.
arXiv Detail & Related papers (2025-05-10T16:04:52Z)
- Benchmarking Reasoning Robustness in Large Language Models [76.79744000300363]
This paper introduces a novel benchmark, termed Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps. We find significant performance degradation on novel or incomplete data; these findings highlight a reliance on recall over rigorous logical inference.
arXiv Detail & Related papers (2025-03-06T15:36:06Z)
- Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models [92.79804303337522]
Vision-Language Models (VLMs) may still be vulnerable to safety alignment issues. We introduce MLAI, a novel jailbreak framework that leverages scenario-aware image generation for semantic alignment. Extensive experiments demonstrate MLAI's significant impact, achieving attack success rates of 77.75% on MiniGPT-4 and 82.80% on LLaVA-2.
arXiv Detail & Related papers (2024-11-27T02:40:29Z)
- EaTVul: ChatGPT-based Evasion Attack Against Software Vulnerability Detection [19.885698402507145]
Adversarial examples can exploit vulnerabilities within deep neural networks.
This study showcases the susceptibility of deep learning models to adversarial attacks, which can achieve a 100% attack success rate.
arXiv Detail & Related papers (2024-07-27T09:04:54Z)
- Preference Poisoning Attacks on Reward Model Learning [47.00395978031771]
We investigate the nature and extent of a vulnerability in learning reward models from pairwise comparisons.
We propose two classes of algorithmic approaches for these attacks: a gradient-based framework, and several variants of rank-by-distance methods.
We find that the best attacks are often highly successful, achieving, in the most extreme case, a 100% success rate with only 0.3% of the data poisoned.
arXiv Detail & Related papers (2024-02-02T21:45:24Z)
- G$^2$uardFL: Safeguarding Federated Learning Against Backdoor Attacks through Attributed Client Graph Clustering [116.4277292854053]
Federated Learning (FL) offers collaborative model training without data sharing.
FL is vulnerable to backdoor attacks, where poisoned model weights lead to compromised system integrity.
We present G$^2$uardFL, a protective framework that reinterprets the identification of malicious clients as an attributed graph clustering problem.
arXiv Detail & Related papers (2023-06-08T07:15:04Z)
- Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks [65.20660287833537]
In this paper we propose two extensions of the PGD attack that overcome failures caused by a suboptimal step size and by problems with the objective function. (A minimal baseline PGD loop, illustrating the fixed-step-size weakness, is sketched after this list.)
We then combine our novel attacks with two complementary existing ones to form a parameter-free, computationally affordable and user-independent ensemble of attacks to test adversarial robustness.
arXiv Detail & Related papers (2020-03-03T18:15:55Z)
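Since Greedy Coordinate Gradient features both in the main paper and in the GCG entry above, a compressed sketch of its inner loop may help: compute gradients of the target loss with respect to one-hot suffix tokens, shortlist the top-k substitutions per position, then greedily keep the best-scoring random candidate. This is a simplified illustration under assumptions (a Hugging Face-style causal LM that accepts `inputs_embeds`; placeholder names throughout), not the attack code used in any paper listed here.

```python
# Simplified single-step GCG sketch (illustrative only; not code from
# any paper listed here). Assumes a Hugging Face-style causal LM that
# accepts `inputs_embeds` and returns `.logits`.
import torch
import torch.nn.functional as F

def gcg_step(model, embed_matrix, prompt_ids, suffix_ids, target_ids,
             k=256, n_cand=64):
    """One greedy coordinate-descent step over an adversarial suffix."""
    def target_loss(suffix_embeds):
        # Loss = NLL of the harmful target continuation given the prompt
        # and the (embedded) adversarial suffix.
        inputs = torch.cat([embed_matrix[prompt_ids], suffix_embeds,
                            embed_matrix[target_ids]]).unsqueeze(0)
        logits = model(inputs_embeds=inputs).logits
        start = len(prompt_ids) + len(suffix_ids)  # first target position
        return F.cross_entropy(
            logits[0, start - 1 : start - 1 + len(target_ids)], target_ids)

    # Differentiable one-hot encoding of the suffix tokens.
    one_hot = F.one_hot(suffix_ids, num_classes=embed_matrix.shape[0])
    one_hot = one_hot.float().requires_grad_(True)
    loss = target_loss(one_hot @ embed_matrix)
    loss.backward()

    # Most promising substitutions: largest negative gradient per position.
    top_k = (-one_hot.grad).topk(k, dim=1).indices  # [suffix_len, k]

    # Evaluate random single-token swaps; keep the lowest-loss suffix.
    best_loss, best_suffix = loss.item(), suffix_ids
    for _ in range(n_cand):
        pos = torch.randint(len(suffix_ids), (1,)).item()
        cand = suffix_ids.clone()
        cand[pos] = top_k[pos, torch.randint(k, (1,)).item()]
        with torch.no_grad():
            cand_loss = target_loss(embed_matrix[cand]).item()
        if cand_loss < best_loss:
            best_loss, best_suffix = cand_loss, cand
    return best_suffix
```

Repeating this step until the model emits the target continuation is what makes GCG expensive, and it is why transferability of the resulting suffixes (as the main paper measures) matters in practice.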
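Likewise, the "suboptimal step size" failure mode named in the last entry concerns the fixed-step PGD loop that the paper's adaptive variants replace. Below is a minimal baseline L-infinity PGD sketch for orientation (illustrative only; `model`, `eps`, and `alpha` are placeholders, and this is the baseline being improved upon, not that paper's method).

```python
# Minimal L-infinity PGD baseline (illustrative only). Adaptive variants
# replace the fixed step size `alpha` with a schedule; with a poorly
# chosen fixed alpha this loop can fail to find adversarial examples.
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Untargeted PGD: maximize cross-entropy within an eps-ball around x."""
    # Random start inside the eps-ball, clipped to the valid pixel range.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Ascent step on the gradient sign, then project back into
        # the eps-ball and the valid input range.
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = (x + (x_adv - x).clamp(-eps, eps)).clamp(0, 1).detach()
    return x_adv
```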
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.