Safeguarding Efficacy in Large Language Models: Evaluating Resistance to Human-Written and Algorithmic Adversarial Prompts
- URL: http://arxiv.org/abs/2510.15973v1
- Date: Sun, 12 Oct 2025 21:48:34 GMT
- Title: Safeguarding Efficacy in Large Language Models: Evaluating Resistance to Human-Written and Algorithmic Adversarial Prompts
- Authors: Tiarnaigh Downey-Webb, Olamide Jogunola, Oluwaseun Ajao
- Abstract summary: This paper presents a systematic security assessment of four prominent Large Language Models (LLMs) against adversarial attack vectors. We evaluate Phi-2, Llama-2-7B-Chat, GPT-3.5-Turbo, and GPT-4 across four distinct attack categories: human-written prompts, AutoDAN, Greedy Coordinate Gradient (GCG), and Tree-of-Attacks-with-pruning (TAP).
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a systematic security assessment of four prominent Large Language Models (LLMs) against diverse adversarial attack vectors. We evaluate Phi-2, Llama-2-7B-Chat, GPT-3.5-Turbo, and GPT-4 across four distinct attack categories: human-written prompts, AutoDAN, Greedy Coordinate Gradient (GCG), and Tree-of-Attacks-with-pruning (TAP). Our comprehensive evaluation employs 1,200 carefully stratified prompts from the SALAD-Bench dataset, spanning six harm categories. Results demonstrate significant variations in model robustness, with Llama-2 achieving the highest overall security (3.4% average attack success rate) while Phi-2 exhibits the greatest vulnerability (7.0% average attack success rate). We identify critical transferability patterns where GCG and TAP attacks, though ineffective against their target model (Llama-2), achieve substantially higher success rates when transferred to other models (up to 17% for GPT-4). Statistical analysis using Friedman tests reveals significant differences in vulnerability across harm categories ($p < 0.001$), with malicious use prompts showing the highest attack success rates (10.71% average). Our findings contribute to understanding cross-model security vulnerabilities and provide actionable insights for developing targeted defense mechanisms.
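The abstract's headline statistics lend themselves to a small reproducible analysis. The sketch below shows how per-attack attack success rates (ASRs) and a Friedman test across harm categories could be computed with scipy; the table of ASR values, its layout, and all variable names are invented for illustration and are not the paper's actual data or code.

```python
# Minimal sketch (not the authors' code): per-attack ASRs and a Friedman
# test across harm categories. All numbers below are invented.
import numpy as np
from scipy.stats import friedmanchisquare

# Hypothetical results: rows = 6 SALAD-Bench harm categories,
# columns = 4 attack types (human-written, AutoDAN, GCG, TAP);
# each cell is the attack success rate (ASR) observed for one model.
asr = np.array([
    [0.04, 0.09, 0.14, 0.12],  # e.g. malicious use (highest, per the paper)
    [0.01, 0.03, 0.06, 0.04],
    [0.00, 0.02, 0.05, 0.03],
    [0.03, 0.04, 0.09, 0.07],
    [0.01, 0.02, 0.04, 0.02],
    [0.02, 0.06, 0.10, 0.05],
])

# Mean ASR per attack type, averaged over harm categories.
print("mean ASR per attack:", asr.mean(axis=0).round(3))

# Friedman test: treat each harm category as one group measured under
# the same attack conditions; a small p-value indicates that vulnerability
# differs systematically across categories (the paper reports p < 0.001).
stat, p = friedmanchisquare(*asr)
print(f"Friedman chi-square = {stat:.3f}, p = {p:.4f}")
```

With real data, each row would hold one harm category's ASRs over the same set of attack/model conditions; the Friedman test is the appropriate non-parametric choice here because the same prompts are measured repeatedly under each attack.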
Related papers
- Penetration Testing of Agentic AI: A Comparative Security Analysis Across Models and Frameworks [0.0]
Agentic AI introduces security vulnerabilities that traditional LLM safeguards fail to address. We conduct the first systematic testing and comparative evaluation of agentic AI systems. We identify six distinct defensive behavior patterns, including a novel "hallucinated compliance" strategy.
arXiv Detail & Related papers (2025-12-16T19:22:50Z)
- Replicating TEMPEST at Scale: Multi-Turn Adversarial Attacks Against Trillion-Parameter Frontier Models [0.0]
This study employed the TEMPEST multi-turn attack framework to evaluate ten frontier models from eight vendors across 1,000 harmful behaviors. Six models showed attack success rates (ASR) of 96% to 100%, while four showed meaningful resistance, with ASR ranging from 42% to 78%.
arXiv Detail & Related papers (2025-12-08T00:30:40Z)
- Evaluating the Robustness of Large Language Model Safety Guardrails Against Adversarial Attacks [0.0]
Large Language Model (LLM) safety guardrail models have emerged as a primary defense mechanism against harmful content generation. This study evaluated ten publicly available guardrail models from Meta, Google, IBM, NVIDIA, Alibaba, and Allen AI across 1,445 test prompts spanning 21 attack categories.
arXiv Detail & Related papers (2025-11-27T03:01:09Z)
- DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of diffusion LLM (dLLM) vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. We propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z)
- An Automated Attack Investigation Approach Leveraging Threat-Knowledge-Augmented Large Language Models [17.220143037047627]
Advanced Persistent Threats (APTs) compromise high-value systems to steal data or disrupt operations. Existing methods suffer from poor platform generality, limited generalization to evolving tactics, and an inability to produce analyst-ready reports. We present an LLM-empowered attack investigation framework augmented with a dynamically adaptable Kill-Chain-aligned threat knowledge base.
arXiv Detail & Related papers (2025-09-01T08:57:01Z)
- The Resurgence of GCG Adversarial Attacks on Large Language Models [4.157278627741554]
We present a systematic appraisal of GCG and its variant, TGCG, across open-source landscapes. Attack success rates decrease with model size, reflecting increasing complexity. Coding prompts are more vulnerable than adversarial safety prompts, suggesting that reasoning itself can be exploited as an attack vector. (A simplified sketch of the GCG inner loop appears after this list.)
arXiv Detail & Related papers (2025-08-30T07:04:29Z)
- When Developer Aid Becomes Security Debt: A Systematic Analysis of Insecure Behaviors in LLM Coding Agents [1.7587442088965226]
LLM-based coding agents are rapidly being deployed in software development, yet their safety implications remain poorly understood. We conducted the first systematic safety evaluation of autonomous coding agents, analyzing over 12,000 actions across five state-of-the-art models. We developed a high-precision detection system that identified four major vulnerability categories, with information exposure the most prevalent.
arXiv Detail & Related papers (2025-07-12T16:11:07Z)
- Evaluating the Robustness of Adversarial Defenses in Malware Detection Systems [2.209921757303168]
First, we introduce a technique to convert continuous perturbations into binary feature spaces while preserving high attack success and low perturbation size. Second, we present a novel adversarial method for binary domains, designed to achieve attack goals with minimal feature changes. Experiments on the Malscan dataset show that sigma-binary outperforms existing attacks and exposes key vulnerabilities in state-of-the-art defenses.
arXiv Detail & Related papers (2025-05-14T12:38:43Z)
- T2V-OptJail: Discrete Prompt Optimization for Text-to-Video Jailbreak Attacks [67.91652526657599]
We formalize the T2V jailbreak attack as a discrete optimization problem and propose a joint objective-based optimization framework, called T2V-OptJail. We conduct large-scale experiments on several T2V models, covering both open-source models and real commercial closed-source models. The proposed method improves on the existing state-of-the-art method by 11.4% and 10.0% in attack success rate.
arXiv Detail & Related papers (2025-05-10T16:04:52Z)
- Benchmarking Reasoning Robustness in Large Language Models [76.79744000300363]
This paper introduces a novel benchmark, termed Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps. We find significant performance degradation on novel or incomplete data; these findings highlight a reliance on recall over rigorous logical inference.
arXiv Detail & Related papers (2025-03-06T15:36:06Z)
- Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models [92.79804303337522]
Vision-Language Models (VLMs) may still be vulnerable to safety alignment issues. We introduce MLAI, a novel jailbreak framework that leverages scenario-aware image generation for semantic alignment. Extensive experiments demonstrate MLAI's significant impact, achieving attack success rates of 77.75% on MiniGPT-4 and 82.80% on LLaVA-2.
arXiv Detail & Related papers (2024-11-27T02:40:29Z)
- EaTVul: ChatGPT-based Evasion Attack Against Software Vulnerability Detection [19.885698402507145]
Adversarial examples can exploit vulnerabilities within deep neural networks.
This study showcases the susceptibility of deep learning models to adversarial attacks, which can achieve a 100% attack success rate.
arXiv Detail & Related papers (2024-07-27T09:04:54Z)
- Preference Poisoning Attacks on Reward Model Learning [47.00395978031771]
We investigate the nature and extent of a vulnerability in learning reward models from pairwise comparisons.
We propose two classes of algorithmic approaches for these attacks: a gradient-based framework, and several variants of rank-by-distance methods.
We find that the best attacks are often highly successful, achieving, in the most extreme case, a 100% success rate with only 0.3% of the data poisoned.
arXiv Detail & Related papers (2024-02-02T21:45:24Z)
- G$^2$uardFL: Safeguarding Federated Learning Against Backdoor Attacks through Attributed Client Graph Clustering [116.4277292854053]
Federated Learning (FL) offers collaborative model training without data sharing.
FL is vulnerable to backdoor attacks, where poisoned model weights lead to compromised system integrity.
We present G$^2$uardFL, a protective framework that reinterprets the identification of malicious clients as an attributed graph clustering problem.
arXiv Detail & Related papers (2023-06-08T07:15:04Z)
- Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks [65.20660287833537]
In this paper we propose two extensions of the PGD attack that overcome failures caused by a suboptimal step size and by problems with the objective function. (A minimal baseline PGD loop, illustrating the fixed-step-size weakness, is sketched after this list.)
We then combine our novel attacks with two complementary existing ones to form a parameter-free, computationally affordable and user-independent ensemble of attacks to test adversarial robustness.
arXiv Detail & Related papers (2020-03-03T18:15:55Z)
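Since Greedy Coordinate Gradient features both in the main paper and in the GCG entry above, a compressed sketch of its inner loop may help: compute gradients of the target loss with respect to one-hot suffix tokens, shortlist the top-k substitutions per position, then greedily keep the best-scoring random candidate. This is a simplified illustration under assumptions (a Hugging Face-style causal LM that accepts `inputs_embeds`; placeholder names throughout), not the attack code used in any paper listed here.

```python
# Simplified single-step GCG sketch (illustrative only; not code from
# any paper listed here). Assumes a Hugging Face-style causal LM that
# accepts `inputs_embeds` and returns `.logits`.
import torch
import torch.nn.functional as F

def gcg_step(model, embed_matrix, prompt_ids, suffix_ids, target_ids,
             k=256, n_cand=64):
    """One greedy coordinate-descent step over an adversarial suffix."""
    def target_loss(suffix_embeds):
        # Loss = NLL of the harmful target continuation given the prompt
        # and the (embedded) adversarial suffix.
        inputs = torch.cat([embed_matrix[prompt_ids], suffix_embeds,
                            embed_matrix[target_ids]]).unsqueeze(0)
        logits = model(inputs_embeds=inputs).logits
        start = len(prompt_ids) + len(suffix_ids)  # first target position
        return F.cross_entropy(
            logits[0, start - 1 : start - 1 + len(target_ids)], target_ids)

    # Differentiable one-hot encoding of the suffix tokens.
    one_hot = F.one_hot(suffix_ids, num_classes=embed_matrix.shape[0])
    one_hot = one_hot.float().requires_grad_(True)
    loss = target_loss(one_hot @ embed_matrix)
    loss.backward()

    # Most promising substitutions: largest negative gradient per position.
    top_k = (-one_hot.grad).topk(k, dim=1).indices  # [suffix_len, k]

    # Evaluate random single-token swaps; keep the lowest-loss suffix.
    best_loss, best_suffix = loss.item(), suffix_ids
    for _ in range(n_cand):
        pos = torch.randint(len(suffix_ids), (1,)).item()
        cand = suffix_ids.clone()
        cand[pos] = top_k[pos, torch.randint(k, (1,)).item()]
        with torch.no_grad():
            cand_loss = target_loss(embed_matrix[cand]).item()
        if cand_loss < best_loss:
            best_loss, best_suffix = cand_loss, cand
    return best_suffix
```

Repeating this step until the model emits the target continuation is what makes GCG expensive, and it is why transferability of the resulting suffixes (as the main paper measures) matters in practice.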
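Likewise, the "suboptimal step size" failure mode named in the last entry concerns the fixed-step PGD loop that the paper's adaptive variants replace. Below is a minimal baseline L-infinity PGD sketch for orientation (illustrative only; `model`, `eps`, and `alpha` are placeholders, and this is the baseline being improved upon, not that paper's method).

```python
# Minimal L-infinity PGD baseline (illustrative only). Adaptive variants
# replace the fixed step size `alpha` with a schedule; with a poorly
# chosen fixed alpha this loop can fail to find adversarial examples.
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Untargeted PGD: maximize cross-entropy within an eps-ball around x."""
    # Random start inside the eps-ball, clipped to the valid pixel range.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Ascent step on the gradient sign, then project back into
        # the eps-ball and the valid input range.
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = (x + (x_adv - x).clamp(-eps, eps)).clamp(0, 1).detach()
    return x_adv
```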
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.