Can Adversarial Code Comments Fool AI Security Reviewers -- Large-Scale Empirical Study of Comment-Based Attacks and Defenses Against LLM Code Analysis
- URL: http://arxiv.org/abs/2602.16741v1
- Date: Wed, 18 Feb 2026 00:34:17 GMT
- Title: Can Adversarial Code Comments Fool AI Security Reviewers -- Large-Scale Empirical Study of Comment-Based Attacks and Defenses Against LLM Code Analysis
- Authors: Scott Thornton
- Abstract summary: Adversarial comments produce small, statistically non-significant effects on detection accuracy. Complex adversarial strategies offer no advantage over simple manipulative comments. Comment stripping reduces detection for weaker models by removing helpful context.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: AI-assisted code review is widely used to detect vulnerabilities before production release. Prior work shows that adversarial prompt manipulation can degrade large language model (LLM) performance in code generation. We test whether similar comment-based manipulation misleads LLMs during vulnerability detection. We build a 100-sample benchmark across Python, JavaScript, and Java, each paired with eight comment variants ranging from no comments to adversarial strategies such as authority spoofing and technical deception. Eight frontier models, five commercial and three open-source, are evaluated in 9,366 trials. Adversarial comments produce small, statistically non-significant effects on detection accuracy (McNemar exact p > 0.21; all 95 percent confidence intervals include zero). This holds for commercial models with 89 to 96 percent baseline detection and open-source models with 53 to 72 percent, despite large absolute performance gaps. Unlike generation settings where comment manipulation achieves high attack success, detection performance does not meaningfully degrade. More complex adversarial strategies offer no advantage over simple manipulative comments. We test four automated defenses across 4,646 additional trials (14,012 total). Static analysis cross-referencing performs best at 96.9 percent detection and recovers 47 percent of baseline misses. Comment stripping reduces detection for weaker models by removing helpful context. Failures concentrate on inherently difficult vulnerability classes, including race conditions, timing side channels, and complex authorization logic, rather than on adversarial comments.
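The abstract reports McNemar exact p-values for the paired comparison between comment variants. As a rough illustration of how such a test works (a minimal sketch, not the authors' analysis code), the two-sided exact McNemar p-value can be computed from the discordant pair counts alone. The function name `mcnemar_exact` and the counts below are hypothetical and are not taken from the paper.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: samples detected under condition A but missed under condition B
    c: samples missed under condition A but detected under condition B
    Under the null hypothesis, each discordant pair falls on either side
    with probability 0.5, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of any difference
    k = min(b, c)
    # Probability of a split at least as lopsided as the observed one (one tail).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the tail for a two-sided test and cap at 1.
    return min(1.0, 2 * tail)


# Hypothetical counts: 7 vulnerabilities caught only with clean comments,
# 3 caught only with adversarial comments.
p_value = mcnemar_exact(7, 3)
print(f"exact McNemar p = {p_value:.4f}")
```

Only the discordant pairs enter the calculation: samples that a model detects (or misses) under both comment variants carry no information about a difference between the variants.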
Related papers
- Bridging Expert Reasoning and LLM Detection: A Knowledge-Driven Framework for Malicious Packages [10.858565849895314]
Open-source ecosystems such as NPM and PyPI are increasingly targeted by supply chain attacks. We present IntelGuard, a retrieval-augmented generation (RAG) based framework that integrates expert analytical reasoning into automated malicious package detection.
arXiv Detail & Related papers (2026-01-23T05:31:12Z) - Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation [76.5533899503582]
Large language models (LLMs) are increasingly used as judges to evaluate agent performance. We show this paradigm implicitly assumes that the agent's chain-of-thought (CoT) reasoning faithfully reflects both its internal reasoning and the underlying environment state. We demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks.
arXiv Detail & Related papers (2026-01-21T06:07:43Z) - AI Security Beyond Core Domains: Resume Screening as a Case Study of Adversarial Vulnerabilities in Specialized LLM Applications [71.27518152526686]
Large Language Models (LLMs) excel at text comprehension and generation, making them ideal for automated tasks like code review and content moderation. LLMs can be manipulated by "adversarial instructions" hidden in input data, such as resumes or code, causing them to deviate from their intended task. This paper introduces a benchmark to assess this vulnerability in resume screening, revealing attack success rates exceeding 80% for certain attack types.
arXiv Detail & Related papers (2025-12-23T08:42:09Z) - DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. We propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z) - VADER: A Human-Evaluated Benchmark for Vulnerability Assessment, Detection, Explanation, and Remediation [0.8087612190556891]
VADER comprises 174 real-world software vulnerabilities, each carefully curated from GitHub and annotated by security experts. For each vulnerability case, models are tasked with identifying the flaw, classifying it using Common Weakness Enumeration (CWE), explaining its underlying cause, proposing a patch, and formulating a test plan. Using a one-shot prompting strategy, we benchmark six state-of-the-art LLMs (Claude 3.7 Sonnet, Gemini 2.5 Pro, GPT-4.1, GPT-4.5, Grok 3 Beta, and o3) on VADER. Our results show that current state-of-the-
arXiv Detail & Related papers (2025-05-26T01:20:44Z) - Defending against Indirect Prompt Injection by Instruction Detection [109.30156975159561]
InstructDetector is a novel detection-based approach that leverages the behavioral states of LLMs to identify potential IPI attacks. InstructDetector achieves a detection accuracy of 99.60% in the in-domain setting and 96.90% in the out-of-domain setting, and reduces the attack success rate to just 0.03% on the BIPIA benchmark.
arXiv Detail & Related papers (2025-05-08T13:04:45Z) - AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security [74.22452069013289]
AegisLLM is a cooperative multi-agent defense against adversarial attacks and information leakage. We show that scaling agentic reasoning system at test-time substantially enhances robustness without compromising model utility. Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM.
arXiv Detail & Related papers (2025-04-29T17:36:05Z) - Evaluate-and-Purify: Fortifying Code Language Models Against Adversarial Attacks Using LLM-as-a-Judge [3.1656947459658813]
We show that over 80% of adversarial examples generated by identifier substitution attackers are actually detectable. We propose EP-Shield, a unified framework for evaluating and purifying identifier substitution attacks.
arXiv Detail & Related papers (2025-04-28T12:28:55Z) - Can LLM Prompting Serve as a Proxy for Static Analysis in Vulnerability Detection [9.269926508651091]
Large language models (LLMs) have shown limited ability on safety-critical code tasks such as vulnerability detection. We propose prompting strategies that integrate natural language instructions of vulnerabilities with contrastive chain-of-thought reasoning. Our findings demonstrate that security-aware prompting techniques can be effective alternatives to the laborious, hand-crafted rules of static analyzers.
arXiv Detail & Related papers (2024-12-16T18:08:14Z) - Verifying the Robustness of Automatic Credibility Assessment [50.55687778699995]
We show that meaning-preserving changes in input text can mislead the models.
We also introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions.
arXiv Detail & Related papers (2023-03-14T16:11:47Z) - RS-Del: Edit Distance Robustness Certificates for Sequence Classifiers
via Randomized Deletion [23.309600117618025]
We adapt randomized smoothing for discrete sequence classifiers to provide certified robustness against edit distance-bounded adversaries.
Our proof of certification deviates from the established Neyman-Pearson approach, which is intractable in our setting, and is instead organized around longest common subsequences.
When applied to the popular MalConv malware detection model, our smoothing mechanism RS-Del achieves a certified accuracy of 91% at an edit distance radius of 128 bytes.
arXiv Detail & Related papers (2023-01-31T01:40:26Z) - Practical Evaluation of Adversarial Robustness via Adaptive Auto Attack [96.50202709922698]
A practical evaluation method should be convenient (i.e., parameter-free), efficient (i.e., fewer iterations) and reliable.
We propose a parameter-free Adaptive Auto Attack (A$^3$) evaluation method which addresses the efficiency and reliability in a test-time-training fashion.
arXiv Detail & Related papers (2022-03-10T04:53:54Z)