Related papers: A LLM Assisted Exploitation of AI-Guardian

A LLM Assisted Exploitation of AI-Guardian

URL: http://arxiv.org/abs/2307.15008v1
Date: Thu, 20 Jul 2023 17:33:25 GMT
Title: A LLM Assisted Exploitation of AI-Guardian
Authors: Nicholas Carlini
Abstract summary: We evaluate the robustness of AI-Guardian, a recent defense to adversarial examples published at IEEE S&P 2023. We write none of the code to attack this model, and instead prompt GPT-4 to implement all attack algorithms following our instructions and guidance. This process was surprisingly effective and efficient, with the language model at times producing code from ambiguous instructions faster than the author of this paper could have done.
Score: 57.572998144258705
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) are now highly capable at a diverse range of tasks. This paper studies whether or not GPT-4, one such LLM, is capable of assisting researchers in the field of adversarial machine learning. As a case study, we evaluate the robustness of AI-Guardian, a recent defense to adversarial examples published at IEEE S&P 2023, a top computer security conference. We completely break this defense: the proposed scheme does not increase robustness compared to an undefended baseline. We write none of the code to attack this model, and instead prompt GPT-4 to implement all attack algorithms following our instructions and guidance. This process was surprisingly effective and efficient, with the language model at times producing code from ambiguous instructions faster than the author of this paper could have done. We conclude by discussing (1) the warning signs present in the evaluation that suggested to us AI-Guardian would be broken, and (2) our experience with designing attacks and performing novel research using the most recent advances in language modeling.

Related papers

Paper Summary Attack: Jailbreaking LLMs through LLM Safety Papers [61.57691030102618]
We propose a novel jailbreaking method, Paper Summary Attack (llmnamePSA)<n>It synthesizes content from either attack-focused or defense-focused LLM safety paper to construct an adversarial prompt template.<n>Experiments show significant vulnerabilities not only in base LLMs, but also in state-of-the-art reasoning model like Deepseek-R1.
arXiv Detail & Related papers (2025-07-17T18:33:50Z)
Evaluating Software Plagiarism Detection in the Age of AI: Automated Obfuscation and Lessons for Academic Integrity [0.0]
Plagiarism in programming assignments is a persistent issue in computer science education.<n>Software plagiarism detectors are widely used to identify suspicious similarities at scale.<n>They are vulnerable to advanced obfuscation based on structural modification of program code.
arXiv Detail & Related papers (2025-05-26T15:59:01Z)
InjectLab: A Tactical Framework for Adversarial Threat Modeling Against Large Language Models [0.0]
This paper introduces InjectLab as a structured, open-source matrix that maps real-world techniques used to manipulate language models.<n>The framework is inspired by MITRE ATT&CK and focuses specifically on adversarial behavior at the prompt layer.<n>It includes over 25 techniques organized under six core tactics, covering threats like instruction override, identity swapping, and multi-agent exploitation.
arXiv Detail & Related papers (2025-04-16T05:00:56Z)
AutoAdvExBench: Benchmarking autonomous exploitation of adversarial example defenses [66.87883360545361]
AutoAdvExBench is a benchmark to evaluate if large language models (LLMs) can autonomously exploit defenses to adversarial examples. We design a strong agent that is capable of breaking 75% of CTF-like ("homework exercise") adversarial example defenses. We show that this agent is only able to succeed on 13% of the real-world defenses in our benchmark, indicating the large gap between difficulty in attacking "real" code, and CTF-like code.
arXiv Detail & Related papers (2025-03-03T18:39:48Z)
The TIP of the Iceberg: Revealing a Hidden Class of Task-in-Prompt Adversarial Attacks on LLMs [1.9424018922013224]
We present a novel class of jailbreak adversarial attacks on LLMs. Our approach embeds sequence-to-sequence tasks into the model's prompt to indirectly generate prohibited inputs. We demonstrate that our techniques successfully circumvent safeguards in six state-of-the-art language models.
arXiv Detail & Related papers (2025-01-27T12:48:47Z)
Defense Against Prompt Injection Attack by Leveraging Attack Techniques [66.65466992544728]
Large language models (LLMs) have achieved remarkable performance across various natural language processing (NLP) tasks. As LLMs continue to evolve, new vulnerabilities, especially prompt injection attacks arise. Recent attack methods leverage LLMs' instruction-following abilities and their inabilities to distinguish instructions injected in the data content.
arXiv Detail & Related papers (2024-11-01T09:14:21Z)
'Quis custodiet ipsos custodes?' Who will watch the watchmen? On Detecting AI-generated peer-reviews [20.030884734361358]
There is a growing concern that AI-generated texts could compromise scientific publishing, including peer-review. We introduce the Term Frequency (TF) model, which posits that AI often repeats tokens, and the Review Regeneration (RR) model, which is based on the idea that ChatGPT generates similar outputs upon re-prompting. Our findings suggest both our proposed methods perform better than the other AI text detectors.
arXiv Detail & Related papers (2024-10-13T08:06:08Z)
Large Language Models for Secure Code Assessment: A Multi-Language Empirical Study [1.9116784879310031]
We show that GPT-4o achieves the highest vulnerability detection and CWE classification scores using a few-shot setting. We develop a library called CODEGUARDIAN integrated with VSCode which enables developers to perform LLM-assisted real-time vulnerability analysis.
arXiv Detail & Related papers (2024-08-12T18:10:11Z)
A Preliminary Study on Using Large Language Models in Software Pentesting [2.0551676463612636]
Large language models (LLM) are perceived to offer promising potentials for automating security tasks. We investigate the use of LLMs in software pentesting, where the main task is to automatically identify software security vulnerabilities in source code.
arXiv Detail & Related papers (2024-01-30T21:42:59Z)
A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models [0.0]
This article explores two attack categories: attacks on models themselves and attacks on model applications. The former requires expertise, access to model data, and significant implementation time. The latter is more accessible to attackers and has seen increased attention.
arXiv Detail & Related papers (2023-12-18T07:07:32Z)
Towards more Practical Threat Models in Artificial Intelligence Security [66.67624011455423]
Recent works have identified a gap between research and practice in artificial intelligence security. We revisit the threat models of the six most studied attacks in AI security research and match them to AI usage in practice.
arXiv Detail & Related papers (2023-11-16T16:09:44Z)
Baseline Defenses for Adversarial Attacks Against Aligned Language Models [109.75753454188705]
Recent work shows that text moderations can produce jailbreaking prompts that bypass defenses. We look at three types of defenses: detection (perplexity based), input preprocessing (paraphrase and retokenization), and adversarial training. We find that the weakness of existing discretes for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs.
arXiv Detail & Related papers (2023-09-01T17:59:44Z)
Identifying and Mitigating the Security Risks of Generative AI [179.2384121957896]
This paper reports the findings of a workshop held at Google on the dual-use dilemma posed by GenAI. GenAI can be used just as well by attackers to generate new attacks and increase the velocity and efficacy of existing attacks. We discuss short-term and long-term goals for the community on this topic.
arXiv Detail & Related papers (2023-08-28T18:51:09Z)
Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks. We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations. All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.