A LLM Assisted Exploitation of AI-Guardian
- URL: http://arxiv.org/abs/2307.15008v1
- Date: Thu, 20 Jul 2023 17:33:25 GMT
- Title: A LLM Assisted Exploitation of AI-Guardian
- Authors: Nicholas Carlini
- Abstract summary: We evaluate the robustness of AI-Guardian, a recent defense to adversarial examples published at IEEE S&P 2023.
We write none of the code to attack this model, and instead prompt GPT-4 to implement all attack algorithms following our instructions and guidance.
This process was surprisingly effective and efficient, with the language model at times producing code from ambiguous instructions faster than the author of this paper could have done.
- Score: 57.572998144258705
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are now highly capable at a diverse range of
tasks. This paper studies whether or not GPT-4, one such LLM, is capable of
assisting researchers in the field of adversarial machine learning. As a case
study, we evaluate the robustness of AI-Guardian, a recent defense to
adversarial examples published at IEEE S&P 2023, a top computer security
conference. We completely break this defense: the proposed scheme does not
increase robustness compared to an undefended baseline.
We write none of the code to attack this model, and instead prompt GPT-4 to
implement all attack algorithms following our instructions and guidance. This
process was surprisingly effective and efficient, with the language model at
times producing code from ambiguous instructions faster than the author of this
paper could have done. We conclude by discussing (1) the warning signs present
in the evaluation that suggested to us AI-Guardian would be broken, and (2) our
experience with designing attacks and performing novel research using the most
recent advances in language modeling.
Related papers
- Defense Against Prompt Injection Attack by Leveraging Attack Techniques [66.65466992544728]
Large language models (LLMs) have achieved remarkable performance across various natural language processing (NLP) tasks.
As LLMs continue to evolve, new vulnerabilities arise, especially prompt injection attacks.
Recent attack methods exploit LLMs' instruction-following abilities and their inability to distinguish instructions injected into the data content.
arXiv Detail & Related papers (2024-11-01T09:14:21Z)
- 'Quis custodiet ipsos custodes?' Who will watch the watchmen? On Detecting AI-generated peer-reviews [20.030884734361358]
There is a growing concern that AI-generated texts could compromise scientific publishing, including peer-review.
We introduce the Term Frequency (TF) model, which posits that AI often repeats tokens, and the Review Regeneration (RR) model, which is based on the idea that ChatGPT generates similar outputs upon re-prompting.
Our findings suggest both our proposed methods perform better than the other AI text detectors.
arXiv Detail & Related papers (2024-10-13T08:06:08Z)
- Large Language Models for Secure Code Assessment: A Multi-Language Empirical Study [1.9116784879310031]
We show that GPT-4o achieves the highest vulnerability detection and CWE classification scores using a few-shot setting.
We develop a library called CODEGUARDIAN integrated with VSCode which enables developers to perform LLM-assisted real-time vulnerability analysis.
arXiv Detail & Related papers (2024-08-12T18:10:11Z)
- A Preliminary Study on Using Large Language Models in Software Pentesting [2.0551676463612636]
Large language models (LLMs) are perceived to offer promising potential for automating security tasks.
We investigate the use of LLMs in software pentesting, where the main task is to automatically identify software security vulnerabilities in source code.
arXiv Detail & Related papers (2024-01-30T21:42:59Z)
- A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models [0.0]
This article explores two attack categories: attacks on models themselves and attacks on model applications.
The former requires expertise, access to model data, and significant implementation time.
The latter is more accessible to attackers and has seen increased attention.
arXiv Detail & Related papers (2023-12-18T07:07:32Z)
- Towards more Practical Threat Models in Artificial Intelligence Security [66.67624011455423]
Recent works have identified a gap between research and practice in artificial intelligence security.
We revisit the threat models of the six most studied attacks in AI security research and match them to AI usage in practice.
arXiv Detail & Related papers (2023-11-16T16:09:44Z)
- Baseline Defenses for Adversarial Attacks Against Aligned Language Models [109.75753454188705]
Recent work shows that text optimizers can produce jailbreaking prompts that bypass defenses.
We look at three types of defenses: detection (perplexity based), input preprocessing (paraphrase and retokenization), and adversarial training.
We find that the weakness of existing discrete optimizers for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs.
arXiv Detail & Related papers (2023-09-01T17:59:44Z)
- Identifying and Mitigating the Security Risks of Generative AI [179.2384121957896]
This paper reports the findings of a workshop held at Google on the dual-use dilemma posed by GenAI.
GenAI can be used just as well by attackers to generate new attacks and increase the velocity and efficacy of existing attacks.
We discuss short-term and long-term goals for the community on this topic.
arXiv Detail & Related papers (2023-08-28T18:51:09Z)
- Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)