How Robust is Google's Bard to Adversarial Image Attacks?
- URL: http://arxiv.org/abs/2309.11751v2
- Date: Sat, 14 Oct 2023 12:56:13 GMT
- Title: How Robust is Google's Bard to Adversarial Image Attacks?
- Authors: Yinpeng Dong, Huanran Chen, Jiawei Chen, Zhengwei Fang, Xiao Yang,
Yichi Zhang, Yu Tian, Hang Su, Jun Zhu
- Abstract summary: Multimodal Large Language Models (MLLMs) that integrate text and other modalities (especially vision) have achieved unprecedented performance in various multimodal tasks.
However, due to the unsolved adversarial robustness problem of vision models, MLLMs can have more severe safety and security risks.
We study the adversarial robustness of Google's Bard to better understand the vulnerabilities of commercial MLLMs.
- Score: 45.92999116520135
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) that integrate text and other
modalities (especially vision) have achieved unprecedented performance in
various multimodal tasks. However, due to the unsolved adversarial robustness
problem of vision models, MLLMs can face more severe safety and security risks
once vision inputs are introduced. In this work, we study the adversarial
robustness of Google's Bard, a chatbot competitive with ChatGPT that recently
released its multimodal capability, to better understand the vulnerabilities of
commercial MLLMs. By attacking white-box surrogate vision encoders or MLLMs,
the generated adversarial examples can mislead Bard to output wrong image
descriptions with a 22% success rate based solely on transferability. We
show that the adversarial examples can also attack other MLLMs, e.g., a 26%
attack success rate against Bing Chat and an 86% attack success rate against
ERNIE bot. Moreover, we identify two defense mechanisms of Bard, including face
detection and toxicity detection of images. We design corresponding attacks to
evade these defenses, demonstrating that the current defenses of Bard are also
vulnerable. We hope this work can deepen our understanding of the robustness of
MLLMs and facilitate future research on defenses. Our code is available at
https://github.com/thu-ml/Attack-Bard.
Update: GPT-4V became available in October 2023. We further evaluate its
robustness under the same set of adversarial examples, achieving a 45% attack
success rate.
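As a companion to the abstract above, the following is a minimal sketch of the transfer-attack idea: craft an adversarial image against a white-box surrogate vision encoder and rely on transferability to fool a black-box MLLM. The surrogate model (ResNet-50 features), feature-deviation loss, and PGD hyperparameters below are illustrative assumptions, not the authors' exact setup, which is available in the linked repository.

```python
# Hedged sketch: L_inf PGD against a surrogate image encoder, pushing the
# adversarial image's features away from the clean image's features.
import torch
import torch.nn.functional as F
import torchvision.models as models

def pgd_feature_attack(image, surrogate, eps=16/255, alpha=2/255, steps=100):
    """Maximize the distance between clean and adversarial features."""
    surrogate.eval()
    with torch.no_grad():
        clean_feat = surrogate(image)
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = F.mse_loss(surrogate(adv), clean_feat)     # feature deviation
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv + alpha * grad.sign()               # ascend the loss
            adv = image + (adv - image).clamp(-eps, eps)  # project to L_inf ball
            adv = adv.clamp(0, 1)                         # keep valid pixel range
    return adv.detach()

# Stand-in surrogate encoder: ResNet-50 without its classifier head.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
surrogate = torch.nn.Sequential(*list(backbone.children())[:-1], torch.nn.Flatten())
x = torch.rand(1, 3, 224, 224)          # placeholder image tensor in [0, 1]
x_adv = pgd_feature_attack(x, surrogate)
```

In the transfer setting, the resulting adversarial image is then submitted to the black-box target (e.g., Bard), and the attack counts as successful if the returned image description is wrong.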
Related papers
- The Best Defense is a Good Offense: Countering LLM-Powered Cyberattacks [2.6528263069045126]
Large language models (LLMs) could soon become integral to autonomous cyber agents.
We introduce novel defense strategies that exploit the inherent vulnerabilities of attacking LLMs.
Our results show defense success rates of up to 90%, demonstrating the effectiveness of turning LLM vulnerabilities into defensive strategies.
arXiv Detail & Related papers (2024-10-20T14:07:24Z)
- Adversarial Attacks on Multimodal Agents [73.97379283655127]
Vision-enabled language models (VLMs) are now used to build autonomous multimodal agents capable of taking actions in real environments.
We show that multimodal agents raise new safety risks, even though attacking agents is more challenging than prior attacks due to limited access to and knowledge about the environment.
arXiv Detail & Related papers (2024-06-18T17:32:48Z)
- White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerabilities within Large Vision-Language Models.
Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input.
An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
arXiv Detail & Related papers (2024-05-28T07:13:30Z)
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs).
Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs (a minimal sketch of this perturb-and-aggregate idea appears after this list).
arXiv Detail & Related papers (2023-10-05T17:01:53Z)
- Baseline Defenses for Adversarial Attacks Against Aligned Language Models [109.75753454188705]
Recent work shows that text optimizers can produce jailbreaking prompts that bypass moderation and alignment.
We look at three types of defenses: detection (perplexity based), input preprocessing (paraphrase and retokenization), and adversarial training.
We find that the weakness of existing discrete optimizers for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs.
arXiv Detail & Related papers (2023-09-01T17:59:44Z)
- Universal and Transferable Adversarial Attacks on Aligned Language Models [118.41733208825278]
We propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors.
Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable.
arXiv Detail & Related papers (2023-07-27T17:49:12Z)
- Arms Race in Adversarial Malware Detection: A Survey [33.8941961394801]
Malicious software (malware) is a major cyber threat that has to be tackled with Machine Learning (ML) techniques.
ML is vulnerable to attacks known as adversarial examples.
Knowing the defender's feature set is critical to the success of transfer attacks.
The effectiveness of adversarial training depends on the defender's capability in identifying the most powerful attack.
arXiv Detail & Related papers (2020-05-24T07:20:42Z)
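As referenced in the SmoothLLM entry above, here is a minimal sketch of the perturb-and-aggregate idea. The `generate` callable, the 10% character-swap rate, the 8 copies, and the keyword-based refusal check are illustrative assumptions, not the paper's exact algorithm.

```python
# Hedged sketch of a perturb-and-aggregate jailbreak defense in the spirit of
# SmoothLLM. `generate` is a placeholder for any LLM call; the perturbation
# rate, copy count, and refusal heuristic are illustrative assumptions.
import random
import string
from collections import Counter
from typing import Callable

def perturb(prompt: str, rate: float = 0.10) -> str:
    """Randomly swap a fraction of characters, exploiting the brittleness of
    optimized adversarial suffixes to character-level noise."""
    chars = list(prompt)
    for i in range(len(chars)):
        if random.random() < rate:
            chars[i] = random.choice(string.printable)
    return "".join(chars)

def is_refusal(response: str) -> bool:
    # Crude stand-in for a jailbreak detector: common refusal phrases count as "safe".
    return any(kw in response.lower() for kw in ("i cannot", "i can't", "sorry"))

def smoothed_generate(prompt: str, generate: Callable[[str], str], copies: int = 8) -> str:
    """Run the LLM on several perturbed copies and follow the majority verdict."""
    responses = [generate(perturb(prompt)) for _ in range(copies)]
    verdicts = Counter(is_refusal(r) for r in responses)
    if verdicts[True] >= verdicts[False]:
        return "Request declined: the prompt was flagged as likely adversarial."
    # Return a response consistent with the majority (non-refusal) verdict.
    return next(r for r in responses if not is_refusal(r))
```

Because an optimized adversarial suffix must survive every random perturbation to keep working, the majority vote over perturbed copies tends to recover a refusal even when the unperturbed prompt would jailbreak the model.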