How Robust is Google's Bard to Adversarial Image Attacks?
- URL: http://arxiv.org/abs/2309.11751v2
- Date: Sat, 14 Oct 2023 12:56:13 GMT
- Title: How Robust is Google's Bard to Adversarial Image Attacks?
- Authors: Yinpeng Dong, Huanran Chen, Jiawei Chen, Zhengwei Fang, Xiao Yang,
Yichi Zhang, Yu Tian, Hang Su, Jun Zhu
- Abstract summary: Multimodal Large Language Models (MLLMs) that integrate text and other modalities (especially vision) have achieved unprecedented performance in various multimodal tasks.
However, due to the unsolved adversarial robustness problem of vision models, MLLMs can have more severe safety and security risks.
We study the adversarial robustness of Google's Bard to better understand the vulnerabilities of commercial MLLMs.
- Score: 45.92999116520135
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) that integrate text and other
modalities (especially vision) have achieved unprecedented performance in
various multimodal tasks. However, due to the unsolved adversarial robustness
problem of vision models, MLLMs can have more severe safety and security risks
by introducing the vision inputs. In this work, we study the adversarial
robustness of Google's Bard, a chatbot competing with ChatGPT that recently
released its multimodal capability, to better understand the vulnerabilities of
commercial MLLMs. By attacking white-box surrogate vision encoders or MLLMs,
the generated adversarial examples can mislead Bard into outputting wrong image
descriptions with a 22% success rate based solely on transferability. We
show that the adversarial examples can also attack other MLLMs, e.g., a 26%
attack success rate against Bing Chat and an 86% attack success rate against
ERNIE bot. Moreover, we identify two defense mechanisms of Bard, including face
detection and toxicity detection of images. We design corresponding attacks to
evade these defenses, demonstrating that the current defenses of Bard are also
vulnerable. We hope this work can deepen our understanding of the robustness of
MLLMs and facilitate future research on defenses. Our code is available at
https://github.com/thu-ml/Attack-Bard.
Update: GPT-4V became available in October 2023. We further evaluated its
robustness on the same set of adversarial examples, achieving a 45% attack
success rate.
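As a rough illustration of the surrogate-based transfer attack, the sketch below runs PGD within an L-infinity ball to push an image's embedding away from its clean embedding under a white-box surrogate encoder. The random linear encoder is a hypothetical stand-in for a real vision encoder such as CLIP's, and the step sizes and budget are illustrative, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical white-box surrogate: a random linear encoder standing in
# for a real vision encoder (e.g. CLIP's); encode(x) = W @ flatten(x).
W = rng.standard_normal((64, 3 * 32 * 32)) / 32.0

def encode(x):
    return W @ x.ravel()

def pgd_embedding_attack(image, eps=8 / 255, alpha=2 / 255, steps=20):
    """PGD inside an L-inf ball of radius eps that pushes the surrogate
    embedding away from the clean embedding; the hope is that the shift
    transfers to a black-box MLLM's own encoder."""
    clean_emb = encode(image)
    # Random start so the gradient is nonzero on the first step.
    adv = np.clip(image + rng.uniform(-eps, eps, image.shape), 0.0, 1.0)
    for _ in range(steps):
        diff = encode(adv) - clean_emb
        # Gradient of 0.5 * ||encode(adv) - clean_emb||^2 w.r.t. adv.
        grad = (W.T @ diff).reshape(image.shape)
        adv = adv + alpha * np.sign(grad)              # ascent step
        adv = image + np.clip(adv - image, -eps, eps)  # project to eps-ball
        adv = np.clip(adv, 0.0, 1.0)                   # stay a valid image
    return adv

image = rng.uniform(size=(3, 32, 32))
adv = pgd_embedding_attack(image)
```

In the paper's setting the perturbed image would then be submitted to Bard; the sketch only verifies that the embedding moves while the pixel change stays within the imperceptibility budget.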
Related papers
- Universal Adversarial Attack on Aligned Multimodal LLMs [1.5146068448101746]
We propose a universal adversarial attack on multimodal Large Language Models (LLMs).
We craft a synthetic image that forces the model to respond with a targeted phrase or otherwise unsafe content.
We will release code and datasets under the Apache-2.0 license.
arXiv Detail & Related papers (2025-02-11T22:07:47Z) - Effective Black-Box Multi-Faceted Attacks Breach Vision Large Language Model Guardrails [32.627286570942445]
MultiFaceted Attack is an attack framework designed to bypass Multi-Layered Defenses in Vision Large Language Models.
It exploits the multimodal nature of VLLMs to inject toxic system prompts through images.
It achieves a 61.56% attack success rate, surpassing state-of-the-art methods by at least 42.18%.
arXiv Detail & Related papers (2025-02-09T04:21:27Z) - Retention Score: Quantifying Jailbreak Risks for Vision Language Models [60.48306899271866]
Vision-Language Models (VLMs) are integrated with Large Language Models (LLMs) to enhance multi-modal machine learning capabilities.
This paper aims to assess the resilience of VLMs against jailbreak attacks that can compromise model safety compliance and result in harmful outputs.
To evaluate a VLM's ability to maintain its robustness against adversarial input perturbations, we propose a novel metric called the Retention Score.
arXiv Detail & Related papers (2024-12-23T13:05:51Z) - Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models [92.79804303337522]
Vision-Language Models (VLMs) may still be vulnerable to safety alignment issues.
We introduce MLAI, a novel jailbreak framework that leverages scenario-aware image generation for semantic alignment.
Extensive experiments demonstrate MLAI's significant impact, achieving attack success rates of 77.75% on MiniGPT-4 and 82.80% on LLaVA-2.
arXiv Detail & Related papers (2024-11-27T02:40:29Z) - The Best Defense is a Good Offense: Countering LLM-Powered Cyberattacks [2.6528263069045126]
Large language models (LLMs) could soon become integral to autonomous cyber agents.
We introduce novel defense strategies that exploit the inherent vulnerabilities of attacking LLMs.
Our results show defense success rates of up to 90%, demonstrating the effectiveness of turning LLM vulnerabilities into defensive strategies.
arXiv Detail & Related papers (2024-10-20T14:07:24Z) - White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within Large Vision-Language Models.
Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input.
An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
arXiv Detail & Related papers (2024-05-28T07:13:30Z) - SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs).
Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs.
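The procedure described above can be sketched in plain Python: multiple copies of the prompt are randomly perturbed at the character level and the per-copy outcomes are aggregated by majority vote. The toy LLM and the refusal-based judge below are hypothetical stand-ins, not the paper's actual models.

```python
import random
from collections import Counter

random.seed(0)

def perturb(prompt: str, q: float = 0.1) -> str:
    """Randomly overwrite a fraction q of characters, mimicking
    SmoothLLM's character-level perturbation."""
    chars = list(prompt)
    n_swap = max(1, int(q * len(chars)))
    for i in random.sample(range(len(chars)), n_swap):
        chars[i] = random.choice("abcdefghijklmnopqrstuvwxyz ")
    return "".join(chars)

def is_jailbroken(response: str) -> bool:
    # Stand-in judge: a real system inspects the response for harmful
    # content; here a missing refusal marker counts as a jailbreak.
    return "sorry" not in response.lower()

def smoothllm(prompt: str, llm, n_copies: int = 10) -> bool:
    """Majority vote over n perturbed copies: True means the prompt
    still jailbreaks the model after smoothing."""
    votes = [is_jailbroken(llm(perturb(prompt))) for _ in range(n_copies)]
    return Counter(votes).most_common(1)[0][0]

# Toy LLM: only an intact adversarial suffix makes it comply.
def toy_llm(prompt: str) -> str:
    return "Here you go!" if "zx9!suffix" in prompt else "Sorry, I can't help."
```

Because the hypothetical suffix is brittle, perturbation usually destroys it, most copies draw a refusal, and the vote tends toward "not jailbroken"; that brittleness is exactly the finding the defense exploits.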
arXiv Detail & Related papers (2023-10-05T17:01:53Z) - Baseline Defenses for Adversarial Attacks Against Aligned Language
Models [109.75753454188705]
Recent work shows that text optimizers can produce jailbreaking prompts that bypass moderation and alignment defenses.
We look at three types of defenses: detection (perplexity based), input preprocessing (paraphrase and retokenization), and adversarial training.
We find that the weakness of existing discrete optimizers for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs.
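The perplexity-based detection mentioned above can be sketched with a character-bigram language model standing in for the LLM a real detector would use; the idea is that suffixes found by discrete optimizers tend to look like high-perplexity gibberish. The corpus and the threshold of 20 below are illustrative assumptions.

```python
import math
from collections import Counter

def train_bigram(corpus: str):
    """Character-bigram model with add-one smoothing, a toy stand-in
    for the LLM whose perplexity a real detector would compute."""
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)
    vocab = len(set(corpus))
    def log_prob(a: str, b: str) -> float:
        return math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
    return log_prob

def perplexity(text: str, log_prob) -> float:
    lp = sum(log_prob(a, b) for a, b in zip(text, text[1:]))
    return math.exp(-lp / max(1, len(text) - 1))

corpus = "the quick brown fox jumps over the lazy dog " * 50
log_prob = train_bigram(corpus)

def flag_adversarial(prompt: str, threshold: float = 20.0) -> bool:
    # Flag prompts whose perplexity exceeds a calibrated threshold.
    return perplexity(prompt, log_prob) > threshold
```

Natural text scores close to the corpus statistics and passes, while optimizer-style gibberish lands far above the threshold; the adaptive-attack difficulty noted above comes from having to stay below such a filter while still optimizing the suffix.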
arXiv Detail & Related papers (2023-09-01T17:59:44Z) - Arms Race in Adversarial Malware Detection: A Survey [33.8941961394801]
Malicious software (malware) is a major cyber threat that is increasingly tackled with Machine Learning (ML) techniques.
ML is vulnerable to attacks known as adversarial examples.
Knowing the defender's feature set is critical to the success of transfer attacks.
The effectiveness of adversarial training depends on the defender's capability in identifying the most powerful attack.
arXiv Detail & Related papers (2020-05-24T07:20:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information (including all content) and is not responsible for any consequences of its use.