AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization
- URL: http://arxiv.org/abs/2504.01735v1
- Date: Wed, 02 Apr 2025 13:43:21 GMT
- Title: AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization
- Authors: Chaohu Liu, Tianyi Gui, Yu Liu, Linli Xu
- Abstract summary: We propose AdPO, a novel adversarial defense strategy for LVLMs based on preference optimization. For the first time, we reframe adversarial training as a preference optimization problem, aiming to enhance the model's preference for generating normal outputs on clean inputs. We validate that training on smaller LVLMs can achieve competitive performance while maintaining efficiency comparable to baseline methods.
- Score: 11.381262184752234
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Vision-Language Models (LVLMs), such as GPT-4o and LLaVA, have recently witnessed remarkable advancements and are increasingly being deployed in real-world applications. However, inheriting the sensitivity of visual neural networks, LVLMs remain vulnerable to adversarial attacks, which can result in erroneous or malicious outputs. While existing efforts utilize adversarial fine-tuning to enhance robustness, they often suffer from performance degradation on clean inputs. In this paper, we propose AdPO, a novel adversarial defense strategy for LVLMs based on preference optimization. For the first time, we reframe adversarial training as a preference optimization problem, aiming to enhance the model's preference for generating normal outputs on clean inputs while rejecting potentially misleading outputs on adversarial examples. Notably, AdPO achieves this by solely modifying the image encoder, e.g., CLIP ViT, resulting in superior clean and adversarial performance in a variety of downstream tasks. Because training involves large language models (LLMs), the computational cost increases significantly. We validate that training on smaller LVLMs and subsequently transferring to larger models can achieve competitive performance while maintaining efficiency comparable to baseline methods. Our comprehensive experiments confirm the effectiveness of the proposed AdPO, which provides a novel perspective for future adversarial defense research.
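The abstract does not state AdPO's training objective. As a rough illustration of what a preference optimization loss looks like, the sketch below implements the standard DPO loss for a single preference pair in plain Python. Treating the "chosen" response as the normal output and the "rejected" response as the misleading output is an assumption made here for illustration, as are the function names and the value of `beta`. This is not the authors' implementation.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # hypothetical value; controls deviation from the reference model
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a response under the
    trainable policy or the frozen reference model.
    """
    # Implicit reward margins: how much more the policy likes each response
    # than the reference model does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # The loss shrinks as the chosen margin grows relative to the rejected one.
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# When the policy equals the reference, the loss is log(2).
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# A policy that raises the chosen response and lowers the rejected one scores lower.
improved = dpo_loss(-1.0, -5.0, -2.0, -2.0)
```

In a setting like the one the abstract describes, the log-probabilities would come from an LVLM whose image encoder is being trained; here they are plain floats so the loss can be checked in isolation.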
Related papers
- Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation [29.579349371114702]
Direct Preference Optimization (DPO) is a cost-effective alternative to reinforcement learning (RL) for large language models (LLMs).
We show that a single round of DPO with coarse filtering significantly enhances mathematical reasoning performance.
With simple verifiable rewards, our model achieves RL-level performance with significantly lower computational overhead.
arXiv Detail & Related papers (2025-03-17T06:28:25Z)
- A Survey of Direct Preference Optimization [103.59317151002693]
Large Language Models (LLMs) have demonstrated unprecedented generative capabilities.
Their alignment with human values remains critical for ensuring helpful and harmless deployments.
Direct Preference Optimization (DPO) has recently gained prominence as a streamlined alternative.
arXiv Detail & Related papers (2025-03-12T08:45:15Z)
- From Captions to Rewards (CAREVL): Leveraging Large Language Model Experts for Enhanced Reward Modeling in Large Vision-Language Models [58.16075709485292]
CAREVL is a novel method for preference reward modeling by reliably using both high- and low-confidence data.
CAREVL achieves performance improvements over traditional distillation-based methods on VL-RewardBench and MLLM-as-a-Judge benchmark.
arXiv Detail & Related papers (2025-03-08T16:13:18Z)
- A Simple and Effective Reinforcement Learning Method for Text-to-Image Diffusion Fine-tuning [61.403275660120606]
Reinforcement learning (RL)-based fine-tuning has emerged as a powerful approach for aligning diffusion models with black-box objectives.
We propose leave-one-out PPO (LOOP), a novel RL for diffusion fine-tuning method.
Our results demonstrate that LOOP effectively improves diffusion models on various black-box objectives, and achieves a better balance between computational efficiency and performance.
arXiv Detail & Related papers (2025-03-02T13:43:53Z)
- Efficient and Effective Universal Adversarial Attack against Vision-Language Pre-training Models [14.828324088905772]
Non-universal adversarial attacks are often impractical for real-time online applications due to their high computational demands per data instance.
We propose a direct optimization-based UAP approach, termed DO-UAP, which significantly reduces resource consumption while maintaining high attack performance.
arXiv Detail & Related papers (2024-10-15T14:29:47Z)
- CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs [37.98496239547762]
Large Vision Language Models are prone to hallucinating details like objects and their properties or relations, limiting their real-world deployment.
We present CLIP-DPO, a preference optimization method that leverages contrastively pre-trained Vision-Language (VL) embedding models, such as CLIP, for DPO-based optimization of LVLMs.
arXiv Detail & Related papers (2024-08-19T21:56:20Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content [62.685566387625975]
Current mitigation strategies, while effective, are not resilient under adversarial attacks.
This paper introduces Resilient Guardrails for Large Language Models (RigorLLM), a novel framework designed to efficiently moderate harmful and unsafe inputs.
arXiv Detail & Related papers (2024-03-19T07:25:02Z)
- Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation [46.61909578101735]
Adversarial Policy Optimization (AdvPO) is a novel solution to the pervasive issue of reward over-optimization in Reinforcement Learning from Human Feedback.
In this paper, we introduce a lightweight way to quantify uncertainties in rewards, relying solely on the last layer embeddings of the reward model.
arXiv Detail & Related papers (2024-03-08T09:20:12Z)
- On Evaluating Adversarial Robustness of Large Vision-Language Models [64.66104342002882]
We evaluate the robustness of large vision-language models (VLMs) in the most realistic and high-risk setting.
In particular, we first craft targeted adversarial examples against pretrained models such as CLIP and BLIP.
Black-box queries on these VLMs can further improve the effectiveness of targeted evasion.
arXiv Detail & Related papers (2023-05-26T13:49:44Z)
- A Prompting-based Approach for Adversarial Example Generation and Robustness Enhancement [18.532308729844598]
We propose a novel prompt-based adversarial attack to compromise NLP models.
We generate adversarial examples via mask-and-filling under the effect of a malicious purpose.
Because our training method does not actually generate adversarial samples, it can be applied to large-scale training sets efficiently.
arXiv Detail & Related papers (2022-03-21T03:21:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.