Robustness in Both Domains: CLIP Needs a Robust Text Encoder
- URL: http://arxiv.org/abs/2506.03355v1
- Date: Tue, 03 Jun 2025 19:57:09 GMT
- Title: Robustness in Both Domains: CLIP Needs a Robust Text Encoder
- Authors: Elias Abad Rocamora, Christian Schlarmann, Naman Deep Singh, Yongtao Wu, Matthias Hein, Volkan Cevher
- Abstract summary: LEAF is an efficient adversarial finetuning method for the text domain, with the ability to scale to large CLIP models. Our models significantly improve the zero-shot adversarial accuracy in the text domain, while maintaining the vision performance provided by robust image encoders. We show that robust text encoders facilitate better reconstruction of input text from its embedding via direct optimization.
- Score: 65.42617172921975
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adversarial input attacks can cause a significant shift of CLIP embeddings. This can affect the downstream robustness of models incorporating CLIP in the pipeline, such as text-to-image generative models or large vision-language models. While some efforts have been made toward making CLIP image encoders robust, the robustness of text encoders remains unexplored. In this work, we close this gap in the literature. We propose LEAF: an efficient adversarial finetuning method for the text domain, with the ability to scale to large CLIP models. Our models significantly improve the zero-shot adversarial accuracy in the text domain, while maintaining the vision performance provided by robust image encoders. When combined with text-to-image diffusion models, we can improve the generation quality under adversarial noise. When employing our robust CLIP encoders in multimodal retrieval tasks, we improve the recall under adversarial noise over standard CLIP models. Finally, we show that robust text encoders facilitate better reconstruction of input text from its embedding via direct optimization.
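The sketch below illustrates the general idea of text-domain adversarial finetuning of a CLIP text encoder; it is not the paper's actual LEAF procedure. The random character substitutions standing in for a real text attack, the L2 distance to a frozen reference encoder as the loss, the model name, and all hyperparameters are assumptions made for illustration.

```python
# Minimal sketch of unsupervised adversarial finetuning for a CLIP text encoder,
# loosely in the spirit of LEAF. Assumptions (not from the paper): random
# character substitutions stand in for the actual text attack, and the loss is
# an L2 distance to embeddings from a frozen reference encoder.
import random
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

MODEL = "openai/clip-vit-base-patch32"  # illustrative choice
tokenizer = CLIPTokenizer.from_pretrained(MODEL)
text_encoder = CLIPTextModelWithProjection.from_pretrained(MODEL)        # finetuned
ref_encoder = CLIPTextModelWithProjection.from_pretrained(MODEL).eval()  # frozen anchor
for p in ref_encoder.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(text_encoder.parameters(), lr=1e-5)

def perturb(caption: str, k: int = 2) -> str:
    """Apply k random character substitutions (placeholder for a real attack)."""
    chars = list(caption)
    for _ in range(k):
        i = random.randrange(len(chars))
        chars[i] = random.choice("abcdefghijklmnopqrstuvwxyz ")
    return "".join(chars)

def embed(encoder, texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).text_embeds

captions = ["a photo of a dog playing in the park",
            "an astronaut riding a horse on the moon"]
for step in range(100):
    with torch.no_grad():
        clean = embed(ref_encoder, captions)              # anchor embeddings
    adv = embed(text_encoder, [perturb(c) for c in captions])
    loss = (adv - clean).pow(2).sum(dim=-1).mean()        # keep perturbed embeddings close
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In the paper, the attack and training procedure are designed to be efficient enough to scale to large CLIP models; a faithful reproduction would replace the placeholder `perturb` function with the paper's actual text attack.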
Related papers
- Decoder-Only LLMs are Better Controllers for Diffusion Models [63.22040456010123]
We propose to enhance text-to-image diffusion models by borrowing the strength of semantic understanding from large language models. Our adapter module is superior to state-of-the-art models in terms of text-to-image generation quality and reliability.
arXiv Detail & Related papers (2025-02-06T12:17:35Z) - FRAP: Faithful and Realistic Text-to-Image Generation with Adaptive Prompt Weighting [18.708185548091716]
FRAP is a simple yet effective approach based on adaptively adjusting the per-token prompt weights. We show FRAP generates images with significantly higher prompt-image alignment for prompts from complex datasets. We also explore combining FRAP with a prompt-rewriting LLM to recover their degraded prompt-image alignment.
arXiv Detail & Related papers (2024-08-21T15:30:35Z) - Improving Compositional Attribute Binding in Text-to-Image Generative Models via Enhanced Text Embeddings [46.723653095494896]
We investigate compositional attribute binding failures in text-to-image generative models. We show that imperfect text conditioning with the CLIP text encoder is one of the primary reasons behind the inability of these models to generate high-fidelity compositional scenes. Our main finding shows that significant compositional improvements can be achieved without harming the model's FID score.
arXiv Detail & Related papers (2024-06-12T03:21:34Z) - Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification [54.96876797812238]
We present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts.
The key to constructing a high-quality CODER lies in creating a large amount of high-quality and diverse text to match with images.
Experiment results across various datasets and models confirm CODER's effectiveness.
arXiv Detail & Related papers (2024-04-27T02:04:36Z) - TextCraftor: Your Text Encoder Can be Image Quality Controller [65.27457900325462]
Diffusion-based text-to-image generative models, e.g., Stable Diffusion, have revolutionized the field of content generation.
We propose a fine-tuning approach, TextCraftor, to enhance the performance of text-to-image diffusion models.
arXiv Detail & Related papers (2024-03-27T19:52:55Z) - Fine-tuning CLIP Text Encoders with Two-step Paraphrasing [83.3736789315201]
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z) - LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models [45.672539931681065]
We propose a multi-level interaction paradigm for training lightweight CLIP models.
We also propose an auxiliary fusion module that injects unmasked image embeddings into masked text embeddings.
arXiv Detail & Related papers (2023-12-01T15:54:55Z) - Enhancing Diffusion Models with Text-Encoder Reinforcement Learning [63.41513909279474]
Text-to-image diffusion models are typically trained to optimize the log-likelihood objective.
Recent research addresses this issue by refining the diffusion U-Net using human rewards through reinforcement learning or direct backpropagation.
We demonstrate that by finetuning the text encoder through reinforcement learning, we can enhance the text-image alignment of the results.
arXiv Detail & Related papers (2023-11-27T09:39:45Z) - Language-Driven Anchors for Zero-Shot Adversarial Robustness [25.160195547250655]
We propose a Language-driven, Anchor-based Adversarial Training (LAAT) strategy.
By leveraging the semantic consistency of the text encoders, LAAT aims to enhance the adversarial robustness of the image model.
We show that LAAT significantly improves zero-shot adversarial robustness over state-of-the-art methods.
arXiv Detail & Related papers (2023-01-30T17:34:43Z)