Negative Token Merging: Image-based Adversarial Feature Guidance
- URL: http://arxiv.org/abs/2412.01339v2
- Date: Thu, 05 Dec 2024 18:43:25 GMT
- Title: Negative Token Merging: Image-based Adversarial Feature Guidance
- Authors: Jaskirat Singh, Lindsey Li, Weijia Shi, Ranjay Krishna, Yejin Choi, Pang Wei Koh, Michael F. Cohen, Stephen Gould, Liang Zheng, Luke Zettlemoyer
- Abstract summary: We introduce negative token merging (NegToMe) to perform adversarial guidance through images. NegToMe selectively pushes apart matching visual features between reference and generated images during the reverse diffusion process. It significantly enhances output diversity and reduces visual similarity to copyrighted content by 34.57%.
- Score: 114.65069052244088
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-based adversarial guidance using a negative prompt has emerged as a widely adopted approach to steer diffusion models away from producing undesired concepts. While useful, performing adversarial guidance using text alone can be insufficient to capture complex visual concepts or avoid specific visual elements like copyrighted characters. In this paper, we explore for the first time an alternate modality in this direction by performing adversarial guidance directly using visual features from a reference image or other images in a batch. We introduce negative token merging (NegToMe), a simple but effective training-free approach that performs adversarial guidance through images by selectively pushing apart matching visual features between reference and generated images during the reverse diffusion process. By simply adjusting the reference used, NegToMe enables a diverse range of applications. Notably, when using other images in the same batch as reference, we find that NegToMe significantly enhances output diversity (e.g., racial, gender, visual) by guiding the features of each image away from the others. Similarly, when used w.r.t. copyrighted reference images, NegToMe reduces visual similarity to copyrighted content by 34.57%. NegToMe is simple to implement using just a few lines of code, incurs only marginally higher (<4%) inference time, and is compatible with different diffusion architectures, including those like Flux, which do not natively support the use of a negative prompt. Code is available at https://negtome.github.io
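The abstract notes that NegToMe takes only a few lines of code. The PyTorch sketch below illustrates one plausible reading of the core operation: match each generated token to its most similar reference token by cosine similarity, then linearly extrapolate away from the match. The function name, the matching rule, and the `alpha`/`threshold` hyperparameters are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def negtome(src: torch.Tensor, ref: torch.Tensor,
            alpha: float = 0.9, threshold: float = 0.5) -> torch.Tensor:
    """Push each generated token away from its best-matching reference token.

    src: (n, d) tokens of the image being generated (e.g., DiT features)
    ref: (m, d) tokens of the reference image
    alpha and threshold are assumed hyperparameters.
    """
    src_n = F.normalize(src, dim=-1)
    ref_n = F.normalize(ref, dim=-1)
    sim = src_n @ ref_n.T                      # (n, m) cosine similarities
    score, idx = sim.max(dim=-1)               # best reference match per token
    matched = ref[idx]                         # (n, d) matched reference tokens
    # Only tokens whose best match exceeds the threshold are pushed apart,
    # by extrapolating along the direction away from the matched token.
    mask = (score > threshold).unsqueeze(-1).float()
    return src + mask * alpha * (src - matched)
```

In practice such a step would presumably be applied to the transformer token features at selected denoising timesteps, with `ref` taken either from a copyrighted reference image or from the other images in the batch.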
Related papers
- StyleKeeper: Prevent Content Leakage using Negative Visual Query Guidance [29.94258634899353]
We propose negative visual query guidance (NVQG) to reduce the transfer of unwanted content. NVQG employs a negative score obtained by intentionally simulating content-leakage scenarios: it swaps the queries, instead of the keys and values, of the self-attention layers from the visual style prompt. Our method demonstrates superiority over existing approaches, reflecting the style of the references while ensuring that the resulting images match the text prompts.
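As a rough illustration of the query-swap described above, the sketch below computes a self-attention output whose queries come from the style reference while keys and values come from the content pass. All tensor names and the guidance combination are assumptions, not the paper's implementation.

```python
import torch

def swapped_query_attention(q_style: torch.Tensor, k_content: torch.Tensor,
                            v_content: torch.Tensor) -> torch.Tensor:
    """Attention with queries swapped in from the style reference, sketching
    how a content-leakage ("negative") pass could be simulated.
    All inputs are (tokens, dim) tensors; shapes are assumptions."""
    d = q_style.shape[-1]
    attn = torch.softmax(q_style @ k_content.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ v_content

# The leaked output could then serve as a negative score in a
# classifier-free-guidance-style update (w is an assumed guidance weight):
#   guided = positive_pred + w * (positive_pred - leaked_pred)
```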
arXiv Detail & Related papers (2025-10-08T09:50:34Z)
- Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning [71.14084801851381]
Change captioning aims to succinctly describe the semantic change between a pair of similar images.
Most existing methods directly capture the difference between them, which risks obtaining error-prone difference features.
We propose a distractors-immune representation learning network that correlates the corresponding channels of two image representations.
arXiv Detail & Related papers (2024-07-16T13:00:33Z)
- Optimizing Negative Prompts for Enhanced Aesthetics and Fidelity in Text-To-Image Generation [1.4138057640459576]
We propose NegOpt, a novel method for optimizing negative prompt generation to enhance image generation.
Our combined approach results in a substantial increase of 25% in Inception Score compared to other approaches.
arXiv Detail & Related papers (2024-03-12T12:44:34Z)
- Reference-based Motion Blur Removal: Learning to Utilize Sharpness in the Reference Image [29.52731707976695]
A typical setting is deblurring an image using a nearby sharp image in a video sequence.
This paper proposes a better method to use the information present in a reference image.
Our method can be integrated into pre-existing networks designed for single image deblurring.
arXiv Detail & Related papers (2023-07-06T09:24:55Z)
- High-Fidelity Guided Image Synthesis with Latent Diffusion Models [50.39294302741698]
Human user study results show that the proposed approach outperforms the previous state-of-the-art by over 85.32% on the overall user satisfaction scores.
arXiv Detail & Related papers (2022-11-30T15:43:20Z)
- Exploring Negatives in Contrastive Learning for Unpaired Image-to-Image Translation [12.754320302262533]
We introduce a new negative Pruning technique for Unpaired image-to-image Translation (PUT) that sparsifies and ranks the patches.
The proposed algorithm is efficient and flexible, and enables the model to stably learn the essential information shared between corresponding patches.
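A plausible minimal reading of "sparsifying and ranking the patches" is to rank candidate negative patches by similarity to the query and keep only the hardest ones before computing a contrastive loss. The sketch below illustrates this, with `keep` and `tau` as assumed hyperparameters.

```python
import torch
import torch.nn.functional as F

def pruned_patch_nce(query: torch.Tensor, positive: torch.Tensor,
                     negatives: torch.Tensor, keep: int = 64,
                     tau: float = 0.07) -> torch.Tensor:
    """Contrastive patch loss that ranks negatives by similarity and keeps
    only the top-`keep` hard ones before the InfoNCE computation.

    query, positive: (d,)   negatives: (N, d)
    """
    q = F.normalize(query, dim=-1)
    pos = F.normalize(positive, dim=-1)
    neg = F.normalize(negatives, dim=-1)
    neg_sim = neg @ q                                   # (N,) similarities
    hard = neg_sim.topk(min(keep, neg_sim.numel())).values  # hardest negatives
    logits = torch.cat([(q @ pos).view(1), hard]) / tau
    target = torch.zeros(1, dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits.unsqueeze(0), target)
```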
arXiv Detail & Related papers (2022-04-23T08:31:18Z)
- Modulated Contrast for Versatile Image Synthesis [60.304183493234376]
MoNCE is a versatile contrastive objective that introduces image contrast to learn a calibrated metric for the perception of multifaceted inter-image distances.
We introduce optimal transport in MoNCE to modulate the pushing force of negative samples collaboratively across multiple contrastive objectives.
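The optimal-transport modulation can be approximated with a few Sinkhorn iterations over the negative-similarity matrix, yielding a transport plan whose entries reweight each negative's pushing force. The sketch below is an illustrative stand-in, not MoNCE's exact formulation; `eps` and `iters` are assumptions.

```python
import torch

def sinkhorn_weights(sim: torch.Tensor, eps: float = 0.05,
                     iters: int = 5) -> torch.Tensor:
    """Run a few Sinkhorn iterations over a (B, B) similarity matrix and
    return a doubly-normalized transport plan for reweighting negatives."""
    K = torch.exp(sim / eps)                  # positive kernel matrix
    u = torch.ones(K.shape[0], device=K.device)
    for _ in range(iters):
        v = 1.0 / (K.T @ u)                   # column scaling
        u = 1.0 / (K @ v)                     # row scaling
    # Transport plan P = diag(u) K diag(v); its entries can modulate the
    # per-negative weights inside a contrastive (NCE) objective.
    return u.unsqueeze(1) * K * v.unsqueeze(0)
```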
arXiv Detail & Related papers (2022-03-17T14:03:46Z)
- Robust Contrastive Learning Using Negative Samples with Diminished Semantics [23.38896719740166]
We show that by generating carefully designed negative samples, contrastive learning can learn more robust representations.
We develop two methods, texture-based and patch-based augmentations, to generate negative samples.
We also analyze our method and the generated texture-based samples, showing that texture features are indispensable in classifying particular ImageNet classes.
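One simple way to build a patch-based negative with diminished semantics is to shuffle non-overlapping patches, destroying global structure while retaining local texture. The sketch below illustrates that idea and may differ from the paper's exact procedure.

```python
import torch

def texture_negative(img: torch.Tensor, grid: int = 8) -> torch.Tensor:
    """Shuffle non-overlapping patches of an image so that semantics are
    destroyed but local texture statistics survive.

    img: (C, H, W) with H and W divisible by `grid`.
    """
    c, h, w = img.shape
    ph, pw = h // grid, w // grid
    # Cut the image into a grid x grid set of patches.
    patches = img.unfold(1, ph, ph).unfold(2, pw, pw)   # (C, grid, grid, ph, pw)
    patches = patches.reshape(c, grid * grid, ph, pw)
    # Randomly permute patch positions.
    perm = torch.randperm(grid * grid, device=img.device)
    shuffled = patches[:, perm].reshape(c, grid, grid, ph, pw)
    # Stitch the shuffled patches back into an image.
    return shuffled.permute(0, 1, 3, 2, 4).reshape(c, h, w)
```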
arXiv Detail & Related papers (2021-10-27T05:38:00Z)
- Seed the Views: Hierarchical Semantic Alignment for Contrastive Representation Learning [116.91819311885166]
We propose a hierarchical semantic alignment strategy by expanding the views generated from a single image to cross-samples and multi-level representations.
Our method, termed CsMl, is able to integrate multi-level visual representations across samples in a robust way.
arXiv Detail & Related papers (2020-12-04T17:26:24Z)
- Contrastive Learning for Unpaired Image-to-Image Translation [64.47477071705866]
In image-to-image translation, each patch in the output should reflect the content of the corresponding patch in the input, independent of domain.
We propose a framework based on contrastive learning to maximize mutual information between the two.
We demonstrate that our framework enables one-sided translation in the unpaired image-to-image translation setting, while improving quality and reducing training time.
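The mutual-information objective is typically realized as a patch-wise InfoNCE loss in which the output patch at each location is matched to the input patch at the same location, with other input patches as negatives. A minimal sketch follows; the paper additionally samples patches and projects them through a small MLP head.

```python
import torch
import torch.nn.functional as F

def patch_nce_loss(feat_out: torch.Tensor, feat_in: torch.Tensor,
                   tau: float = 0.07) -> torch.Tensor:
    """PatchNCE-style loss over N corresponding patch locations.

    feat_out, feat_in: (N, d) features of the output and input patches;
    the diagonal of the similarity matrix holds the positive pairs.
    """
    q = F.normalize(feat_out, dim=-1)
    k = F.normalize(feat_in, dim=-1)
    logits = q @ k.T / tau                              # (N, N)
    targets = torch.arange(q.shape[0], device=q.device)
    return F.cross_entropy(logits, targets)
```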
arXiv Detail & Related papers (2020-07-30T17:59:58Z)
- Whitening for Self-Supervised Representation Learning [129.57407186848917]
We propose a new loss function for self-supervised representation learning (SSL) based on the whitening of latent-space features.
Our solution does not require asymmetric networks and is conceptually simple.
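A minimal sketch of the whitening idea follows: decorrelate each batch of embeddings to identity covariance, then pull positive pairs together with plain MSE. The eigendecomposition-based whitening here is an illustrative choice; the paper uses a Cholesky-based variant over sub-batches.

```python
import torch

def whitening_mse(z1: torch.Tensor, z2: torch.Tensor,
                  eps: float = 1e-5) -> torch.Tensor:
    """W-MSE-style objective for two augmented views of the same batch.

    z1, z2: (B, d) embeddings; B should exceed d for a stable covariance.
    """
    def whiten(z: torch.Tensor) -> torch.Tensor:
        z = z - z.mean(dim=0, keepdim=True)
        cov = (z.T @ z) / (z.shape[0] - 1)
        cov = cov + eps * torch.eye(z.shape[1], device=z.device)
        # Inverse square root of the covariance via eigendecomposition.
        evals, evecs = torch.linalg.eigh(cov)
        w = evecs @ torch.diag(evals.clamp_min(eps).rsqrt()) @ evecs.T
        return z @ w
    # After whitening, only the positive-pair distance needs minimizing;
    # the identity-covariance constraint prevents representational collapse.
    return ((whiten(z1) - whiten(z2)) ** 2).sum(dim=1).mean()
```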
arXiv Detail & Related papers (2020-07-13T12:33:25Z)
- Watching the World Go By: Representation Learning from Unlabeled Videos [78.22211989028585]
Recent single-image unsupervised representation learning techniques show remarkable success on a variety of tasks, relying on artificially produced augmentations of each image. In this paper, we argue that videos offer such natural augmentation for free.
We propose Video Noise Contrastive Estimation, a method for using unlabeled video to learn strong, transferable single image representations.
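In NCE terms, features from frames of the same video supply the positive pair, while frames from other videos supply negatives. The sketch below illustrates this; all names and hyperparameters are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def video_nce_loss(anchor: torch.Tensor, positive: torch.Tensor,
                   negatives: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Noise-contrastive loss where the positive is a feature from another
    frame of the same video (a "free" natural augmentation).

    anchor, positive: (d,)   negatives: (N, d) features from other videos
    """
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negatives, dim=-1)
    logits = torch.cat([(a @ p).view(1), n @ a]) / tau
    target = torch.zeros(1, dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits.unsqueeze(0), target)
```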
arXiv Detail & Related papers (2020-03-18T00:07:21Z)