Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images
with Free Attention Masks
- URL: http://arxiv.org/abs/2308.06739v1
- Date: Sun, 13 Aug 2023 10:07:46 GMT
- Title: Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images
with Free Attention Masks
- Authors: David Junhao Zhang, Mutian Xu, Chuhui Xue, Wenqing Zhang, Xiaoguang
Han, Song Bai, Mike Zheng Shou
- Abstract summary: Text-to-image diffusion models have shown great potential for benefiting image recognition.
Although promising, there has been inadequate exploration dedicated to unsupervised learning on diffusion-generated images.
We introduce customized solutions that fully exploit the free attention masks provided by the diffusion models' cross-attention layers.
- Score: 64.67735676127208
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the rapid advancement of unsupervised learning in visual
representation, it requires training on large-scale datasets that demand costly
data collection and pose additional challenges due to concerns regarding data
privacy. Recently, synthetic images generated by text-to-image diffusion
models have shown great potential for benefiting image recognition. Although
promising, there has been inadequate exploration dedicated to unsupervised
learning on diffusion-generated images. To address this, we start by uncovering
that diffusion models' cross-attention layers inherently provide
annotation-free attention masks aligned with corresponding text inputs on
generated images. We then investigate the problems of three prevalent
unsupervised learning techniques (i.e., contrastive learning, masked modeling,
and vision-language pretraining) and introduce customized solutions by fully
exploiting the aforementioned free attention masks. Our approach is validated
through extensive experiments that show consistent improvements in baseline
models across various downstream tasks, including image classification,
detection, segmentation, and image-text retrieval. By utilizing our method, it
is possible to close the performance gap between unsupervised pretraining on
synthetic data and real-world scenarios.
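As a rough illustration of what an annotation-free attention mask could look like, the sketch below averages cross-attention weights for one text token and thresholds them into a binary mask. It is not the authors' procedure: the array layout `(steps, heads, H*W, tokens)`, the function name `attention_to_mask`, and the `threshold` value are all assumptions made for this example.

```python
import numpy as np


def attention_to_mask(attn, token_index, threshold=0.5):
    """Turn cross-attention weights into a binary mask for one text token.

    attn: array of shape (steps, heads, H*W, tokens). This layout is a
    hypothetical one chosen for the sketch, with a square spatial grid.
    """
    # Average the chosen token's attention over denoising steps and heads.
    token_map = attn[..., token_index].mean(axis=(0, 1))  # shape (H*W,)

    # Recover the square spatial grid.
    side = int(round(np.sqrt(token_map.shape[0])))
    token_map = token_map.reshape(side, side)

    # Min-max normalise to [0, 1], then threshold (threshold is an assumption).
    lo, hi = token_map.min(), token_map.max()
    normalised = (token_map - lo) / (hi - lo + 1e-8)
    return normalised >= threshold


# Toy input: token 1 attends only to the top-left 2x2 patch of a 4x4 grid.
toy_attn = np.zeros((2, 4, 16, 3))
for position in (0, 1, 4, 5):
    toy_attn[:, :, position, 1] = 1.0
toy_mask = attention_to_mask(toy_attn, token_index=1)
```

In practice the attention maps would come from hooks on a diffusion model's cross-attention layers and would usually need upsampling to the image resolution; both steps are omitted here.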
Related papers
- Towards Reliable Verification of Unauthorized Data Usage in Personalized Text-to-Image Diffusion Models [23.09033991200197]
New personalization techniques have been proposed to customize the pre-trained base models for crafting images with specific themes or styles.
Such a lightweight solution poses a new concern regarding whether the personalized models are trained from unauthorized data.
We introduce SIREN, a novel methodology to proactively trace unauthorized data usage in black-box personalized text-to-image diffusion models.
arXiv Detail & Related papers (2024-10-14T12:29:23Z)
- Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation [60.943159830780154]
We introduce Bounded Attention, a training-free method for bounding the information flow in the sampling process.
We demonstrate that our method empowers the generation of multiple subjects that better align with given prompts and layouts.
arXiv Detail & Related papers (2024-03-25T17:52:07Z)
- Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention [62.671435607043875]
Research indicates that text-to-image diffusion models replicate images from their training data, raising tremendous concerns about potential copyright infringement and privacy risks.
We reveal that during memorization, the cross-attention tends to focus disproportionately on the embeddings of specific tokens.
We introduce an innovative approach to detect and mitigate memorization in diffusion models.
arXiv Detail & Related papers (2024-03-17T01:27:00Z)
- Masking Improves Contrastive Self-Supervised Learning for ConvNets, and Saliency Tells You Where [63.61248884015162]
We aim to alleviate the burden of incorporating masking operations into the contrastive-learning framework for convolutional neural networks.
We propose to explicitly take a saliency constraint into consideration so that the masked regions are more evenly distributed among the foreground and background.
arXiv Detail & Related papers (2023-09-22T09:58:38Z)
- The Devil is in the Frequency: Geminated Gestalt Autoencoder for
Self-Supervised Visual Pre-Training [13.087987450384036]
We present a new Masked Image Modeling (MIM) method, termed Geminated Gestalt Autoencoder (Ge$^2$-AE), for visual pre-training.
Specifically, we equip our model with geminated decoders in charge of reconstructing image contents from both pixel and frequency space.
arXiv Detail & Related papers (2022-04-18T09:22:55Z)
- Intelligent Masking: Deep Q-Learning for Context Encoding in Medical
Image Analysis [48.02011627390706]
We develop a novel self-supervised approach that occludes targeted regions to improve the pre-training procedure.
We show that training the agent against the prediction model can significantly improve the semantic features extracted for downstream classification tasks.
arXiv Detail & Related papers (2022-03-25T19:05:06Z)
- Proactive Pseudo-Intervention: Causally Informed Contrastive Learning
For Interpretable Vision Models [103.64435911083432]
We present a novel contrastive learning strategy called Proactive Pseudo-Intervention (PPI).
PPI leverages proactive interventions to guard against image features with no causal relevance.
We also devise a novel causally informed salience mapping module to identify key image pixels to intervene, and show it greatly facilitates model interpretability.
arXiv Detail & Related papers (2020-12-06T20:30:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.