Related papers: Diffusion Model with Cross Attention as an Inductive Bias for Disentanglement

Diffusion Model with Cross Attention as an Inductive Bias for Disentanglement

URL: http://arxiv.org/abs/2402.09712v2
Date: Wed, 12 Jun 2024 15:20:36 GMT
Title: Diffusion Model with Cross Attention as an Inductive Bias for Disentanglement
Authors: Tao Yang, Cuiling Lan, Yan Lu, Nanning zheng,
Abstract summary: Disentangled representation learning strives to extract the intrinsic factors within observed data. We introduce a new perspective and framework, demonstrating that diffusion models with cross-attention can serve as a powerful inductive bias. This is the first work to reveal the potent disentanglement capability of diffusion models with cross-attention, requiring no complex designs.
Score: 58.9768112704998
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Disentangled representation learning strives to extract the intrinsic factors within observed data. Factorizing these representations in an unsupervised manner is notably challenging and usually requires tailored loss functions or specific structural designs. In this paper, we introduce a new perspective and framework, demonstrating that diffusion models with cross-attention can serve as a powerful inductive bias to facilitate the learning of disentangled representations. We propose to encode an image to a set of concept tokens and treat them as the condition of the latent diffusion for image reconstruction, where cross-attention over the concept tokens is used to bridge the interaction between the encoder and diffusion. Without any additional regularization, this framework achieves superior disentanglement performance on the benchmark datasets, surpassing all previous methods with intricate designs. We have conducted comprehensive ablation studies and visualization analysis, shedding light on the functioning of this model. This is the first work to reveal the potent disentanglement capability of diffusion models with cross-attention, requiring no complex designs. We anticipate that our findings will inspire more investigation on exploring diffusion for disentangled representation learning towards more sophisticated data analysis and understanding.

Related papers

Provable Maximum Entropy Manifold Exploration via Diffusion Models [58.89696361871563]
Exploration is critical for solving real-world decision-making problems such as scientific discovery.<n>We introduce a novel framework that casts exploration as entropy over approximate data manifold implicitly defined by a pre-trained diffusion model.<n>We develop an algorithm based on mirror descent that solves the exploration problem as sequential fine-tuning of a pre-trained diffusion model.
arXiv Detail & Related papers (2025-06-18T11:59:15Z)
G4Seg: Generation for Inexact Segmentation Refinement with Diffusion Models [38.44872934965588]
This paper considers the problem of utilizing a large-scale text-to-image model to tackle the Inexact diffusion (IS) task.<n>We exploit the pattern discrepancies between original images and mask-conditional generated images to facilitate a coarse-to-fine segmentation refinement.
arXiv Detail & Related papers (2025-06-02T11:05:28Z)
Critical Iterative Denoising: A Discrete Generative Model Applied to Graphs [52.50288418639075]
We propose a novel framework called Iterative Denoising, which simplifies discrete diffusion and circumvents the issue by assuming conditional independence across time. Our empirical evaluations demonstrate that the proposed method significantly outperforms existing discrete diffusion baselines in graph generation tasks.
arXiv Detail & Related papers (2025-03-27T15:08:58Z)
Oscillation Inversion: Understand the structure of Large Flow Model through the Lens of Inversion Method [60.88467353578118]
We show that a fixed-point-inspired iterative approach to invert real-world images does not achieve convergence, instead oscillating between distinct clusters. We introduce a simple and fast distribution transfer technique that facilitates image enhancement, stroke-based recoloring, as well as visual prompt-guided image editing.
arXiv Detail & Related papers (2024-11-17T17:45:37Z)
Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models. We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space. These learned relation embeddings then serve as textual prompts, to steer diffusion models generate images that depict specific interactions.
arXiv Detail & Related papers (2024-10-26T12:00:33Z)
Diffusion Models in Low-Level Vision: A Survey [82.77962165415153]
diffusion model-based solutions have emerged as widely acclaimed for their ability to produce samples of superior quality and diversity. We present three generic diffusion modeling frameworks and explore their correlations with other deep generative models. We summarize extended diffusion models applied in other tasks, including medical, remote sensing, and video scenarios.
arXiv Detail & Related papers (2024-06-17T01:49:27Z)
DiffusionPID: Interpreting Diffusion via Partial Information Decomposition [24.83767778658948]
We apply information-theoretic principles to decompose the input text prompt into its elementary components. We analyze how individual tokens and their interactions shape the generated image. We show that PID is a potent tool for evaluating and diagnosing text-to-image diffusion models.
arXiv Detail & Related papers (2024-06-07T18:17:17Z)
Explaining generative diffusion models via visual analysis for interpretable decision-making process [28.552283701883766]
We propose the three research questions to interpret the diffusion process from the perspective of the visual concepts generated by the model. We devise tools for visualizing the diffusion process and answering the aforementioned research questions to render the diffusion process human-understandable.
arXiv Detail & Related papers (2024-02-16T02:12:20Z)
Bridging Generative and Discriminative Models for Unified Visual Perception with Diffusion Priors [56.82596340418697]
We propose a simple yet effective framework comprising a pre-trained Stable Diffusion (SD) model containing rich generative priors, a unified head (U-head) capable of integrating hierarchical representations, and an adapted expert providing discriminative priors. Comprehensive investigations unveil potential characteristics of Vermouth, such as varying granularity of perception concealed in latent variables at distinct time steps and various U-net stages. The promising results demonstrate the potential of diffusion models as formidable learners, establishing their significance in furnishing informative and robust visual representations.
arXiv Detail & Related papers (2024-01-29T10:36:57Z)
Directional diffusion models for graph representation learning [9.457273750874357]
We propose a new class of models called it directional diffusion models These models incorporate data-dependent, anisotropic, and directional noises in the forward diffusion process. We conduct extensive experiments on 12 publicly available datasets, focusing on two distinct graph representation learning tasks.
arXiv Detail & Related papers (2023-06-22T21:27:48Z)
DiffusionSeg: Adapting Diffusion Towards Unsupervised Object Discovery [20.787180028571694]
DiffusionSeg is a synthesis-exploitation framework containing two-stage strategies. We synthesize abundant images, and propose a novel training-free AttentionCut to obtain masks in the first stage. In the second exploitation stage, to bridge the structural gap, we use the inversion technique, to map the given image back to diffusion features.
arXiv Detail & Related papers (2023-03-17T07:47:55Z)
Proactive Pseudo-Intervention: Causally Informed Contrastive Learning For Interpretable Vision Models [103.64435911083432]
We present a novel contrastive learning strategy called it Proactive Pseudo-Intervention (PPI) PPI leverages proactive interventions to guard against image features with no causal relevance. We also devise a novel causally informed salience mapping module to identify key image pixels to intervene, and show it greatly facilitates model interpretability.
arXiv Detail & Related papers (2020-12-06T20:30:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.