Bridging Generative and Discriminative Models for Unified Visual
Perception with Diffusion Priors
- URL: http://arxiv.org/abs/2401.16459v1
- Date: Mon, 29 Jan 2024 10:36:57 GMT
- Title: Bridging Generative and Discriminative Models for Unified Visual
Perception with Diffusion Priors
- Authors: Shiyin Dong, Mingrui Zhu, Kun Cheng, Nannan Wang, Xinbo Gao
- Abstract summary: We propose a simple yet effective framework comprising a pre-trained Stable Diffusion (SD) model containing rich generative priors, a unified head (U-head) capable of integrating hierarchical representations, and an adapted expert providing discriminative priors.
Comprehensive investigations reveal characteristics of Vermouth, such as the varying granularity of perception concealed in latent variables at distinct time steps and different U-Net stages.
The promising results demonstrate the potential of diffusion models as formidable learners, establishing their significance in furnishing informative and robust visual representations.
- Score: 56.82596340418697
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The remarkable prowess of diffusion models in image generation has spurred
efforts to extend their application beyond generative tasks. However, a
persistent challenge remains: there is no unified approach for applying
diffusion models to visual perception tasks with diverse semantic granularity
requirements. Our goal is to establish a unified visual perception
framework, capitalizing on the potential synergies between generative and
discriminative models. In this paper, we propose Vermouth, a simple yet
effective framework comprising a pre-trained Stable Diffusion (SD) model
containing rich generative priors, a unified head (U-head) capable of
integrating hierarchical representations, and an adapted expert providing
discriminative priors. Comprehensive investigations reveal characteristics of
Vermouth, such as the varying granularity of perception concealed in latent
variables at distinct time steps and different U-Net stages.
We emphasize that no heavyweight or intricate decoder is needed to transform
diffusion models into potent representation learners. Extensive comparative
evaluations against tailored discriminative
models showcase the efficacy of our approach on zero-shot sketch-based image
retrieval (ZS-SBIR), few-shot classification, and open-vocabulary semantic
segmentation tasks. The promising results demonstrate the potential of
diffusion models as formidable learners, establishing their significance in
furnishing informative and robust visual representations.
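To make the recipe concrete, the sketch below shows one way to extract hierarchical features from a frozen Stable Diffusion U-Net at a chosen noise timestep and fuse them with a small trainable head, in the spirit of the U-head. It is a minimal sketch built on the Hugging Face diffusers API with the public SD v1.5 checkpoint; the hook placement, the timestep, and the FuseHead module are illustrative assumptions, not the paper's exact design.

    # Minimal sketch (not the authors' code): hierarchical features from a
    # frozen SD U-Net. Hook placement, timestep t, and FuseHead are assumed.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel

    MODEL = "runwayml/stable-diffusion-v1-5"
    vae = AutoencoderKL.from_pretrained(MODEL, subfolder="vae").eval()
    unet = UNet2DConditionModel.from_pretrained(MODEL, subfolder="unet").eval()
    scheduler = DDPMScheduler.from_pretrained(MODEL, subfolder="scheduler")

    feats = []  # filled by hooks: one activation per U-Net up-stage

    def grab(_module, _inputs, output):
        # up-blocks return a tensor; keep it as one level of the pyramid
        feats.append(output if torch.is_tensor(output) else output[0])

    for blk in unet.up_blocks:
        blk.register_forward_hook(grab)

    @torch.no_grad()
    def extract(images, text_embeds, t=100):
        """Encode images to latents, noise them to step t, run one U-Net
        pass, and return the coarse-to-fine feature pyramid."""
        feats.clear()
        lat = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor
        ts = torch.full((lat.shape[0],), t, dtype=torch.long)
        noisy = scheduler.add_noise(lat, torch.randn_like(lat), ts)
        unet(noisy, ts, encoder_hidden_states=text_embeds)
        return list(feats)

    class FuseHead(nn.Module):
        """Hypothetical lightweight head: project every scale to a shared
        width, upsample to the finest resolution, sum, and pool."""
        def __init__(self, in_chs, dim=256):
            super().__init__()
            self.proj = nn.ModuleList(nn.Conv2d(c, dim, 1) for c in in_chs)
        def forward(self, pyramid):
            size = pyramid[-1].shape[-2:]
            fused = sum(F.interpolate(p(f), size=size, mode="bilinear")
                        for p, f in zip(self.proj, pyramid))
            return fused.mean(dim=(-2, -1))  # (B, dim) image embedding

Varying t and choosing different stages exposes the coarse-to-fine granularity the abstract describes; for SD v1.5, text_embeds is a (B, 77, 768) tensor from the frozen CLIP text encoder (an empty-prompt embedding works for unconditional extraction).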
Related papers
- Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models [39.127620891450526]
We introduce a unified, versatile, diffusion-based framework, Diff-2-in-1, to handle both multi-modal data generation and dense visual perception.
We further enhance discriminative visual perception via multi-modal generation, by utilizing the denoising network to create multi-modal data that mirror the distribution of the original training set.
arXiv Detail & Related papers (2024-11-07T18:59:53Z)
- Unleashing the Potential of the Diffusion Model in Few-shot Semantic Segmentation [56.87049651707208]
Few-shot Semantic Segmentation has evolved into an in-context task and has become a crucial element in assessing generalist segmentation models.
Our initial focus lies in facilitating interaction between the query image and the support image, leading to a KV fusion method within the self-attention framework (a schematic sketch follows this entry).
Based on our analysis, we establish a simple and effective framework named DiffewS, maximally retaining the original Latent Diffusion Model's generative framework.
arXiv Detail & Related papers (2024-10-03T10:33:49Z)
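Read schematically, KV fusion admits a compact sketch: in a self-attention layer, the keys and values of the support image are concatenated with those of the query image, so query tokens can attend to support content. This is our illustration of the idea as summarized above, not DiffewS code; all names are assumptions.

    # Schematic KV fusion (our illustration, not DiffewS code): query-image
    # tokens attend over the concatenated query + support keys/values.
    import torch
    import torch.nn as nn

    class KVFusionSelfAttention(nn.Module):
        def __init__(self, dim, heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, q_tokens, s_tokens):
            # q_tokens: (B, Nq, D) query image; s_tokens: (B, Ns, D) support
            kv = torch.cat([q_tokens, s_tokens], dim=1)  # fused keys/values
            out, _ = self.attn(q_tokens, kv, kv)
            return out  # (B, Nq, D), support-aware query features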
- Diffusion Models in Low-Level Vision: A Survey [82.77962165415153]
Diffusion model-based solutions have been widely acclaimed for their ability to produce samples of superior quality and diversity.
We present three generic diffusion modeling frameworks and explore their correlations with other deep generative models.
We summarize extended diffusion models applied in other tasks, including medical, remote sensing, and video scenarios.
arXiv Detail & Related papers (2024-06-17T01:49:27Z)
- Diffusion Model with Cross Attention as an Inductive Bias for Disentanglement [58.9768112704998]
Disentangled representation learning strives to extract the intrinsic factors within observed data.
We introduce a new perspective and framework, demonstrating that diffusion models with cross-attention can serve as a powerful inductive bias.
This is the first work to reveal the potent disentanglement capability of diffusion models with cross-attention, requiring no complex designs.
arXiv Detail & Related papers (2024-02-15T05:07:54Z)
- DiffAugment: Diffusion based Long-Tailed Visual Relationship Recognition [43.01467525231004]
We introduce DiffAugment, a method that augments the tail classes in the linguistic space using WordNet.
We demonstrate the effectiveness of hardness-aware diffusion in generating visual embeddings for the tail classes.
We also propose a novel subject and object based seeding strategy for diffusion sampling which improves the discriminative capability of the generated visual embeddings.
arXiv Detail & Related papers (2024-01-01T21:20:43Z)
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) into pre-trained diffusion models to extract features suited to perception (a sketch follows this entry).
Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on Cityscapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
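One plausible reading of "learnable embeddings (meta prompts)" is a trained parameter tensor that replaces the text embeddings conditioning a frozen diffusion U-Net through cross-attention. The sketch below follows that reading; N_PROMPTS, the initialization, and the wiring are assumptions, not the paper's exact recipe.

    # Hedged sketch: meta prompts as a learnable tensor fed to a frozen SD
    # U-Net in place of text embeddings. N_PROMPTS is an assumed value; 768
    # matches the cross-attention width of SD v1.5.
    import torch
    import torch.nn as nn

    N_PROMPTS, PROMPT_DIM = 64, 768
    meta_prompts = nn.Parameter(torch.randn(N_PROMPTS, PROMPT_DIM) * 0.02)

    def unet_with_prompts(unet, noisy_latents, timesteps):
        # Broadcast the shared prompts across the batch; during training
        # only the prompts (and any task head) would receive gradients.
        batch = noisy_latents.shape[0]
        cond = meta_prompts.unsqueeze(0).expand(batch, -1, -1)
        return unet(noisy_latents, timesteps, encoder_hidden_states=cond)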
- Diffusion Models for Image Restoration and Enhancement -- A Comprehensive Survey [96.99328714941657]
We present a comprehensive review of recent diffusion model-based methods for image restoration.
We classify and highlight innovative designs that use diffusion models for both image restoration (IR) and blind/real-world IR.
We propose five potential and challenging directions for the future research of diffusion model-based IR.
arXiv Detail & Related papers (2023-08-18T08:40:38Z)
- InfoDiffusion: Representation Learning Using Information Maximizing Diffusion Models [35.566528358691336]
InfoDiffusion is an algorithm that augments diffusion models with low-dimensional latent variables.
InfoDiffusion relies on a learning objective regularized with the mutual information between observed and hidden variables (sketched below).
We find that InfoDiffusion learns disentangled and human-interpretable latent representations that are competitive with state-of-the-art generative and contrastive methods.
arXiv Detail & Related papers (2023-06-14T21:48:38Z)
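Schematically, in our notation rather than the paper's exact formulation, the objective pairs the standard denoising loss, now conditioned on a low-dimensional latent z, with a mutual-information regularizer:

    L(theta) = E[ || eps - eps_theta(x_t, t, z) ||^2 ] - lambda * I(x0; z)

Here z is the auxiliary latent inferred from the clean sample x0, I(x0; z) is the mutual-information term the summary refers to, and lambda (our symbol) balances denoising fidelity against how informative z is about the data.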