Related papers: Toward a Diffusion-Based Generalist for Dense Vision Tasks

Toward a Diffusion-Based Generalist for Dense Vision Tasks

URL: http://arxiv.org/abs/2407.00503v1
Date: Sat, 29 Jun 2024 17:57:22 GMT
Title: Toward a Diffusion-Based Generalist for Dense Vision Tasks
Authors: Yue Fan, Yongqin Xian, Xiaohua Zhai, Alexander Kolesnikov, Muhammad Ferjad Naeem, Bernt Schiele, Federico Tombari,
Abstract summary: Recent works have shown image itself can be used as a natural interface for general-purpose visual perception. We propose to perform diffusion in pixel space and provide a recipe for finetuning pre-trained text-to-image diffusion models for dense vision tasks. In experiments, we evaluate our method on four different types of tasks and show competitive performance to the other vision generalists.
Score: 141.03236279493686
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Building generalized models that can solve many computer vision tasks simultaneously is an intriguing direction. Recent works have shown image itself can be used as a natural interface for general-purpose visual perception and demonstrated inspiring results. In this paper, we explore diffusion-based vision generalists, where we unify different types of dense prediction tasks as conditional image generation and re-purpose pre-trained diffusion models for it. However, directly applying off-the-shelf latent diffusion models leads to a quantization issue. Thus, we propose to perform diffusion in pixel space and provide a recipe for finetuning pre-trained text-to-image diffusion models for dense vision tasks. In experiments, we evaluate our method on four different types of tasks and show competitive performance to the other vision generalists.

Related papers

Visual Bridge: Universal Visual Perception Representations Generating [27.034175361589572]
We propose a universal visual perception framework based on flow matching that can generate diverse visual representations across multiple tasks.<n>Our approach formulates the process as a universal flow-matching problem from image patch tokens to task-specific representations.<n>Our model achieves competitive performance in both zero-shot and fine-tuned settings, outperforming prior generalist and several specialist models.
arXiv Detail & Related papers (2025-11-11T06:25:30Z)
Generalist Forecasting with Frozen Video Models via Latent Diffusion [35.96406989431198]
We show a strong correlation between a vision model's perceptual ability and its generalist forecasting performance over short time horizons.<n>Our results highlight the value of bridging representation learning and generative modeling for temporally grounded video understanding.
arXiv Detail & Related papers (2025-07-18T14:14:19Z)
USP: Unified Self-Supervised Pretraining for Image Generation and Understanding [15.717333276867462]
Unified Self-supervised Pretraining (USP) is a framework that initializes diffusion models via masked latent modeling in a Variational Autoencoder (VAE) latent space. USP achieves comparable performance in understanding tasks while significantly improving the convergence speed and generation quality of diffusion models.
arXiv Detail & Related papers (2025-03-08T09:01:03Z)
DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks [61.16389024252561]
We develop a robust generalist perception model capable of addressing multiple tasks under constraints of computational resources and limited training data.<n>We leverage text-to-image diffusion models pre-trained on billions of images and successfully introduce our DICEPTION, a visual generalist model.<n> Exhaustive evaluations demonstrate that DICEPTION effectively tackles diverse perception tasks, even achieving performance comparable to SOTA single-task specialist models.
arXiv Detail & Related papers (2025-02-24T13:51:06Z)
From Image to Video: An Empirical Study of Diffusion Representations [37.09795196423048]
Diffusion models have revolutionized generative modeling, enabling unprecedented realism in image and video synthesis. This work marks the first direct comparison of video and image diffusion objectives for visual understanding, offering insights into the role of temporal information in representation learning.
arXiv Detail & Related papers (2025-02-10T19:53:46Z)
Diffusion Models in Low-Level Vision: A Survey [82.77962165415153]
diffusion model-based solutions have emerged as widely acclaimed for their ability to produce samples of superior quality and diversity. We present three generic diffusion modeling frameworks and explore their correlations with other deep generative models. We summarize extended diffusion models applied in other tasks, including medical, remote sensing, and video scenarios.
arXiv Detail & Related papers (2024-06-17T01:49:27Z)
Explaining generative diffusion models via visual analysis for interpretable decision-making process [28.552283701883766]
We propose the three research questions to interpret the diffusion process from the perspective of the visual concepts generated by the model. We devise tools for visualizing the diffusion process and answering the aforementioned research questions to render the diffusion process human-understandable.
arXiv Detail & Related papers (2024-02-16T02:12:20Z)
Bridging Generative and Discriminative Models for Unified Visual Perception with Diffusion Priors [56.82596340418697]
We propose a simple yet effective framework comprising a pre-trained Stable Diffusion (SD) model containing rich generative priors, a unified head (U-head) capable of integrating hierarchical representations, and an adapted expert providing discriminative priors. Comprehensive investigations unveil potential characteristics of Vermouth, such as varying granularity of perception concealed in latent variables at distinct time steps and various U-net stages. The promising results demonstrate the potential of diffusion models as formidable learners, establishing their significance in furnishing informative and robust visual representations.
arXiv Detail & Related papers (2024-01-29T10:36:57Z)
Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks. We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception. Our approach achieves new performance records in depth estimation tasks on NYU depth V2 and KITTI, and in semantic segmentation task on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
InstructDiffusion: A Generalist Modeling Interface for Vision Tasks [52.981128371910266]
We present InstructDiffusion, a framework for aligning computer vision tasks with human instructions. InstructDiffusion could handle a variety of vision tasks, including understanding tasks and generative tasks. It even exhibits the ability to handle unseen tasks and outperforms prior methods on novel datasets.
arXiv Detail & Related papers (2023-09-07T17:56:57Z)
Deceptive-NeRF/3DGS: Diffusion-Generated Pseudo-Observations for High-Quality Sparse-View Reconstruction [60.52716381465063]
We introduce Deceptive-NeRF/3DGS to enhance sparse-view reconstruction with only a limited set of input images. Specifically, we propose a deceptive diffusion model turning noisy images rendered from few-view reconstructions into high-quality pseudo-observations. Our system progressively incorporates diffusion-generated pseudo-observations into the training image sets, ultimately densifying the sparse input observations by 5 to 10 times.
arXiv Detail & Related papers (2023-05-24T14:00:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.