Related papers: SCP-Diff: Spatial-Categorical Joint Prior for Diffusion Based Semantic Image Synthesis

URL: http://arxiv.org/abs/2403.09638v2
Date: Tue, 16 Jul 2024 12:40:17 GMT
Title: SCP-Diff: Spatial-Categorical Joint Prior for Diffusion Based Semantic Image Synthesis
Authors: Huan-ang Gao, Mingju Gao, Jiaju Li, Wenyi Li, Rong Zhi, Hao Tang, Hao Zhao
Abstract summary: SCP-Diff sets new state-of-the-art results in SIS on Cityscapes, ADE20K and COCO-Stuff, yielding an FID as low as 10.53 on Cityscapes.
Score: 8.768077629120915
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Semantic image synthesis (SIS) shows great promise for sensor simulation. However, current best practices in this field, based on GANs, have not yet reached the desired level of quality. As latent diffusion models make significant strides in image generation, we are prompted to evaluate ControlNet, a notable method for its dense control capabilities. Our investigation uncovered two primary issues with its results: the presence of weird sub-structures within large semantic areas and the misalignment of content with the semantic mask. Through empirical study, we pinpointed the cause of these problems as a mismatch between the noised training data distribution and the standard normal prior applied at the inference stage. To address this challenge, we developed specific noise priors for SIS, encompassing spatial, categorical, and a novel spatial-categorical joint prior for inference. This approach, which we have named SCP-Diff, has set new state-of-the-art results in SIS on Cityscapes, ADE20K and COCO-Stuff, yielding an FID as low as 10.53 on Cityscapes. The code and models can be accessed via the project page.
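
The idea of replacing the standard normal prior with a class-dependent noise prior can be illustrated with a small sketch. This is not the authors' implementation: it assumes, purely for illustration, that a categorical prior can be approximated by per-class, per-channel Gaussian statistics gathered from noised training latents, and it covers only the categorical part, not the spatial or joint prior. The function names `fit_categorical_prior` and `sample_initial_noise`, the array shapes, and the requirement that masks are already at latent resolution are all hypothetical.

```python
import numpy as np


def fit_categorical_prior(latents, masks, num_classes):
    """Estimate per-class, per-channel mean and std of noised latents.

    latents: float array (N, C, H, W), noised training latents (assumed given).
    masks:   int array (N, H, W), semantic labels at latent resolution.
    Classes that never appear keep the standard normal defaults (0, 1).
    """
    channels = latents.shape[1]
    mean = np.zeros((num_classes, channels))
    std = np.ones((num_classes, channels))
    # Move channels last so a boolean (N, H, W) mask selects (M, C) rows.
    channels_last = latents.transpose(0, 2, 3, 1)
    for k in range(num_classes):
        selected = masks == k
        if selected.any():
            values = channels_last[selected]
            mean[k] = values.mean(axis=0)
            std[k] = values.std(axis=0)
    return mean, std


def sample_initial_noise(mask, mean, std, rng):
    """Draw inference-time starting noise conditioned on one semantic mask.

    mask: int array (H, W). Returns a float array (C, H, W).
    """
    mu = mean[mask]      # (H, W, C): class mean looked up per location
    sigma = std[mask]    # (H, W, C): class std looked up per location
    eps = rng.standard_normal(mu.shape)
    return (mu + sigma * eps).transpose(2, 0, 1)
```

In a diffusion pipeline, the array returned by `sample_initial_noise` would stand in for the usual standard normal draw at the first denoising step; how SCP-Diff actually constructs and combines its spatial and categorical priors is described in the paper and its code release.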

Related papers

Why Settle for Mid: A Probabilistic Viewpoint to Spatial Relationship Alignment in Text-to-image Models [3.5999252362400993]
A prevalent issue in compositional generation is the misalignment of spatial relationships. We introduce a novel evaluation metric designed to assess the alignment of 2D and 3D spatial relationships between text and image. We also propose PoS-based Generation, an inference-time method that improves the alignment of 2D and 3D spatial relationships in T2I models without requiring fine-tuning.
arXiv Detail & Related papers (2025-06-29T22:41:27Z)
UrbanCraft: Urban View Extrapolation via Hierarchical Sem-Geometric Priors [10.706273062956507]
Urban scene reconstruction methods mainly focus on the Interpolated View Synthesis setting that synthesizes views close to training camera trajectory. Previous methods have optimized it via image diffusion, but they fail to handle text-ambiguous or large unseen view angles. We design UrbanCraft, which surmounts the Extrapolated View Synthesis problem using hierarchical sem-geometric representations serving as additional priors.
arXiv Detail & Related papers (2025-05-29T13:28:04Z)
Towards Robust and Realistic Human Pose Estimation via WiFi Signals [85.60557095666934]
WiFi-based human pose estimation is a challenging task that bridges discrete and subtle WiFi signals to human skeletons. This paper revisits this problem and reveals two critical yet overlooked issues: 1) a cross-domain gap, caused by significant variations between source and target domain pose distributions; and 2) a structural fidelity gap, in which predicted skeletal poses manifest distorted topology. This paper fills these gaps by reformulating the task into a novel two-phase framework dubbed DT-Pose: Domain-consistent representation learning and Topology-constrained Pose decoding.
arXiv Detail & Related papers (2025-01-16T09:38:22Z)
Coarse-Fine Spectral-Aware Deformable Convolution For Hyperspectral Image Reconstruction [15.537910100051866]
We study the inverse problem of Coded Aperture Snapshot Spectral Imaging (CASSI). We propose Coarse-Fine Spectral-Aware Deformable Convolution Network (CFSDCN). Our CFSDCN significantly outperforms previous state-of-the-art (SOTA) methods on both simulated and real HSI datasets.
arXiv Detail & Related papers (2024-06-18T15:15:12Z)
3D Human Pose Analysis via Diffusion Synthesis [65.268245109828]
PADS represents the first diffusion-based framework for tackling general 3D human pose analysis within the inverse problem framework. Its performance has been validated on different benchmarks, signaling the adaptability and robustness of this pipeline.
arXiv Detail & Related papers (2024-01-17T02:59:34Z)
JoReS-Diff: Joint Retinex and Semantic Priors in Diffusion Model for Low-light Image Enhancement [69.6035373784027]
Low-light image enhancement (LLIE) has achieved promising performance by employing conditional diffusion models. Previous methods may neglect the importance of a sufficient formulation of task-specific condition strategy. We propose JoReS-Diff, a novel approach that incorporates Retinex- and semantic-based priors as the additional pre-processing condition.
arXiv Detail & Related papers (2023-12-20T08:05:57Z)
Denoising Diffusion Semantic Segmentation with Mask Prior Modeling [61.73352242029671]
We propose to ameliorate the semantic segmentation quality of existing discriminative approaches with a mask prior modeled by a denoising diffusion generative model. We evaluate the proposed prior modeling with several off-the-shelf segmentors, and our experimental results on ADE20K and Cityscapes demonstrate that our approach could achieve competitive quantitative performance.
arXiv Detail & Related papers (2023-06-02T17:47:01Z)
Dual Stage Stylization Modulation for Domain Generalized Semantic Segmentation [39.35385886870209]
We introduce a dual-stage Feature Transform (dFT) layer within the Adversarial Semantic Hallucination+ framework. By leveraging semantic information for each pixel, our approach adaptively adjusts the pixel-wise hallucination strength. We validate the effectiveness of our proposed method through comprehensive experiments on publicly available semantic segmentation benchmark datasets.
arXiv Detail & Related papers (2023-04-18T23:54:20Z)
Empowering Diffusion Models on the Embedding Space for Text Generation [38.664533078347304]
We study the optimization challenges encountered with both the embedding space and the denoising model. Data distribution is learnable for embeddings, which may lead to the collapse of the embedding space and unstable training. Based on the above analysis, we propose Difformer, an embedding diffusion model based on Transformer.
arXiv Detail & Related papers (2022-12-19T12:44:25Z)
Semantic Image Synthesis via Diffusion Models [174.24523061460704]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks. Recent work on semantic image synthesis mainly follows the de facto GAN-based approaches. We propose a novel framework based on DDPM for semantic image synthesis.
arXiv Detail & Related papers (2022-06-30T18:31:51Z)
IGAN: Inferent and Generative Adversarial Networks [0.0]
IGAN learns both a generative and an inference model on a complex high dimensional data distribution. It extends the traditional GAN framework with inference by rewriting the adversarial strategy in both the image and the latent space.
arXiv Detail & Related papers (2021-09-27T21:48:35Z)
Recent Developments Combining Ensemble Smoother and Deep Generative Networks for Facies History Matching [58.720142291102135]
This research project focuses on the use of autoencoder networks to construct a continuous parameterization for facies models. We benchmark seven different formulations, including VAE, generative adversarial network (GAN), Wasserstein GAN, variational auto-encoding GAN, principal component analysis (PCA) with cycle GAN, PCA with transfer style network and VAE with style loss.
arXiv Detail & Related papers (2020-05-08T21:32:42Z)
Deep Semantic Matching with Foreground Detection and Cycle-Consistency [103.22976097225457]
We address weakly supervised semantic matching based on a deep network. We explicitly estimate the foreground regions to suppress the effect of background clutter. We develop cycle-consistent losses to enforce the predicted transformations across multiple images to be geometrically plausible and consistent.
arXiv Detail & Related papers (2020-03-31T22:38:09Z)
Peeking into occluded joints: A novel framework for crowd pose estimation [88.56203133287865]
OPEC-Net is an Image-Guided Progressive GCN module that estimates invisible joints from an inference perspective. OCPose is the most complex Occluded Pose dataset with respect to average IoU between adjacent instances.
arXiv Detail & Related papers (2020-03-23T19:32:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.