CAM-Seg: A Continuous-valued Embedding Approach for Semantic Image Generation
- URL: http://arxiv.org/abs/2503.15617v1
- Date: Wed, 19 Mar 2025 18:06:54 GMT
- Title: CAM-Seg: A Continuous-valued Embedding Approach for Semantic Image Generation
- Authors: Masud Ahmed, Zahid Hasan, Syed Arefinul Haque, Abu Zaher Md Faridee, Sanjay Purushotham, Suya You, Nirmalya Roy
- Abstract summary: Autoencoder accuracy on segmentation masks using quantized embeddings is 8% lower than with continuous-valued embeddings. We propose a continuous-valued embedding framework for semantic segmentation. Our approach eliminates the need for discrete latent representations while preserving fine-grained semantic details.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional transformer-based semantic segmentation relies on quantized embeddings. However, our analysis reveals that autoencoder accuracy on segmentation masks using quantized embeddings (e.g., VQ-VAE) is 8% lower than with continuous-valued embeddings (e.g., KL-VAE). Motivated by this, we propose a continuous-valued embedding framework for semantic segmentation. By reformulating semantic mask generation as a continuous image-to-embedding diffusion process, our approach eliminates the need for discrete latent representations while preserving fine-grained spatial and semantic details. Our key contribution is a diffusion-guided autoregressive transformer that learns a continuous semantic embedding space by modeling long-range dependencies in image features. Our framework is a unified architecture combining a VAE encoder for continuous feature extraction, a diffusion-guided transformer for conditioned embedding generation, and a VAE decoder for semantic mask reconstruction. The continuity of the embedding space enables zero-shot domain adaptation. Experiments across diverse datasets (e.g., Cityscapes and domain-shifted variants) demonstrate state-of-the-art robustness to distribution shifts, including adverse weather (e.g., fog, snow) and viewpoint variations. Our model also exhibits strong noise resilience, achieving robust performance ($\approx$ 95% AP relative to the baseline) under Gaussian noise, moderate motion blur, and moderate brightness/contrast variations, while experiencing only a moderate impact ($\approx$ 90% AP relative to the baseline) from 50% salt-and-pepper noise and from saturation and hue shifts. Code available: https://github.com/mahmed10/CAMSS.git
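A minimal PyTorch sketch of the three-stage pipeline described in the abstract: a continuous VAE encoder, a diffusion-guided transformer that denoises mask embeddings conditioned on image embeddings, and a VAE decoder. All module names, shapes, and the fixed-step denoising loop below are illustrative assumptions, not the released implementation (see the linked repository for that).

```python
# Hypothetical sketch of the CAM-Seg pipeline: a KL-VAE-style encoder maps the
# RGB image to continuous embeddings, a transformer guided by a diffusion
# objective generates mask embeddings conditioned on them, and a VAE decoder
# reconstructs the semantic mask. Names and shapes are toy stand-ins.
import torch
import torch.nn as nn

class ContinuousVAEEncoder(nn.Module):          # image -> continuous embedding grid
    def __init__(self, in_ch=3, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, dim, 4, stride=4), nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1),
        )
    def forward(self, x):
        return self.net(x)                      # (B, dim, H/4, W/4)

class DiffusionGuidedTransformer(nn.Module):    # denoise mask embeddings given image embeddings
    def __init__(self, dim=64, heads=4, layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.GELU(), nn.Linear(dim, dim))
    def forward(self, noisy_mask_emb, image_emb, t):
        B, C, H, W = noisy_mask_emb.shape
        tok = lambda z: z.flatten(2).transpose(1, 2)               # (B, HW, C)
        h = tok(noisy_mask_emb) + tok(image_emb) + self.time_mlp(t.view(B, 1))[:, None, :]
        h = self.encoder(h)                                        # long-range dependencies
        return h.transpose(1, 2).reshape(B, C, H, W)               # predicted clean embedding

class VAEDecoder(nn.Module):                    # embedding grid -> per-pixel class logits
    def __init__(self, dim=64, n_classes=19):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(dim, dim, 4, stride=4), nn.GELU(),
            nn.Conv2d(dim, n_classes, 1),
        )
    def forward(self, z):
        return self.net(z)

# Toy forward pass: iteratively refine mask embeddings starting from noise.
enc, trans, dec = ContinuousVAEEncoder(), DiffusionGuidedTransformer(), VAEDecoder()
img = torch.randn(2, 3, 128, 128)
cond = enc(img)
z = torch.randn_like(cond)                      # start from pure Gaussian noise
for step in torch.linspace(1.0, 0.0, steps=4):  # crude fixed-step denoising loop
    t = step.repeat(img.shape[0])
    z = trans(z, cond, t)
logits = dec(z)                                 # (2, 19, 128, 128) semantic mask logits
print(logits.shape)
```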
Related papers
- High-Fidelity Image Inpainting with Multimodal Guided GAN Inversion [15.202130790708747]
We propose a novel GAN inversion approach, dubbed MMInvertFill, for image inpainting.
MMInvertFill contains primarily a multimodal guided encoder with a pre-modulation and a GAN generator with F&W+ latent space.
We show that our MMInvertFill qualitatively and quantitatively outperforms other state-of-the-art methods.
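As background to the inversion step: GAN inversion for inpainting generally means finding a latent code whose generator output agrees with the image on known pixels. The sketch below shows only the generic optimization-based variant with a toy generator; MMInvertFill's multimodal encoder, pre-modulation, and F&W+ latent space are not reproduced.

```python
# Generic optimization-based GAN inversion for inpainting with a toy generator.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(32, 256), nn.Tanh(), nn.Linear(256, 3 * 16 * 16))

img = torch.rand(3 * 16 * 16)                     # ground-truth image (flattened)
mask = (torch.rand(3 * 16 * 16) > 0.25).float()   # 1 = known pixel, 0 = hole

z = torch.zeros(32, requires_grad=True)           # latent code to invert
opt = torch.optim.Adam([z], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = ((G(z) - img) ** 2 * mask).mean()      # match generator on known pixels only
    loss.backward()
    opt.step()

inpainted = img * mask + G(z).detach() * (1 - mask)  # fill holes with generator output
```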
arXiv Detail & Related papers (2025-04-17T10:58:45Z)
- FreSca: Unveiling the Scaling Space in Diffusion Models [52.20473039489599]
Diffusion models offer impressive controllability for image tasks, primarily through noise predictions that encode task-specific information, with guidance enabling adjustable scaling.
We investigate this space, starting with inversion-based editing where the difference between conditional/unconditional noise predictions carries key semantic information.
Our core contribution stems from a Fourier analysis of noise predictions, revealing that their low- and high-frequency components evolve differently throughout diffusion.
Based on this insight, we introduce FreSca, a straightforward method that applies guidance scaling independently to different frequency bands in the Fourier domain.
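A minimal sketch of that idea, assuming the standard classifier-free-guidance setup: take the conditional/unconditional noise-prediction difference, split it into low- and high-frequency bands with an FFT, and apply a separate scale to each band (the cutoff and scale values below are illustrative, not FreSca's settings).

```python
# Sketch of frequency-band guidance scaling in the Fourier domain. The
# conditional/unconditional difference is the guidance direction; its low- and
# high-frequency parts receive independent scales before guidance is applied.
import torch

def freq_scaled_guidance(eps_uncond, eps_cond, s_low=7.5, s_high=5.0, cutoff=0.25):
    diff = eps_cond - eps_uncond                               # (B, C, H, W)
    f = torch.fft.fftshift(torch.fft.fft2(diff), dim=(-2, -1))
    H, W = diff.shape[-2:]
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing="ij")
    low = ((yy ** 2 + xx ** 2).sqrt() <= cutoff).to(f.dtype)   # centered low-pass mask
    f_scaled = f * (s_low * low + s_high * (1 - low))          # per-band guidance scales
    guided = torch.fft.ifft2(torch.fft.ifftshift(f_scaled, dim=(-2, -1))).real
    return eps_uncond + guided

eps_u, eps_c = torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)
print(freq_scaled_guidance(eps_u, eps_c).shape)                # torch.Size([1, 4, 64, 64])
```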
arXiv Detail & Related papers (2025-04-02T22:03:11Z)
- Fine-grained Image-to-LiDAR Contrastive Distillation with Visual Foundation Models [55.99654128127689]
Visual Foundation Models (VFMs) are used to generate semantic labels for weakly-supervised pixel-to-point contrastive distillation. We adapt the sampling probabilities of points to address imbalances in spatial distribution and category frequency. Our approach consistently surpasses existing image-to-LiDAR contrastive distillation methods on downstream tasks.
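The class-frequency part of that resampling can be illustrated in a few lines: weight each point inversely by its category frequency and draw anchors from the resulting distribution (the spatial-density balancing is omitted, and all sizes below are toy values).

```python
# Sketch: inverse-frequency sampling so rare categories are drawn more often
# during contrastive distillation (spatial balancing omitted).
import torch

point_labels = torch.randint(0, 5, (10_000,))            # pseudo-labels from a VFM
counts = torch.bincount(point_labels, minlength=5).float()
weights = 1.0 / counts[point_labels]                     # rare class -> larger weight
probs = weights / weights.sum()
idx = torch.multinomial(probs, num_samples=1024, replacement=False)
sampled = point_labels[idx]                              # class-balanced anchor points
print(torch.bincount(sampled, minlength=5))              # roughly uniform counts
```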
arXiv Detail & Related papers (2024-05-23T07:48:19Z)
- CLR-Face: Conditional Latent Refinement for Blind Face Restoration Using Score-Based Diffusion Models [57.9771859175664]
Recent generative-prior-based methods have shown promising blind face restoration performance.
Generating fine-grained facial details faithful to inputs remains a challenging problem.
We introduce a diffusion-based prior inside a VQGAN architecture that focuses on learning the distribution over uncorrupted latent embeddings.
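The general recipe (learn a prior over clean latents, then iteratively refine the degraded image's latent with it before decoding) can be sketched with toy stand-in modules; nothing below reproduces CLR-Face's actual networks or sampler.

```python
# Toy sketch of a latent prior for restoration: encode the degraded face,
# nudge its latent toward the clean-latent manifold with a learned denoiser,
# then decode. Real systems use a VQGAN encoder/decoder here.
import torch
import torch.nn as nn

encoder = nn.Linear(64, 16)                  # stand-in for the VQGAN encoder
decoder = nn.Linear(16, 64)                  # stand-in for the VQGAN decoder
denoiser = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))

degraded = torch.randn(1, 64)
z = encoder(degraded)
for _ in range(10):                          # iterative latent refinement
    z = z + 0.1 * (denoiser(z) - z)          # step toward the predicted clean latent
restored = decoder(z)
```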
arXiv Detail & Related papers (2024-02-08T23:51:49Z)
- Improving Diffusion-Based Image Synthesis with Context Prediction [49.186366441954846]
Existing diffusion models mainly try to reconstruct the input image from a corrupted one with a pixel-wise or feature-wise constraint along the spatial axes.
We propose ConPreDiff to improve diffusion-based image synthesis with context prediction.
Our ConPreDiff consistently outperforms previous methods and achieves new SOTA text-to-image generation results on MS-COCO, with a zero-shot FID score of 6.21.
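Stripped down, context prediction means each decoded feature must also predict its spatial neighborhood, so the model is penalized for ignoring local context. A hedged sketch of such a neighborhood loss (the head, neighbor set, and wrap-around border handling are simplifications):

```python
# Sketch of a neighborhood context-prediction loss: every feature vector must
# also predict its 4-connected neighbors (toy version wraps around at borders).
import torch
import torch.nn as nn
import torch.nn.functional as F

feats = torch.randn(2, 32, 16, 16, requires_grad=True)   # decoded feature map
predict_ctx = nn.Conv2d(32, 32 * 4, 1)                   # 1x1 head: 4 neighbor predictions

pred = predict_ctx(feats).view(2, 4, 32, 16, 16)
shifts = [(0, 1), (0, -1), (1, 0), (-1, 0)]              # right, left, down, up
ctx_loss = sum(
    F.mse_loss(pred[:, i], torch.roll(feats, shifts=s, dims=(-2, -1)).detach())
    for i, s in enumerate(shifts)
) / 4
ctx_loss.backward()                                      # added to the usual diffusion loss
```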
arXiv Detail & Related papers (2024-01-04T01:10:56Z)
- f-DM: A Multi-stage Diffusion Model via Progressive Signal Transformation [56.04628143914542]
Diffusion models (DMs) have recently emerged as SoTA tools for generative modeling in various domains.
We propose f-DM, a generalized family of DMs which allows progressive signal transformation.
We apply f-DM in image generation tasks with a range of functions, including down-sampling, blurring, and learned transformations.
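The forward process can be sketched as interleaving noising with a signal transformation per stage, here 2x average-pool downsampling; stage counts and noise levels below are illustrative, not f-DM's schedules.

```python
# Sketch of an f-DM-style forward process: alternate Gaussian noising with a
# progressive signal transformation (here, 2x average-pool downsampling).
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 64, 64)                          # clean image
stages, noise_per_stage = 3, 0.3
for s in range(stages):
    x = x + noise_per_stage * torch.randn_like(x)      # diffuse within the stage
    x = F.avg_pool2d(x, kernel_size=2)                 # then transform the signal
print(x.shape)                                         # torch.Size([1, 3, 8, 8])
```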
arXiv Detail & Related papers (2022-10-10T18:49:25Z)
- Lossy Image Compression with Conditional Diffusion Models [25.158390422252097]
This paper outlines an end-to-end optimized lossy image compression framework using diffusion generative models.
In contrast to VAE-based neural compression, where the (mean) decoder is a deterministic neural network, our decoder is a conditional diffusion model.
Our approach yields stronger reported FID scores than the GAN-based model, while also yielding competitive performance with VAE-based models in several distortion metrics.
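In sketch form, the decoder-side idea is that the transmitted latent conditions an iterative denoising loop rather than a single deterministic decoder pass. The modules below are toy stand-ins under that assumption:

```python
# Toy sketch of a conditional-diffusion decoder for compression: the quantized
# latent y conditions each denoising step of the reconstruction.
import torch
import torch.nn as nn

analysis = nn.Linear(64, 8)                        # toy analysis transform (encoder)
denoise = nn.Sequential(nn.Linear(64 + 8, 128), nn.GELU(), nn.Linear(128, 64))

img = torch.randn(1, 64)
y = torch.round(analysis(img))                     # quantized latent to transmit

x = torch.randn(1, 64)                             # decoder starts from pure noise
for _ in range(8):                                 # conditional denoising loop
    x = x - 0.2 * (x - denoise(torch.cat([x, y], dim=-1)))
reconstruction = x
```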
arXiv Detail & Related papers (2022-09-14T21:53:27Z)
- Semantic Image Synthesis via Diffusion Models [159.4285444680301]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks.
Recent work on semantic image synthesis mainly follows the de facto Generative Adversarial Nets (GANs).
arXiv Detail & Related papers (2022-06-30T18:31:51Z)