JoDiffusion: Jointly Diffusing Image with Pixel-Level Annotations for Semantic Segmentation Promotion
- URL: http://arxiv.org/abs/2512.13014v1
- Date: Mon, 15 Dec 2025 06:21:30 GMT
- Title: JoDiffusion: Jointly Diffusing Image with Pixel-Level Annotations for Semantic Segmentation Promotion
- Authors: Haoyu Wang, Lei Zhang, Wenrui Liu, Dengyang Jiang, Wei Wei, Chen Ding,
- Abstract summary: We present a novel dataset generative diffusion framework for semantic segmentation, termed JoDiffusion. JoDiffusion enables simultaneously generating paired images and semantically consistent annotation masks solely conditioned on text prompts. Experiments on Pascal VOC, COCO, and ADE20K datasets show that the annotated dataset generated by JoDiffusion yields substantial performance improvements in semantic segmentation compared to existing methods.
- Score: 13.484321670536291
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given the inherently costly and time-intensive nature of pixel-level annotation, generating synthetic datasets that pair sufficiently diverse synthetic images with ground-truth pixel-level annotations has recently garnered increasing attention for training high-performance semantic segmentation models. However, existing methods must either predict pseudo annotations after image generation or generate images conditioned on manual annotation masks, incurring image-annotation semantic inconsistency or scalability problems, respectively. To mitigate both problems at once, we present a novel dataset-generating diffusion framework for semantic segmentation, termed JoDiffusion. First, given a standard latent diffusion model, JoDiffusion incorporates an independent annotation variational auto-encoder (VAE) network to map annotation masks into the latent space shared with images. Then, the diffusion model is tailored to capture the joint distribution of each image and its annotation mask conditioned on a text prompt. In this way, JoDiffusion can simultaneously generate paired images and semantically consistent annotation masks conditioned solely on text prompts, thereby demonstrating superior scalability. Additionally, a mask optimization strategy is developed to mitigate the annotation noise produced during generation. Experiments on the Pascal VOC, COCO, and ADE20K datasets show that the annotated datasets generated by JoDiffusion yield substantial performance improvements in semantic segmentation compared to existing methods.
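The core mechanism the abstract describes, encoding an image and its annotation mask into a shared latent space and applying diffusion to the pair jointly, can be sketched as follows. This is a minimal illustration only: the linear "encoders" and cosine noise schedule are toy stand-ins, not the paper's actual VAEs or schedule.

```python
import numpy as np

def encode(x, W):
    """Toy linear 'encoder': flatten each sample and project to a latent
    vector. Stands in for the image VAE / annotation VAE encoders."""
    return x.reshape(x.shape[0], -1) @ W

def joint_noisy_latent(image, mask, W_img, W_mask, t, rng):
    """One forward-diffusion noising step applied to the concatenation of
    image and mask latents, so a single denoiser can model them jointly."""
    z = np.concatenate([encode(image, W_img), encode(mask, W_mask)], axis=1)
    alpha = np.cos(t * np.pi / 2)  # toy noise schedule, t in [0, 1]
    noise = rng.standard_normal(z.shape)
    return alpha * z + np.sqrt(1.0 - alpha**2) * noise, noise
```

At sampling time the process would run in reverse: a text-conditioned denoiser iteratively denoises the concatenated latent, and the mask half is then decoded by the annotation VAE decoder into a segmentation map.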
Related papers
- Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers [56.76198904599581]
Text-to-image diffusion models excel at translating language prompts into images, implicitly grounding concepts through their cross-modal attention mechanisms. Recent multi-modal diffusion transformers extend this by introducing joint self-attention over image and text tokens, enabling richer and more scalable cross-modal alignment. We introduce Seg4Diff, a systematic framework for analyzing the attention structures of MM-DiT, with a focus on how specific layers propagate semantic information from text to image.
arXiv Detail & Related papers (2025-09-22T17:59:54Z) - GS: Generative Segmentation via Label Diffusion [59.380173266566715]
Language-driven image segmentation is a fundamental task in vision-language understanding, requiring models to segment regions of an image corresponding to natural language expressions. Recent diffusion models have been introduced to this domain, but existing approaches remain image-centric. We propose GS (Generative Segmentation), a novel framework that formulates segmentation itself as a generative task via label diffusion. Experimental results show that GS significantly outperforms existing discriminative and diffusion-based methods, setting a new state-of-the-art for language-driven segmentation.
arXiv Detail & Related papers (2025-08-27T16:28:15Z) - SeeDiff: Off-the-Shelf Seeded Mask Generation from Diffusion Models [16.109077391631917]
We show that cross-attention alone provides only very coarse object localization, which can nevertheless serve as initial seeds. We also observe that a synthetic image generated from a simple text prompt often has a uniform background, in which correspondences are easier to find. Our proposed method, dubbed SeeDiff, generates high-quality masks off-the-shelf from Stable Diffusion.
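The seeding idea above can be illustrated with a small sketch: threshold a coarse cross-attention map to pick high-confidence foreground and background seed pixels, which a later correspondence step would grow into a full mask. The attention map and quantile thresholds here are illustrative assumptions; SeeDiff's actual maps come from Stable Diffusion's cross-attention layers.

```python
import numpy as np

def attention_seeds(attn, fg_quantile=0.9, bg_quantile=0.1):
    """Pick initial foreground/background seed pixels from a coarse
    cross-attention map by simple quantile thresholding."""
    fg = attn >= np.quantile(attn, fg_quantile)
    bg = attn <= np.quantile(attn, bg_quantile)
    return fg, bg
```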
arXiv Detail & Related papers (2025-07-26T05:44:00Z) - Contrastive Learning with Diffusion Features for Weakly Supervised Medical Image Segmentation [12.530950480385554]
Conditional diffusion models (CDMs) have been used as an alternative for generating segmentation masks. We introduce Contrastive Learning with Diffusion Features (CLDF), which trains a pixel decoder to map diffusion features from a frozen CDM to a low-dimensional embedding space for segmentation.
arXiv Detail & Related papers (2025-06-30T01:43:50Z) - Free-Mask: A Novel Paradigm of Integration Between the Segmentation Diffusion Model and Image Editing [8.654930768124844]
We propose a framework, Free-Mask, that combines a diffusion model for segmentation with advanced image editing capabilities. Results show that Free-Mask achieves new state-of-the-art results on previously unseen classes in the VOC 2012 benchmark.
arXiv Detail & Related papers (2024-11-04T05:39:01Z) - Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models [1.6450779686641077]
We introduce Open-Vocabulary Attention Maps (OVAM)-a training-free method for text-to-image diffusion models.
We evaluate these tokens within existing state-of-the-art Stable Diffusion extensions.
arXiv Detail & Related papers (2024-03-21T10:56:12Z) - EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models [52.3015009878545]
We develop an image segmentor capable of generating fine-grained segmentation maps without any additional training.
Our framework identifies semantic correspondences between image pixels and spatial locations of low-dimensional feature maps.
In extensive experiments, the produced segmentation maps are demonstrated to be well delineated and capture detailed parts of the images.
arXiv Detail & Related papers (2024-01-22T07:34:06Z) - MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask [84.84034179136458]
A crucial factor leading to the text-image mismatch issue is the inadequate cross-modality relation learning.
We propose an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features.
Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models.
arXiv Detail & Related papers (2023-09-08T15:53:37Z) - Exploring Limits of Diffusion-Synthetic Training with Weakly Supervised Semantic Segmentation [16.863038973001483]
This work introduces three techniques for diffusion-synthetic semantic segmentation training.
First, reliability-aware robust training, originally used in weakly supervised learning, helps segmentation with insufficient synthetic mask quality.
Second, large-scale pretraining of whole segmentation models, not only backbones, on synthetic ImageNet-1k-class images with pixel-labels benefits downstream segmentation tasks.
Third, we introduce prompt augmentation, a data augmentation applied to the prompt text set to scale up and diversify training images from limited text resources.
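Prompt augmentation as described can be as simple as crossing a small class vocabulary with a set of caption templates; the template strings below are illustrative, not the paper's actual prompt set.

```python
from itertools import product

def augment_prompts(class_names, templates):
    """Expand a limited class vocabulary into a larger, more diverse set of
    text prompts by combining every class with every caption template."""
    return [tpl.format(name) for name, tpl in product(class_names, templates)]
```

For example, 20 classes crossed with 10 templates yields 200 distinct prompts, each of which can drive the synthesis of different training images.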
arXiv Detail & Related papers (2023-09-04T05:34:19Z) - DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models [68.21154597227165]
We show that it is possible to automatically obtain accurate semantic masks of synthetic images generated by an off-the-shelf Stable Diffusion model.
Our approach, called DiffuMask, exploits the potential of the cross-attention map between text and image.
arXiv Detail & Related papers (2023-03-21T08:43:15Z) - Semantic Image Synthesis via Diffusion Models [174.24523061460704]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks. Recent work on semantic image synthesis mainly follows the de facto GAN-based approaches. We propose a novel framework based on DDPMs for semantic image synthesis.
arXiv Detail & Related papers (2022-06-30T18:31:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.