PathDiff: Histopathology Image Synthesis with Unpaired Text and Mask Conditions
- URL: http://arxiv.org/abs/2506.23440v1
- Date: Mon, 30 Jun 2025 00:31:03 GMT
- Title: PathDiff: Histopathology Image Synthesis with Unpaired Text and Mask Conditions
- Authors: Mahesh Bhosale, Abdul Wasi, Yuanhao Zhai, Yunjie Tian, Samuel Border, Nan Xi, Pinaki Sarder, Junsong Yuan, David Doermann, Xuan Gong
- Abstract summary: Public datasets lack paired text and mask data for the same histopathology images. We propose PathDiff, a diffusion framework that effectively learns from unpaired mask-text data. PathDiff allows precise control over structural and contextual features, generating high-quality, semantically accurate images.
- Score: 38.32128533564591
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion-based generative models have shown promise in synthesizing histopathology images to address data scarcity caused by privacy constraints. Diagnostic text reports provide high-level semantic descriptions, and masks offer fine-grained spatial structures essential for representing distinct morphological regions. However, public datasets lack paired text and mask data for the same histopathological images, limiting their joint use in image generation. This constraint restricts the ability to fully exploit the benefits of combining both modalities for enhanced control over semantics and spatial details. To overcome this, we propose PathDiff, a diffusion framework that effectively learns from unpaired mask-text data by integrating both modalities into a unified conditioning space. PathDiff allows precise control over structural and contextual features, generating high-quality, semantically accurate images. PathDiff also improves image fidelity, text-image alignment, and faithfulness, enhancing data augmentation for downstream tasks like nuclei segmentation and classification. Extensive experiments demonstrate its superiority over existing methods.
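The abstract's central idea is a unified conditioning space that lets a diffusion model train on samples carrying only a text report or only a mask. The paper does not publish its conditioning code here; the following is a minimal, dependency-free sketch of one common way to realize this (classifier-free-guidance-style null tokens for the missing modality), with all names and values hypothetical:

```python
DIM = 4  # toy embedding dimension

# A learned "null" token would stand in for the absent modality during
# training, letting the model consume text-only and mask-only samples;
# zeros are used here purely for illustration.
NULL_TOKEN = [0.0] * DIM

def unified_condition(text_emb=None, mask_emb=None):
    """Concatenate text and mask embeddings into one conditioning vector,
    substituting the null token for whichever modality is missing."""
    t = text_emb if text_emb is not None else NULL_TOKEN
    m = mask_emb if mask_emb is not None else NULL_TOKEN
    return t + m

# Unpaired samples from different datasets map into the same space.
text_only = unified_condition(text_emb=[0.3, -1.2, 0.5, 0.9])
mask_only = unified_condition(mask_emb=[1.1, 0.0, -0.4, 0.2])
```

Because every sample produces a fixed-size vector regardless of which modalities are present, a single denoiser can be trained on both data sources and, at inference, accept text, mask, or both.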
Related papers
- Graph Conditioned Diffusion for Controllable Histopathology Image Generation [26.102552837222103]
We propose graph-based object-level representations for Graph-Conditioned-Diffusion. Our approach generates graph nodes corresponding to each major structure in the image, encapsulating their individual features and relationships. We evaluate this approach using a real-world histopathology use case, demonstrating that our generated data can reliably substitute for annotated patient data in downstream segmentation tasks.
arXiv Detail & Related papers (2025-10-08T15:26:08Z) - Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers [56.76198904599581]
Text-to-image diffusion models excel at translating language prompts into images, implicitly grounding concepts through their cross-modal attention mechanisms. Recent multi-modal diffusion transformers extend this by introducing joint self-attention over image and text tokens, enabling richer and more scalable cross-modal alignment. We introduce Seg4Diff, a systematic framework for analyzing the attention structures of MM-DiT, with a focus on how specific layers propagate semantic information from text to image.
arXiv Detail & Related papers (2025-09-22T17:59:54Z) - CoSimGen: Controllable Diffusion Model for Simultaneous Image and Mask Generation [1.9393128408121891]
Existing generative models fail to address the need for high-quality, simultaneous image-mask generation. We propose CoSimGen, a diffusion-based framework for controllable simultaneous image and mask generation. CoSimGen achieves state-of-the-art performance across all datasets, with the lowest KID of 0.11 and LPIPS of 0.53.
arXiv Detail & Related papers (2025-03-25T13:48:22Z) - RL4Med-DDPO: Reinforcement Learning for Controlled Guidance Towards Diverse Medical Image Generation using Vision-Language Foundation Models [0.7165255458140439]
Vision-Language Foundation Models (VLFMs) have shown tremendous gains in generating high-resolution, photorealistic natural images. We propose a multi-stage architecture where a pre-trained VLFM provides a coarse semantic understanding, while a reinforcement learning algorithm refines the alignment through an iterative process. The reward signal is designed to align the semantic information of the text with synthesized images.
arXiv Detail & Related papers (2025-03-20T01:51:05Z) - Synthetic Lung X-ray Generation through Cross-Attention and Affinity Transformation [4.956977275061966]
This paper introduces a new method for the automatic generation of accurate semantic masks from synthetic lung X-ray images. It uses cross-attention mapping between text and image to extend text-driven image synthesis to semantic mask generation. The experimental results demonstrate that segmentation models trained on synthetic data generated using the method are comparable to, and in some cases even better than, models trained on real datasets.
arXiv Detail & Related papers (2025-03-10T11:48:26Z) - Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision [87.15580604023555]
Unpair-Seg is a novel weakly-supervised open-vocabulary segmentation framework.
It learns from unpaired image-mask and image-text pairs, which can be independently and efficiently collected.
It achieves 14.6% and 19.5% mIoU on the ADE-847 and PASCAL Context-459 datasets.
arXiv Detail & Related papers (2024-02-14T06:01:44Z) - Learned representation-guided diffusion models for large-image generation [58.192263311786824]
We introduce a novel approach that trains diffusion models conditioned on embeddings from self-supervised learning (SSL)
Our diffusion models successfully project these features back to high-quality histopathology and remote sensing images.
Augmenting real data by generating variations of real images improves downstream accuracy for patch-level and larger, image-scale classification tasks.
arXiv Detail & Related papers (2023-12-12T14:45:45Z) - MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask [84.84034179136458]
A crucial factor leading to the text-image mismatch issue is the inadequate cross-modality relation learning.
We propose an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features.
Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models.
arXiv Detail & Related papers (2023-09-08T15:53:37Z) - Exploring Semantic Consistency in Unpaired Image Translation to Generate Data for Surgical Applications [1.8011391924021904]
This study empirically investigates unpaired image translation methods for generating suitable data in surgical applications.
We find that a simple combination of structural-similarity loss and contrastive learning yields the most promising results.
arXiv Detail & Related papers (2023-09-06T14:43:22Z) - Zero-shot spatial layout conditioning for text-to-image diffusion models [52.24744018240424]
Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling.
We consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content.
We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models.
arXiv Detail & Related papers (2023-06-23T19:24:48Z) - Hierarchical Diffusion Autoencoders and Disentangled Image Manipulation [36.20575570779196]
We exploit the fine-grained-to-abstract and low-level-to-high-level feature hierarchy of the latent space of diffusion models.
The hierarchical latent space of HDAE inherently encodes different abstract levels of semantics and provides more comprehensive semantic representations.
We demonstrate the effectiveness of our proposed approach with extensive experiments and applications on image reconstruction, style mixing, controllable, detail-preserving and disentangled image manipulation.
arXiv Detail & Related papers (2023-04-24T05:35:59Z) - DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models [68.21154597227165]
We show that it is possible to automatically obtain accurate semantic masks of synthetic images generated by the off-the-shelf Stable Diffusion model.
Our approach, called DiffuMask, exploits the potential of the cross-attention map between text and image.
arXiv Detail & Related papers (2023-03-21T08:43:15Z)
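Several of the entries above (MaskDiffusion, DiffuMask, the lung X-ray work) hinge on turning text-to-image cross-attention maps into spatial masks. A minimal, dependency-free sketch of that thresholding step follows; the normalization scheme, threshold value, and token labels are all illustrative assumptions, not any paper's exact recipe:

```python
def attention_to_mask(attn_row, threshold=0.5):
    """Min-max normalize one text token's attention over spatial
    positions to [0, 1], then threshold into a binary spatial mask."""
    lo, hi = min(attn_row), max(attn_row)
    span = (hi - lo) or 1.0  # guard against a flat attention row
    return [1 if (a - lo) / span > threshold else 0 for a in attn_row]

# One row per text token, one column per spatial position (toy values).
attn = [
    [0.1, 0.9, 0.2],  # token "nuclei" attends mostly to position 1
    [0.8, 0.3, 0.7],  # token "stroma" attends to positions 0 and 2
]
masks = [attention_to_mask(row) for row in attn]
```

In practice the attention maps come from the denoiser's cross-attention layers at several resolutions and are averaged before thresholding, but the per-token map-to-mask reduction is the shared core of these methods.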
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.