Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images
- URL: http://arxiv.org/abs/2507.12698v1
- Date: Thu, 17 Jul 2025 00:17:50 GMT
- Title: Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images
- Authors: Zahra TehraniNasab, Amar Kumar, Tal Arbel
- Abstract summary: We introduce Pixel Perfect MegaMed, the first vision-language foundation model to synthesize images at resolutions of 1024x1024. Our method deploys a multi-scale transformer architecture designed specifically for ultra-high resolution medical image generation. By leveraging vision-language alignment techniques tailored to medical terminology and imaging modalities, Pixel Perfect MegaMed bridges the gap between textual descriptions and visual representations at unprecedented resolution levels.
- Score: 0.8397730500554048
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Medical image synthesis presents unique challenges due to the inherent complexity and high-resolution details required in clinical contexts. Traditional generative architectures such as Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs) have shown great promise for high-resolution image generation but struggle to preserve the fine-grained details that are key for accurate diagnosis. To address this issue, we introduce Pixel Perfect MegaMed, the first vision-language foundation model to synthesize images at resolutions of 1024x1024. Our method deploys a multi-scale transformer architecture designed specifically for ultra-high resolution medical image generation, enabling the preservation of both global anatomical context and local image-level details. By leveraging vision-language alignment techniques tailored to medical terminology and imaging modalities, Pixel Perfect MegaMed bridges the gap between textual descriptions and visual representations at unprecedented resolution levels. We apply our model to the CheXpert dataset and demonstrate its ability to generate clinically faithful chest X-rays from text prompts. Beyond visual quality, these high-resolution synthetic images prove valuable for downstream tasks such as classification, showing measurable performance gains when used for data augmentation, particularly in low-data regimes. Our code is accessible through the project website - https://tehraninasab.github.io/pixelperfect-megamed.
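The abstract's augmentation claim can be made concrete with a minimal sketch of mixing synthetic images into a small real training set before classifier training. This is not the authors' pipeline; the function name, the cap ratio, and the toy data are all hypothetical.

```python
import numpy as np

def augment_with_synthetic(real_images, real_labels, synth_images, synth_labels,
                           synth_ratio=0.5):
    """Mix synthetic samples into a small real training set.

    synth_ratio caps synthetic images at a fraction of the real set, so
    synthetic data supplements rather than dominates training.
    """
    n_synth = min(len(synth_images), int(len(real_images) * synth_ratio))
    images = np.concatenate([real_images, synth_images[:n_synth]], axis=0)
    labels = np.concatenate([real_labels, synth_labels[:n_synth]], axis=0)
    # Shuffle so real and synthetic samples are interleaved in each batch.
    order = np.random.permutation(len(images))
    return images[order], labels[order]

# Toy example: 20 real "X-rays" augmented with up to 50% synthetic ones
# (arrays downscaled to 64x64 so the demo runs instantly).
rng = np.random.default_rng(0)
real_x = rng.random((20, 64, 64)).astype(np.float32)
real_y = rng.integers(0, 2, 20)
synth_x = rng.random((40, 64, 64)).astype(np.float32)
synth_y = rng.integers(0, 2, 40)

aug_x, aug_y = augment_with_synthetic(real_x, real_y, synth_x, synth_y)
print(aug_x.shape)  # (30, 64, 64)
```

In a low-data regime the cap matters: letting synthetic samples outnumber real ones can pull the classifier toward generator artifacts rather than clinical features.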
Related papers
- MedIL: Implicit Latent Spaces for Generating Heterogeneous Medical Images at Arbitrary Resolutions [2.2427832125073732]
MedIL is a first-of-its-kind autoencoder built for encoding medical images with heterogeneous sizes and resolutions. We show how MedIL compresses and preserves clinically-relevant features over large multi-site, multi-resolution datasets.
arXiv Detail & Related papers (2025-04-12T19:52:56Z)
- A Unified Model for Compressed Sensing MRI Across Undersampling Patterns [69.19631302047569]
We propose a unified MRI reconstruction model robust to various measurement undersampling patterns and image resolutions. Our model improves SSIM by 11% and PSNR by 4 dB over a state-of-the-art CNN (End-to-End VarNet), with 600× faster inference than diffusion methods.
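PSNR, the metric behind the "4 dB" figure above, has a standard definition that is easy to state in code. The sketch below is a plain NumPy implementation (not tied to any of the listed papers) and shows that a 4 dB PSNR gain corresponds to cutting the mean squared error by roughly 60%.

```python
import numpy as np

def psnr(reference, reconstruction, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((reference.astype(np.float64) -
                   reconstruction.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Two reconstructions of the same reference, the second with its pixel
# error scaled down by 10**(-4/20) so it scores exactly 4 dB higher.
ref = np.zeros((8, 8))
recon_a = ref + 0.10
recon_b = ref + 0.10 * 10 ** (-4 / 20)
print(round(psnr(ref, recon_b) - psnr(ref, recon_a), 2))  # 4.0
```

Because PSNR is logarithmic in MSE, fixed-dB gains compound: each additional 3 dB halves the squared error again.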
arXiv Detail & Related papers (2024-10-05T20:03:57Z)
- TransResNet: Integrating the Strengths of ViTs and CNNs for High Resolution Medical Image Segmentation via Feature Grafting [6.987177704136503]
High-resolution images are preferable in medical imaging domain as they significantly improve the diagnostic capability of the underlying method.
Most of the existing deep learning-based techniques for medical image segmentation are optimized for input images having small spatial dimensions and perform poorly on high-resolution images.
We propose a parallel-in-branch architecture called TransResNet, which incorporates Transformer and CNN in a parallel manner to extract features from multi-resolution images independently.
arXiv Detail & Related papers (2024-10-01T18:22:34Z)
- Disruptive Autoencoders: Leveraging Low-level features for 3D Medical Image Pre-training [51.16994853817024]
This work focuses on designing an effective pre-training framework for 3D radiology images.
We introduce Disruptive Autoencoders, a pre-training framework that attempts to reconstruct the original image from disruptions created by a combination of local masking and low-level perturbations.
The proposed pre-training framework is tested across multiple downstream tasks and achieves state-of-the-art performance.
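The corruption step described above (local masking combined with low-level perturbations) can be sketched as follows. This is an illustrative 2D toy, not the paper's exact 3D recipe; the patch size, mask fraction, and noise level are made-up defaults.

```python
import numpy as np

def disrupt(image, patch=4, mask_frac=0.3, noise_std=0.05, rng=None):
    """Corrupt an image by masking random patches and adding mild noise.

    A reconstruction network would then be pre-trained to recover `image`
    from the returned corrupted copy.
    """
    rng = rng or np.random.default_rng()
    out = image.copy()
    h, w = image.shape
    # Local masking: zero out a random fraction of non-overlapping patches.
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            if rng.random() < mask_frac:
                out[y:y + patch, x:x + patch] = 0.0
    # Low-level perturbation: additive Gaussian noise over the whole image.
    out += rng.normal(0.0, noise_std, size=out.shape)
    return out

rng = np.random.default_rng(0)
img = rng.random((16, 16))
corrupted = disrupt(img, rng=rng)
print(corrupted.shape)  # (16, 16)
```

The combination matters: masking forces the model to hallucinate structure from context, while the noise forces it to attend to low-level intensity statistics rather than copy pixels through.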
arXiv Detail & Related papers (2023-07-31T17:59:42Z)
- Fine-tuned Generative Adversarial Network-based Model for Medical Image Super-Resolution [2.647302105102753]
Real-Enhanced Super-Resolution Generative Adversarial Network (Real-ESRGAN) is a practical model for recovering HR images from real-world LR images.
We employ the high-order degradation model of the Real-ESRGAN which better simulates real-world image degradations.
The proposed model achieves superior perceptual quality compared to the Real-ESRGAN model, effectively preserving fine details and generating images with more realistic textures.
arXiv Detail & Related papers (2022-11-01T16:48:04Z)
- Histopathology DatasetGAN: Synthesizing Large-Resolution Histopathology Datasets [0.0]
Histopathology datasetGAN (HDGAN) is a framework for image generation and segmentation that scales well to large-resolution histopathology images.
We make several adaptations from the original framework, including updating the generative backbone, selectively extracting latent features from the generator, and switching to memory-mapped arrays.
We evaluate HDGAN on a thrombotic microangiopathy high-resolution tile dataset, demonstrating strong performance on the high-resolution image-annotation generation task.
arXiv Detail & Related papers (2022-07-06T14:33:50Z)
- Medical Transformer: Gated Axial-Attention for Medical Image Segmentation [73.98974074534497]
We study the feasibility of using Transformer-based network architectures for medical image segmentation tasks.
We propose a Gated Axial-Attention model which extends the existing architectures by introducing an additional control mechanism in the self-attention module.
To train the model effectively on medical images, we propose a Local-Global training strategy (LoGo) which further improves the performance.
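The core idea of axial attention, which the gated variant above builds on, is to attend along one spatial axis at a time instead of over all pixel pairs. The NumPy sketch below is heavily simplified: queries, keys, and values are the raw features (no learned projections), and the paper's learned gates on the attention terms are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axial_attention(feat):
    """Self-attention along the H axis, then the W axis, of an (H, W, C) map.

    Full 2D self-attention costs O((HW)^2) pairs; attending along each
    axis separately costs O(HW * (H + W)).
    """
    def attend(x):  # plain scaled dot-product attention over the first axis
        scores = np.einsum('ic,jc->ij', x, x) / np.sqrt(x.shape[-1])
        return softmax(scores, axis=-1) @ x
    # Height pass: each column attends within itself.
    h_out = np.stack([attend(feat[:, w]) for w in range(feat.shape[1])], axis=1)
    # Width pass: each row of the result attends within itself.
    w_out = np.stack([attend(h_out[h]) for h in range(feat.shape[0])], axis=0)
    return w_out

x = np.random.default_rng(0).random((8, 8, 16))
print(axial_attention(x).shape)  # (8, 8, 16)
```

The two passes together give every position an indirect path to every other position, which is what lets the factorized form stand in for full 2D attention on high-resolution feature maps.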
arXiv Detail & Related papers (2021-02-21T18:35:14Z)
- Multi-Texture GAN: Exploring the Multi-Scale Texture Translation for Brain MR Images [1.9163481966968943]
A significant percentage of existing algorithms cannot explicitly exploit and preserve texture details from target scanners.
In this paper, we design a multi-scale texture transfer to enrich the reconstruction images with more details.
Our method achieves superior results in inter-protocol or inter-scanner translation over state-of-the-art methods.
arXiv Detail & Related papers (2021-02-14T19:14:06Z)
- TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation [78.01570371790669]
Medical image segmentation is an essential prerequisite for developing healthcare systems.
On various medical image segmentation tasks, the u-shaped architecture, also known as U-Net, has become the de-facto standard.
We propose TransUNet, which merits both Transformers and U-Net, as a strong alternative for medical image segmentation.
arXiv Detail & Related papers (2021-02-08T16:10:50Z)
- SAFRON: Stitching Across the Frontier for Generating Colorectal Cancer Histology Images [2.486942181212742]
Synthetic images can be used for the development and evaluation of deep learning algorithms in the context of limited availability of data.
We propose a novel SAFRON framework to construct realistic, large high resolution tissue image tiles from ground truth annotations.
We show that the proposed method can generate realistic image tiles of arbitrarily large size after training it on relatively small image patches.
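The assembly half of the tile-stitching idea can be sketched with a naive blend: paste overlapping tiles onto a shared canvas and average wherever they overlap. SAFRON itself learns seam-free generation across tile frontiers, so this hypothetical helper only illustrates the geometry, not the method.

```python
import numpy as np

def stitch_tiles(tiles, positions, out_shape):
    """Assemble overlapping tiles into one large image, averaging overlaps.

    tiles: list of (h, w) arrays; positions: matching (top, left) offsets.
    """
    canvas = np.zeros(out_shape, dtype=np.float64)
    weight = np.zeros(out_shape, dtype=np.float64)
    for tile, (top, left) in zip(tiles, positions):
        h, w = tile.shape
        canvas[top:top + h, left:left + w] += tile
        weight[top:top + h, left:left + w] += 1.0
    # Divide accumulated values by coverage count to average the overlaps.
    return canvas / np.maximum(weight, 1.0)

# Four 32x32 tiles with an 8-pixel overlap cover a 56x56 output.
rng = np.random.default_rng(0)
tiles = [rng.random((32, 32)) for _ in range(4)]
positions = [(0, 0), (0, 24), (24, 0), (24, 24)]
big = stitch_tiles(tiles, positions, (56, 56))
print(big.shape)  # (56, 56)
```

Plain averaging leaves visible seams when adjacent tiles disagree; the point of training on overlapping frontiers is to make the generated tiles agree so no blending is needed.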
arXiv Detail & Related papers (2020-08-11T05:47:00Z)
- Hierarchical Amortized Training for Memory-efficient High Resolution 3D GAN [52.851990439671475]
We propose a novel end-to-end GAN architecture that can generate high-resolution 3D images.
We achieve this goal by using different configurations between training and inference.
Experiments on 3D thorax CT and brain MRI demonstrate that our approach outperforms state of the art in image generation.
arXiv Detail & Related papers (2020-08-05T02:33:04Z)
- Towards Coding for Human and Machine Vision: A Scalable Image Coding Approach [104.02201472370801]
We come up with a novel image coding framework by leveraging both the compressive and the generative models.
By introducing advanced generative models, we train a flexible network to reconstruct images from compact feature representations and the reference pixels.
Experimental results demonstrate the superiority of our framework in both human visual quality and facial landmark detection.
arXiv Detail & Related papers (2020-01-09T10:37:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.