A Unified Framework for Multimodal Image Reconstruction and Synthesis using Denoising Diffusion Models
- URL: http://arxiv.org/abs/2602.08249v1
- Date: Mon, 09 Feb 2026 03:54:24 GMT
- Title: A Unified Framework for Multimodal Image Reconstruction and Synthesis using Denoising Diffusion Models
- Authors: Weijie Gan, Xucheng Wang, Tongyao Wang, Wenshang Wang, Chunwei Ying, Yuyang Hu, Yasheng Chen, Hongyu An, Ulugbek S. Kamilov,
- Abstract summary: We introduce Any2all, a unified framework that addresses the limitations of existing methods.<n>We train a single, unconditional diffusion model on the complete multimodal data stack.<n>This model is adapted at inference time to inpaint'' all target modalities from any combination of inputs of available clean images or noisy measurements.<n>Our results show that Any2all can achieve excellent performance on both multimodal reconstruction and synthesis tasks.
- Score: 12.36766048544934
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image reconstruction and image synthesis are important for handling incomplete multimodal imaging data, but existing methods require various task-specific models, complicating training and deployment workflows. We introduce Any2all, a unified framework that addresses this limitation by formulating these disparate tasks as a single virtual inpainting problem. We train a single, unconditional diffusion model on the complete multimodal data stack. This model is then adapted at inference time to ``inpaint'' all target modalities from any combination of inputs of available clean images or noisy measurements. We validated Any2all on a PET/MR/CT brain dataset. Our results show that Any2all can achieve excellent performance on both multimodal reconstruction and synthesis tasks, consistently yielding images with competitive distortion-based performance and superior perceptual quality over specialized methods.
Related papers
- Multi-View Learning with Context-Guided Receptance for Image Denoising [18.175992709188026]
Image denoising is essential in low-level vision applications such as photography and automated driving.<n>Existing methods struggle with distinguishing complex noise patterns in real-world scenes and consume significant computational resources.<n>In this work, a Context-guided Receptance Weighted Key-Value (M) model is proposed, combining enhanced multi-view feature integration with efficient sequence modeling.<n>The model is validated on multiple real-world image denoising datasets, outperforming the existing state-of-the-art methods quantitatively and reducing inference time up to 40%.
arXiv Detail & Related papers (2025-05-05T14:57:43Z) - UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing [59.590505989071175]
Text-to-Image (T2I) diffusion models have shown impressive results in generating visually compelling images following user prompts.<n>We introduce UniVG, a generalist diffusion model capable of supporting a diverse range of image generation tasks with a single set of weights.
arXiv Detail & Related papers (2025-03-16T21:11:25Z) - Many-to-many Image Generation with Auto-regressive Diffusion Models [59.5041405824704]
This paper introduces a domain-general framework for many-to-many image generation, capable of producing interrelated image series from a given set of images.
We present MIS, a novel large-scale multi-image dataset, containing 12M synthetic multi-image samples, each with 25 interconnected images.
We learn M2M, an autoregressive model for many-to-many generation, where each image is modeled within a diffusion framework.
arXiv Detail & Related papers (2024-04-03T23:20:40Z) - Exposure Bracketing Is All You Need For A High-Quality Image [50.822601495422916]
Multi-exposure images are complementary in denoising, deblurring, high dynamic range imaging, and super-resolution.<n>We propose to utilize exposure bracketing photography to get a high-quality image by combining these tasks in this work.<n>In particular, a temporally modulated recurrent network (TMRNet) and self-supervised adaptation method are proposed.
arXiv Detail & Related papers (2024-01-01T14:14:35Z) - Image Generation with Multimodal Priors using Denoising Diffusion
Probabilistic Models [54.1843419649895]
A major challenge in using generative models to accomplish this task is the lack of paired data containing all modalities and corresponding outputs.
We propose a solution based on a denoising diffusion probabilistic synthesis models to generate images under multi-model priors.
arXiv Detail & Related papers (2022-06-10T12:23:05Z) - DiVAE: Photorealistic Images Synthesis with Denoising Diffusion Decoder [73.1010640692609]
We propose a VQ-VAE architecture model with a diffusion decoder (DiVAE) to work as the reconstructing component in image synthesis.
Our model achieves state-of-the-art results and generates more photorealistic images specifically.
arXiv Detail & Related papers (2022-06-01T10:39:12Z) - Cascading Modular Network (CAM-Net) for Multimodal Image Synthesis [7.726465518306907]
A persistent challenge has been to generate diverse versions of output images from the same input image.
We propose CAM-Net, a unified architecture that can be applied to a broad range of tasks.
It is capable of generating convincing high frequency details, achieving a reduction of the Frechet Inception Distance (FID) by up to 45.3% compared to the baseline.
arXiv Detail & Related papers (2021-06-16T17:58:13Z) - Multimodal Image Synthesis with Conditional Implicit Maximum Likelihood
Estimation [54.17177006826262]
We develop a new generic conditional image synthesis method based on Implicit Maximum Likelihood Estimation (IMLE)
We demonstrate improved multimodal image synthesis performance on two tasks, single image super-resolution and image synthesis from scene layouts.
arXiv Detail & Related papers (2020-04-07T03:06:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.