IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance
- URL: http://arxiv.org/abs/2509.26231v1
- Date: Tue, 30 Sep 2025 13:27:03 GMT
- Title: IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance
- Authors: Jiayi Guo, Chuanhao Yan, Xingqian Xu, Yulin Wang, Kai Wang, Gao Huang, Humphrey Shi
- Abstract summary: Implicit Multimodal Guidance (IMG) is a novel re-generation-based multimodal alignment framework. IMG identifies misalignments and formulates the re-alignment goal into a trainable objective. IMG acts as a flexible plug-and-play adapter, seamlessly enhancing prior finetuning-based alignment methods.
- Score: 74.89810064807142
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Ensuring precise multimodal alignment between diffusion-generated images and input prompts has been a long-standing challenge. Earlier works finetune diffusion weights using high-quality preference data, which tends to be limited and difficult to scale up. Recent editing-based methods further refine local regions of generated images but may compromise overall image quality. In this work, we propose Implicit Multimodal Guidance (IMG), a novel re-generation-based multimodal alignment framework that requires no extra data or editing operations. Specifically, given a generated image and its prompt, IMG a) utilizes a multimodal large language model (MLLM) to identify misalignments; b) introduces an Implicit Aligner that manipulates diffusion conditioning features to reduce misalignments and enable re-generation; and c) formulates the re-alignment goal into a trainable objective, namely the Iteratively Updated Preference Objective. Extensive qualitative and quantitative evaluations on SDXL, SDXL-DPO, and FLUX show that IMG outperforms existing alignment methods. Furthermore, IMG acts as a flexible plug-and-play adapter, seamlessly enhancing prior finetuning-based alignment methods. Our code will be available at https://github.com/SHI-Labs/IMG-Multimodal-Diffusion-Alignment.
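For readers who want a concrete picture of the re-generation loop sketched in the abstract, the snippet below is a minimal, hypothetical illustration: a small residual module stands in for the Implicit Aligner, and identify_misalignments stands in for the MLLM feedback step. None of these names, dimensions, or interfaces come from the paper or its repository, and the Iteratively Updated Preference Objective that actually trains the aligner is omitted.

```python
# Hypothetical sketch only: module names, dimensions, and interfaces are
# placeholders, not the authors' code (see the linked repository for that).
import torch
import torch.nn as nn

class ImplicitAligner(nn.Module):
    """Residual MLP that nudges frozen diffusion conditioning features
    toward an MLLM-derived correction signal (illustrative design)."""
    def __init__(self, cond_dim: int = 2048, feedback_dim: int = 768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(cond_dim + feedback_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, cond: torch.Tensor, feedback: torch.Tensor) -> torch.Tensor:
        # cond: (B, T, cond_dim) prompt conditioning; feedback: (B, feedback_dim)
        fb = feedback.unsqueeze(1).expand(-1, cond.size(1), -1)
        return cond + self.proj(torch.cat([cond, fb], dim=-1))  # residual update

def img_realign(aligner, diffusion, mllm, prompt_cond, image, prompt):
    """One re-generation pass: the MLLM flags misalignments, the aligner
    rewrites the conditioning, and the diffusion model re-generates."""
    feedback = mllm.identify_misalignments(image, prompt)   # (B, feedback_dim), assumed interface
    new_cond = aligner(prompt_cond, feedback)               # calibrated conditioning features
    return diffusion.generate(new_cond)                     # re-generated, better-aligned image
```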
Related papers
- DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis [63.59932602411222]
DMAligner is a diffusion-based framework for image alignment through alignment-oriented view synthesis. We propose a Dynamics-aware Diffusion Training approach for learning conditional image generation. We develop the Dynamic Scene Image Alignment (DSIA) dataset using Blender, which includes 1,033 indoor and outdoor scenes with over 30K image pairs tailored for image alignment.
arXiv Detail & Related papers (2026-02-26T14:00:07Z)
- Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach [99.80480649258557]
DiTFuse is an instruction-driven framework that performs semantics-aware fusion within a single model. Experiments on public IVIF, MFF, and MEF benchmarks confirm superior quantitative and qualitative performance, sharper textures, and better semantic retention.
arXiv Detail & Related papers (2025-12-08T05:04:54Z) - VLMDiff: Leveraging Vision-Language Models for Multi-Class Anomaly Detection with Diffusion [15.486565360380203]
VLMDiff is a novel unsupervised multi-class visual anomaly detection framework. It integrates a Latent Diffusion Model (LDM) with a Vision-Language Model (VLM) for enhanced anomaly localization and detection. Our method achieves competitive performance, improving the pixel-level Per-Region-Overlap (PRO) metric by up to 25 points on the Real-IAD dataset and 8 points on the COCO-AD dataset.
arXiv Detail & Related papers (2025-11-11T12:37:38Z)
- Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation [63.50827603618498]
We propose Lavida-O, a unified Masked Diffusion Model (MDM) for multimodal understanding and generation. Lavida-O presents a single framework that enables image-level understanding, object grounding, image editing, and high-resolution text-to-image synthesis. Lavida-O achieves state-of-the-art performance on a wide range of benchmarks including RefCOCO object grounding, GenEval text-to-image generation, and ImgEdit image editing.
arXiv Detail & Related papers (2025-09-23T17:05:46Z)
- Marmot: Object-Level Self-Correction via Multi-Agent Reasoning [55.74860093731475]
Marmot is a novel and generalizable framework that leverages multi-agent reasoning for multi-object self-correction. Marmot significantly improves accuracy in object counting, attribute assignment, and spatial relationships for image generation tasks.
arXiv Detail & Related papers (2025-04-10T16:54:28Z)
- OSDM-MReg: Multimodal Image Registration based One Step Diffusion Model [16.850096473419505]
Multimodal remote sensing image registration aligns images from different sensors for data fusion and analysis. We propose OSDM-MReg, a novel multimodal image registration framework that bridges the modality gap through image-to-image translation.
arXiv Detail & Related papers (2025-04-08T13:32:56Z)
- Unified Multimodal Discrete Diffusion [78.48930545306654]
Multimodal generative models that can understand and generate across multiple modalities are dominated by autoregressive (AR) approaches. We explore discrete diffusion models as a unified generative formulation in the joint text and image domain. We present the first Unified Multimodal Discrete Diffusion (UniDisc) model, which is capable of jointly understanding and generating text and images.
arXiv Detail & Related papers (2025-03-26T17:59:51Z)
- EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM [38.8308841469793]
This paper introduces EasyRef, a novel plug-and-play adaptation method that enables diffusion models to be conditioned on multiple reference images and the text prompt. We leverage the multi-image comprehension and instruction-following capabilities of the multimodal large language model (MLLM) to exploit consistent visual elements within multiple images. Experimental results demonstrate EasyRef surpasses both tuning-free methods like IP-Adapter and tuning-based methods like LoRA, achieving superior aesthetic quality and robust zero-shot generalization across diverse domains.
arXiv Detail & Related papers (2024-12-12T18:59:48Z)
- Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation [54.96563068182733]
We propose Modality Adaptation with text-to-image Diffusion Models (MADM) for the semantic segmentation task.
MADM utilizes text-to-image diffusion models pre-trained on extensive image-text pairs to enhance the model's cross-modality capabilities.
We show that MADM achieves state-of-the-art adaptation performance across various modality tasks, including images to depth, infrared, and event modalities.
arXiv Detail & Related papers (2024-10-29T03:49:40Z)
- Large Language Models for Multimodal Deformable Image Registration [50.91473745610945]
We propose a novel coarse-to-fine MDIR framework, LLM-Morph, for aligning the deep features from different modal medical images.
Specifically, we first utilize a CNN encoder to extract deep visual features from cross-modal image pairs, then we use the first adapter to adjust these tokens, and use LoRA in pre-trained LLMs to fine-tune their weights.
Third, for the alignment of tokens, we utilize four other adapters to transform the LLM-encoded tokens into multi-scale visual features, generating multi-scale deformation fields and facilitating the coarse-to-fine MDIR task.
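Since this entry walks through the pipeline step by step, a rough data-flow sketch may help; every module name, dimension, and the LoRA-tuned LLM interface below are assumptions for illustration, not LLM-Morph's actual code.

```python
# Illustrative data flow only; not the authors' implementation.
import torch
import torch.nn as nn

class LLMMorphSketch(nn.Module):
    def __init__(self, llm, feat_dim=256, llm_dim=4096, scales=4):
        super().__init__()
        self.encoder = nn.Conv2d(2, feat_dim, 3, padding=1)       # stand-in CNN encoder
        self.first_adapter = nn.Linear(feat_dim, llm_dim)         # visual features -> LLM tokens
        self.llm = llm                                             # pre-trained LLM, LoRA-tuned (assumed callable)
        self.field_adapters = nn.ModuleList(                       # four decoding adapters
            [nn.Linear(llm_dim, 2) for _ in range(scales)]         # 2-channel deformation per scale
        )

    def forward(self, moving, fixed):
        # moving / fixed: (B, 1, H, W) single-channel medical images (assumed)
        feats = self.encoder(torch.cat([moving, fixed], dim=1))    # (B, C, H, W)
        tokens = self.first_adapter(feats.flatten(2).transpose(1, 2))  # (B, H*W, llm_dim)
        tokens = self.llm(tokens)                                  # LLM-encoded tokens
        # One deformation field per scale; the real model composes them coarse-to-fine.
        return [head(tokens) for head in self.field_adapters]
```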
arXiv Detail & Related papers (2024-08-20T09:58:30Z)
- MM-Diff: High-Fidelity Image Personalization via Multi-Modal Condition Integration [7.087475633143941]
MM-Diff is a tuning-free image personalization framework capable of generating high-fidelity images of both single and multiple subjects in seconds.
MM-Diff employs a vision encoder to transform the input image into CLS and patch embeddings.
CLS embeddings are used both to augment the text embeddings and, together with the patch embeddings, to derive a small number of detail-rich subject embeddings.
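The embedding flow described above can be pictured with a short sketch; the dimensions, the query-attention design for deriving subject tokens, and all names below are illustrative assumptions rather than MM-Diff's actual architecture.

```python
# Illustrative sketch of CLS/patch embedding routing; not MM-Diff's code.
import torch
import torch.nn as nn

class SubjectConditioner(nn.Module):
    def __init__(self, vis_dim=1024, text_dim=768, n_subject_tokens=4):
        super().__init__()
        self.cls_to_text = nn.Linear(vis_dim, text_dim)                   # CLS -> text augmentation
        self.queries = nn.Parameter(torch.randn(n_subject_tokens, text_dim))
        self.to_kv = nn.Linear(vis_dim, text_dim)
        self.attn = nn.MultiheadAttention(text_dim, 8, batch_first=True)

    def forward(self, cls_emb, patch_emb, text_emb):
        # cls_emb: (B, vis_dim); patch_emb: (B, N, vis_dim); text_emb: (B, T, text_dim)
        text_aug = torch.cat([text_emb, self.cls_to_text(cls_emb).unsqueeze(1)], dim=1)
        kv = self.to_kv(torch.cat([cls_emb.unsqueeze(1), patch_emb], dim=1))
        q = self.queries.unsqueeze(0).expand(cls_emb.size(0), -1, -1)
        subject_emb, _ = self.attn(q, kv, kv)       # few detail-rich subject tokens
        return text_aug, subject_emb                 # both injected as diffusion conditions
```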
arXiv Detail & Related papers (2024-03-22T09:32:31Z)
- LLMGA: Multimodal Large Language Model based Generation Assistant [53.150283805515926]
We introduce a Multimodal Large Language Model-based Generation Assistant (LLMGA) to assist users in image generation and editing.
We train the MLLM to grasp the properties of image generation and editing, enabling it to generate detailed prompts.
Extensive results show that LLMGA has promising generation and editing capabilities and can enable more flexible and expansive applications.
arXiv Detail & Related papers (2023-11-27T13:37:26Z)
- MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation [34.61940502872307]
MultiDiffusion is a unified framework that enables versatile and controllable image generation.
We show that MultiDiffusion can be readily applied to generate high-quality and diverse images.
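The core idea, fusing several diffusion paths into one consistent image, can be illustrated by averaging denoiser predictions over overlapping latent windows at every step, which is roughly what the paper's least-squares reconciliation reduces to with uniform weights; the window size, stride, and denoiser interface below are assumptions.

```python
# Minimal sketch of a MultiDiffusion-style fusion step (assumed interfaces).
import torch

def fused_denoise_step(denoiser, latents, t, cond, window=64, stride=32):
    """Run `denoiser` on overlapping windows of `latents` and average overlaps.
    Assumes (H - window) and (W - window) are multiples of `stride` so the
    windows tile the full latent; otherwise edge pixels would be uncovered."""
    b, c, h, w = latents.shape
    out = torch.zeros_like(latents)
    count = torch.zeros_like(latents)
    for y in range(0, max(h - window, 0) + 1, stride):
        for x in range(0, max(w - window, 0) + 1, stride):
            crop = latents[:, :, y:y + window, x:x + window]
            pred = denoiser(crop, t, cond)                    # per-window noise prediction
            out[:, :, y:y + window, x:x + window] += pred
            count[:, :, y:y + window, x:x + window] += 1
    return out / count.clamp(min=1)                           # per-pixel average over overlaps
```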
arXiv Detail & Related papers (2023-02-16T06:28:29Z)
- Blended Latent Diffusion [18.043090347648157]
We present an accelerated solution to the task of local text-driven editing of generic images, where the desired edits are confined to a user-provided mask.
Our solution leverages a recent text-to-image Latent Diffusion Model (LDM), which speeds up diffusion by operating in a lower-dimensional latent space.
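A compact way to picture the mask-confined editing described above: after each denoising step, latents outside the (downsampled) user mask are overwritten with a correspondingly noised version of the original image's latent, so only the masked region follows the text prompt. The diffusers-style scheduler interface and the single noise level used for blending are simplifying assumptions, not the paper's exact API.

```python
# Sketch of mask-blended latent denoising (assumed diffusers-style scheduler).
import torch

def blended_step(denoiser, scheduler, z_t, t, text_cond, z0_orig, mask):
    """One denoising step that preserves the background outside `mask` (1 = edit)."""
    noise_pred = denoiser(z_t, t, text_cond)                  # text-guided noise prediction
    z_prev = scheduler.step(noise_pred, t, z_t).prev_sample   # standard scheduler update
    noise = torch.randn_like(z0_orig)
    z_bg = scheduler.add_noise(z0_orig, noise, t)             # original latent, noised to level t
    return mask * z_prev + (1 - mask) * z_bg                  # edit inside mask, keep background
```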
arXiv Detail & Related papers (2022-06-06T17:58:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.