Related papers: Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion

Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion

URL: http://arxiv.org/abs/2505.20053v1
Date: Mon, 26 May 2025 14:42:35 GMT
Title: Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion
Authors: Zheqi Lv, Junhao Chen, Qi Tian, Keting Yin, Shengyu Zhang, Fei Wu,
Abstract summary: MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD) is a novel framework that introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference.<n>It performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps.<n>Extensive experiments demonstrate PPAD's significant improvements.
Score: 52.315729095824906
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion models have become the mainstream architecture for text-to-image generation, achieving remarkable progress in visual quality and prompt controllability. However, current inference pipelines generally lack interpretable semantic supervision and correction mechanisms throughout the denoising process. Most existing approaches rely solely on post-hoc scoring of the final image, prompt filtering, or heuristic resampling strategies-making them ineffective in providing actionable guidance for correcting the generative trajectory. As a result, models often suffer from object confusion, spatial errors, inaccurate counts, and missing semantic elements, severely compromising prompt-image alignment and image quality. To tackle these challenges, we propose MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD), a novel framework that, for the first time, introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. PPAD performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. The framework supports both inference-only and training-enhanced settings, and performs semantic correction at only extremely few diffusion steps, offering strong generality and scalability. Extensive experiments demonstrate PPAD's significant improvements.

Related papers

Cross-Subject Mind Decoding from Inaccurate Representations [42.19569985029642]
We propose a Bi Autoencoder Intertwining framework for accurate decoded representation prediction.<n>Our method outperforms state-of-the-art approaches on benchmark datasets in both qualitative and quantitative evaluations.
arXiv Detail & Related papers (2025-07-25T08:45:02Z)
Beyond Fully Supervised Pixel Annotations: Scribble-Driven Weakly-Supervised Framework for Image Manipulation Localization [11.10178274806454]
We propose a form of weak supervision that improves the annotation efficiency and detection performance.<n>We re-annotated mainstream IML datasets with scribble labels and propose the first scribble-based IML dataset.<n>We employ self-supervised training with a structural consistency loss to encourage the model to produce consistent predictions.
arXiv Detail & Related papers (2025-07-17T11:45:27Z)
Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints [15.541287957548771]
We propose a Coarse-to-fine Consistency Constraints Visual Grounding architecture.<n>It integrates implicit and explicit modeling approaches within a two-stage framework.<n>It significantly outperforms state-of-the-art REC and RIS methods by a substantial margin.
arXiv Detail & Related papers (2025-01-12T04:30:13Z)
Multi-Modality Driven LoRA for Adverse Condition Depth Estimation [61.525312117638116]
We propose Multi-Modality Driven LoRA (MMD-LoRA) for Adverse Condition Depth Estimation.<n>It consists of two core components: Prompt Driven Domain Alignment (PDDA) and Visual-Text Consistent Contrastive Learning (VTCCL)<n>It achieves state-of-the-art performance on the nuScenes and Oxford RobotCar datasets.
arXiv Detail & Related papers (2024-12-28T14:23:58Z)
Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis [9.11767497956649]
This paper proposes leveraging the language comprehension capabilities of large vision-language models to guide the optimization of the initial noisy latent. We introduce the Noise Diffusion process, which updates the noisy latent to generate semantically faithful images while preserving distribution consistency. Experimental results demonstrate the effectiveness and adaptability of our framework, consistently enhancing semantic alignment across various diffusion models.
arXiv Detail & Related papers (2024-11-25T15:40:47Z)
Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation [54.96563068182733]
We propose Modality Adaptation with text-to-image Diffusion Models (MADM) for semantic segmentation task. MADM utilizes text-to-image diffusion models pre-trained on extensive image-text pairs to enhance the model's cross-modality capabilities. We show that MADM achieves state-of-the-art adaptation performance across various modality tasks, including images to depth, infrared, and event modalities.
arXiv Detail & Related papers (2024-10-29T03:49:40Z)
SCA: Improve Semantic Consistent in Unrestricted Adversarial Attacks via DDPM Inversion [27.7252951625431]
We propose a novel framework called Semantic-Consistent Unrestricted Adversarial Attacks (SCA)<n>SCA employs an inversion method to extract edit-friendly noise maps and utilizes a Multimodal Large Language Model (MLLM) to provide semantic guidance.<n>Our framework enables the efficient generation of adversarial examples that exhibit minimal discernible semantic changes.
arXiv Detail & Related papers (2024-10-03T06:25:53Z)
Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing. Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the underlying Large Language Models (LLMs) prior to the input image. To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
Multiscale Structure Guided Diffusion for Image Deblurring [24.09642909404091]
Diffusion Probabilistic Models (DPMs) have been employed for image deblurring. We introduce a simple yet effective multiscale structure guidance as an implicit bias. We demonstrate more robust deblurring results with fewer artifacts on unseen data.
arXiv Detail & Related papers (2022-12-04T10:40:35Z)
Diffusion Visual Counterfactual Explanations [51.077318228247925]
Visual Counterfactual Explanations (VCEs) are an important tool to understand the decisions of an image. Current approaches for the generation of VCEs are restricted to adversarially robust models and often contain non-realistic artefacts. In this paper, we overcome this by generating Visual Diffusion Counterfactual Explanations (DVCEs) for arbitrary ImageNet classifiers.
arXiv Detail & Related papers (2022-10-21T09:35:47Z)
Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text. These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining. We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.