Related papers: LLMControl: Grounded Control of Text-to-Image Diffusion-based Synthesis with Multimodal LLMs

URL: http://arxiv.org/abs/2507.19939v1
Date: Sat, 26 Jul 2025 12:57:02 GMT
Title: LLMControl: Grounded Control of Text-to-Image Diffusion-based Synthesis with Multimodal LLMs
Authors: Jiaze Wang, Rui Chen, Haowang Cui,
Abstract summary: We present a framework called LLM_Control to address the challenges of the controllable T2I generation task. By improving grounding capabilities, LLM_Control is introduced to accurately modulate the pre-trained diffusion models. We utilize the multimodal LLM as a global controller to arrange spatial layouts, augment semantic descriptions and bind object attributes.
Score: 3.6016438645365834
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent spatial control methods for text-to-image (T2I) diffusion models have shown compelling results. However, these methods still fail to precisely follow the control conditions and generate the corresponding images, especially when encountering textual prompts that contain multiple objects or have complex spatial compositions. In this work, we present an LLM-guided framework called LLM_Control to address the challenges of the controllable T2I generation task. By improving grounding capabilities, LLM_Control is introduced to accurately modulate the pre-trained diffusion models, where visual conditions and textual prompts influence the structures and appearance generation in a complementary way. We utilize the multimodal LLM as a global controller to arrange spatial layouts, augment semantic descriptions and bind object attributes. The obtained control signals are injected into the denoising network to refocus and enhance attention maps according to novel sampling constraints. Extensive qualitative and quantitative experiments have demonstrated that LLM_Control achieves competitive synthesis quality compared to other state-of-the-art methods across various pre-trained T2I models. It is noteworthy that LLM_Control allows the challenging input conditions on which most of the existing methods

Related papers

CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion [62.04833878126661]
We tackle the dual challenges of video understanding and controllable video generation within a unified diffusion framework. We propose CtrlVDiff, a unified diffusion model trained with a Hybrid Modality Control Strategy (HMCS) that routes and fuses features from depth, normals, segmentation, edges, and graphics-based intrinsics (albedo, roughness, metallic). Across understanding and generation benchmarks, CtrlVDiff delivers superior controllability and fidelity, enabling layer-wise edits (relighting, material adjustment, object insertion) and surpassing state-of-the-art baselines while remaining robust when some modalities are unavailable.
arXiv Detail & Related papers (2025-11-26T07:27:11Z)
Controlling Multimodal LLMs via Reward-guided Decoding [17.5544679985101]
We study the adaptation of Multimodal Large Language Models (MLLMs) through controlled decoding. Our method involves building reward models for visual grounding and using them to guide the MLLM's decoding process. We evaluate our method on standard object hallucination benchmarks, showing that it provides significant controllability over MLLM inference.
arXiv Detail & Related papers (2025-08-15T17:29:06Z)
DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation [63.63429658282696]
We propose DynamicControl, which supports dynamic combinations of diverse control signals. We show that DynamicControl is superior to existing methods in terms of controllability, generation quality and composability under various conditional controls.
arXiv Detail & Related papers (2024-12-04T11:54:57Z)
Training-Free Layout-to-Image Generation with Marginal Attention Constraints [73.55660250459132]
We propose a training-free layout-to-image (L2I) approach, which eliminates the need for additional modules or fine-tuning. Specifically, we use text-visual cross-attention feature maps to quantify inconsistencies between the layout of the generated images and the provided instructions. We leverage pixel-to-pixel correlations in the self-attention feature maps to align cross-attention maps and combine three loss functions constrained by boundary attention to update latent features.
arXiv Detail & Related papers (2024-11-15T05:44:45Z)
AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation [24.07613591217345]
Linguistic control enables effective content creation, but struggles with fine-grained control over image generation. AnyControl develops a novel Multi-Control framework that extracts a unified multi-modal embedding to guide the generation process. This approach enables a holistic understanding of user inputs, and produces high-quality, faithful results under versatile control signals.
arXiv Detail & Related papers (2024-06-27T07:40:59Z)
Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control [73.6361029556484]
Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs. We consider pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts. We show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.
arXiv Detail & Related papers (2024-05-09T15:39:54Z)
FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition [41.92032568474062]
FreeControl is a training-free approach for controllable T2I generation. It supports multiple conditions, architectures, and checkpoints simultaneously. It achieves competitive synthesis quality with training-based approaches.
arXiv Detail & Related papers (2023-12-12T18:59:14Z)
Fine-grained Controllable Video Generation via Object Appearance and Context [74.23066823064575]
We propose fine-grained controllable video generation (FACTOR) to achieve detailed control. FACTOR aims to control objects' appearances and context, including their location and category. Our method achieves controllability of object appearances without finetuning, which reduces the per-subject optimization efforts for the users.
arXiv Detail & Related papers (2023-12-05T17:47:33Z)
Self-correcting LLM-controlled Diffusion Models [83.26605445217334]
We introduce Self-correcting LLM-controlled Diffusion (SLD). SLD is a framework that generates an image from the input prompt, assesses its alignment with the prompt, and performs self-corrections on the inaccuracies in the generated image. Our approach can rectify a majority of incorrect generations, particularly in generative numeracy, attribute binding, and spatial relationships.
arXiv Detail & Related papers (2023-11-27T18:56:37Z)
Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation [79.8881514424969]
Text-conditional diffusion models are able to generate high-fidelity images with diverse contents. However, linguistic representations frequently exhibit ambiguous descriptions of the envisioned objective imagery. We propose Cocktail, a pipeline to mix various modalities into one embedding.
arXiv Detail & Related papers (2023-06-01T17:55:32Z)
Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models [82.19740045010435]
We introduce Uni-ControlNet, a unified framework that allows for the simultaneous utilization of different local controls and global controls. Unlike existing methods, Uni-ControlNet only requires the fine-tuning of two additional adapters upon frozen pre-trained text-to-image diffusion models. Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality and composability.
arXiv Detail & Related papers (2023-05-25T17:59:58Z)

This list is automatically generated from the titles and abstracts of the papers on this site.