IMAGHarmony: Controllable Image Editing with Consistent Object Quantity and Layout
- URL: http://arxiv.org/abs/2506.01949v2
- Date: Thu, 09 Oct 2025 17:55:51 GMT
- Title: IMAGHarmony: Controllable Image Editing with Consistent Object Quantity and Layout
- Authors: Fei Shen, Yutong Gao, Jian Yu, Xiaoyu Du, Jinhui Tang
- Abstract summary: We study quantity- and layout-consistent image editing, abbreviated as QL-Edit, in multi-object scenes. We present IMAGHarmony, a framework featuring a plug-and-play harmony-aware (HA) module that fuses perception semantics while modeling object counts and locations. We also present a preference-guided noise selection (PNS) strategy that selects semantically aligned initial noise through vision-language matching.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent diffusion models have advanced image editing by improving fidelity and controllability across creative and personalized applications. However, multi-object scenes remain challenging, as reliable control over object categories, counts, and spatial layout is difficult to achieve. To this end, we first study quantity- and layout-consistent image editing, abbreviated as QL-Edit, which targets control of object quantity and spatial layout in multi-object scenes. We then present IMAGHarmony, a straightforward framework featuring a plug-and-play harmony-aware (HA) module that fuses perception semantics while modeling object counts and locations, resulting in accurate edits and strong structural consistency. We further observe that diffusion models are sensitive to the choice of initial noise and tend to prefer certain noise patterns. Based on this finding, we present a preference-guided noise selection (PNS) strategy that selects semantically aligned initial noise through vision-language matching, thereby further improving generation stability and layout consistency in multi-object editing. To support evaluation, we develop HarmonyBench, a comprehensive benchmark that covers a diverse range of quantity- and layout-control scenarios. Extensive experiments demonstrate that IMAGHarmony outperforms prior methods in both structural alignment and semantic accuracy, using only 200 training images and 10.6M trainable parameters. Code, models, and data are available at https://github.com/muzishen/IMAGHarmony.
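The PNS idea described above, sampling several initial noise candidates and keeping the one a scorer prefers, can be sketched roughly as follows. This is an illustrative sketch only: `score_fn`, the latent shape, the candidate count, and the toy scorer are assumptions, not the paper's implementation (which scores via vision-language matching on decoded previews).

```python
import numpy as np

def select_initial_noise(score_fn, shape=(4, 64, 64), num_candidates=8, seed=0):
    """Sample several Gaussian noise candidates and keep the best-scoring one.

    score_fn maps a noise tensor to a scalar alignment score; in the paper
    this role is played by vision-language matching, here it is a stand-in.
    """
    rng = np.random.default_rng(seed)
    candidates = [rng.standard_normal(shape) for _ in range(num_candidates)]
    scores = [score_fn(z) for z in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

# Toy scorer for demonstration: prefer noise whose mean is closest to zero.
toy_score = lambda z: -abs(float(z.mean()))
noise, score = select_initial_noise(toy_score)
```

The selected `noise` would then be passed to the diffusion sampler as the fixed initial latent, replacing an arbitrary random draw.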
Related papers
- Towards Source-Aware Object Swapping with Initial Noise Perturbation [10.974803680416876]
SourceSwap is a self-supervised, source-aware framework that learns cross-object alignment. We train a dual U-Net with full-source conditioning and a noise-free reference encoder, enabling direct inter-object alignment. Experiments demonstrate that SourceSwap achieves superior fidelity, stronger scene preservation, and more natural harmony.
arXiv Detail & Related papers (2026-02-27T05:54:29Z) - Training-Free Multi-Concept Image Editing [14.75123947134721]
Concept Distillation Sampling (CDS) is a training-free framework for target-less, multi-concept image editing. It overcomes the linguistic bottleneck of previous methods by integrating a highly stable distillation backbone. Our method preserves instance fidelity without requiring reference samples of the desired edit.
arXiv Detail & Related papers (2026-02-24T12:27:51Z) - iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation [60.66986667921744]
iMontage is a unified framework designed to repurpose a powerful video model into an all-in-one image generator. We propose an elegant and minimally invasive adaptation strategy, complemented by a tailored data-curation process and training paradigm. This approach allows the model to acquire broad image-manipulation capabilities without corrupting its invaluable original motion priors.
arXiv Detail & Related papers (2025-11-25T18:54:16Z) - LORE: Latent Optimization for Precise Semantic Control in Rectified Flow-based Image Editing [0.276240219662896]
We introduce LORE, a training-free and efficient image editing method. LORE directly optimizes the inverted noise, addressing the core limitations of existing approaches in generalization and controllability. Experimental results show that LORE significantly outperforms strong baselines in semantic alignment, image quality, and background fidelity.
arXiv Detail & Related papers (2025-08-05T06:45:04Z) - LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer [32.9330637921386]
LAMIC is a Layout-Aware Multi-Image Composition framework. It extends single-reference diffusion models to multi-reference scenarios in a training-free manner. It consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R, and AVG scores across all settings.
arXiv Detail & Related papers (2025-08-01T09:51:54Z) - Marmot: Multi-Agent Reasoning for Multi-Object Self-Correcting in Improving Image-Text Alignment [55.74860093731475]
Marmot is a novel framework that employs Multi-Agent Reasoning for Multi-Object Self-Correcting. We construct a multi-agent self-correcting system featuring a decision-execution-verification mechanism. Experiments demonstrate that Marmot significantly improves accuracy in object counting, attribute assignment, and spatial relationships.
arXiv Detail & Related papers (2025-04-10T16:54:28Z) - DefMamba: Deformable Visual State Space Model [65.50381013020248]
We propose a novel visual foundation model called DefMamba. By incorporating a deformable scanning (DS) strategy, this model significantly improves its ability to learn image structures and detect changes in object details. Numerous experiments have shown that DefMamba achieves state-of-the-art performance on various visual tasks.
arXiv Detail & Related papers (2025-04-08T08:22:54Z) - STAY Diffusion: Styled Layout Diffusion Model for Diverse Layout-to-Image Generation [4.769823364778397]
We propose a diffusion-based model that produces photo-realistic images and provides fine-grained control over stylized objects in scenes. Our approach learns a global condition for each layout and a self-supervised semantic map for weight modulation. A new Styled-Mask Attention (SM Attention) is also introduced to cross-condition the global condition and image features, capturing the objects' relationships.
arXiv Detail & Related papers (2025-03-15T17:36:24Z) - PSF-4D: A Progressive Sampling Framework for View Consistent 4D Editing [10.331089974537873]
We introduce a progressive sampling framework for 4D editing (PSF-4D). For temporal coherence, we design a correlated Gaussian noise structure that links frames over time. For spatial consistency across views, we implement a cross-view noise model. Our approach enables high-quality 4D editing without relying on external models.
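A correlated Gaussian noise structure of the kind mentioned above can be sketched as an AR(1)-style chain across frames; the mixing weight `gamma` and shapes here are illustrative assumptions, not PSF-4D's actual construction:

```python
import numpy as np

def correlated_frame_noise(num_frames, shape, gamma=0.9, seed=0):
    """AR(1)-style correlated Gaussian noise across frames.

    Each frame's noise carries sqrt(gamma) of the previous frame's noise
    plus sqrt(1 - gamma) of fresh noise, so adjacent frames stay correlated
    while every frame remains marginally N(0, 1). gamma is an assumed knob.
    """
    rng = np.random.default_rng(seed)
    noises = [rng.standard_normal(shape)]
    for _ in range(num_frames - 1):
        fresh = rng.standard_normal(shape)
        noises.append(np.sqrt(gamma) * noises[-1] + np.sqrt(1 - gamma) * fresh)
    return np.stack(noises)

z = correlated_frame_noise(num_frames=8, shape=(4, 32, 32))
```

Feeding such a stack as the initial latents makes nearby frames denoise toward similar results, which is the temporal-coherence effect the summary describes.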
arXiv Detail & Related papers (2025-03-14T03:16:42Z) - EditAR: Unified Conditional Generation with Autoregressive Models [58.093860528672735]
We propose EditAR, a single unified autoregressive framework for a variety of conditional image generation tasks. The model takes both images and instructions as inputs and predicts the edited image tokens in a vanilla next-token paradigm. We evaluate its effectiveness across diverse tasks on established benchmarks, showing performance competitive with various state-of-the-art task-specific methods.
arXiv Detail & Related papers (2025-01-08T18:59:35Z) - Edicho: Consistent Image Editing in the Wild [90.42395533938915]
Edicho steps in with a training-free solution based on diffusion models. It features a fundamental design principle of using explicit image correspondence to direct editing.
arXiv Detail & Related papers (2024-12-30T16:56:44Z) - Diffusion Model-Based Image Editing: A Survey [46.244266782108234]
Denoising diffusion models have emerged as a powerful tool for various image generation and editing tasks. We provide an exhaustive overview of existing methods that use diffusion models for image editing. To further evaluate the performance of text-guided image editing algorithms, we propose a systematic benchmark, EditEval.
arXiv Detail & Related papers (2024-02-27T14:07:09Z) - CM-GAN: Image Inpainting with Cascaded Modulation GAN and Object-Aware Training [112.96224800952724]
We propose cascaded modulation GAN (CM-GAN) to generate plausible image structures when dealing with large holes in complex images.
In each decoder block, global modulation is first applied to perform coarse, semantic-aware structure synthesis; spatial modulation is then applied to the output of global modulation to further adjust the feature map in a spatially adaptive fashion.
In addition, we design an object-aware training scheme to prevent the network from hallucinating new objects inside holes, fulfilling the needs of object removal tasks in real-world scenarios.
arXiv Detail & Related papers (2022-03-22T16:13:27Z) - Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets.
This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets.
We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z) - TSIT: A Simple and Versatile Framework for Image-to-Image Translation [103.92203013154403]
We introduce a simple and versatile framework for image-to-image translation.
We provide a carefully designed two-stream generative model with newly proposed feature transformations.
This allows multi-scale semantic structure information and style representation to be effectively captured and fused by the network.
A systematic study compares the proposed method with several state-of-the-art task-specific baselines, verifying its effectiveness in both perceptual quality and quantitative evaluations.
arXiv Detail & Related papers (2020-07-23T15:34:06Z) - Object-Centric Image Generation from Layouts [93.10217725729468]
We develop a layout-to-image-generation method to generate complex scenes with multiple objects.
Our method learns representations of the spatial relationships between objects in the scene, which lead to our model's improved layout-fidelity.
We introduce SceneFID, an object-centric adaptation of the popular Fréchet Inception Distance metric that is better suited to multi-object images.
arXiv Detail & Related papers (2020-03-16T21:40:09Z)