DreamFuse: Adaptive Image Fusion with Diffusion Transformer
- URL: http://arxiv.org/abs/2504.08291v1
- Date: Fri, 11 Apr 2025 06:49:33 GMT
- Title: DreamFuse: Adaptive Image Fusion with Diffusion Transformer
- Authors: Junjia Huang, Pengxiang Yan, Jiyang Liu, Jie Wu, Zhao Wang, Yitong Wang, Liang Lin, Guanbin Li
- Abstract summary: We propose an iterative human-in-the-loop data generation pipeline to generate image fusion datasets. We introduce DreamFuse, a novel approach to generate consistent and harmonious fused images with both foreground and background information. DreamFuse achieves harmonious fusion while generalizing to text-driven editing of the fused results.
- Score: 85.69622262632066
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image fusion seeks to seamlessly integrate foreground objects with background scenes, producing realistic and harmonious fused images. Unlike existing methods that directly insert objects into the background, adaptive and interactive fusion remains a challenging yet appealing task. It requires the foreground to adjust or interact with the background context, enabling more coherent integration. To address this, we propose an iterative human-in-the-loop data generation pipeline, which leverages limited initial data with diverse textual prompts to generate fusion datasets across various scenarios and interactions, including placement, holding, wearing, and style transfer. Building on this, we introduce DreamFuse, a novel approach based on the Diffusion Transformer (DiT) model, to generate consistent and harmonious fused images with both foreground and background information. DreamFuse employs a Positional Affine mechanism to inject the size and position of the foreground into the background, enabling effective foreground-background interaction through shared attention. Furthermore, we apply Localized Direct Preference Optimization guided by human feedback to refine DreamFuse, enhancing background consistency and foreground harmony. DreamFuse achieves harmonious fusion while generalizing to text-driven attribute editing of the fused results. Experimental results demonstrate that our method outperforms state-of-the-art approaches across multiple metrics.
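The abstract only names the Positional Affine mechanism and the shared attention; it does not give their formulation. The sketch below is a rough illustration of the idea as described: the foreground's target size and position are injected as an affine remap of token coordinates, and foreground and background tokens then attend jointly in one pass. All names (`affine_positions`, `shared_attention`, `pos_proj`, `qkv`) and shapes are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def affine_positions(hw, box, device="cpu"):
    """Grid coordinates for an hw=(h, w) token map, remapped into `box`.

    `box` = (x0, y0, x1, y1) in normalized background coordinates: the region
    where the foreground should land. Returns (h*w, 2) positions in [0, 1].
    """
    h, w = hw
    ys = torch.linspace(0.0, 1.0, h, device=device)
    xs = torch.linspace(0.0, 1.0, w, device=device)
    grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)  # (h, w, 2)
    x0, y0, x1, y1 = box
    scale = torch.tensor([y1 - y0, x1 - x0], device=device)
    offset = torch.tensor([y0, x0], device=device)
    return (grid * scale + offset).reshape(-1, 2)  # affine: size + position injected

def shared_attention(fg_tokens, bg_tokens, fg_pos, bg_pos, pos_proj, qkv, heads=8):
    """One joint attention pass over concatenated foreground/background tokens.

    fg_tokens: (B, Nf, C), bg_tokens: (B, Nb, C); positions are added via a simple
    learned projection (a stand-in for whatever positional scheme the DiT uses).
    """
    tokens = torch.cat([fg_tokens, bg_tokens], dim=1)      # (B, Nf+Nb, C)
    pos = torch.cat([fg_pos, bg_pos], dim=0)               # (Nf+Nb, 2)
    tokens = tokens + pos_proj(pos)                        # inject remapped coordinates
    B, N, C = tokens.shape
    q, k, v = qkv(tokens).chunk(3, dim=-1)
    q = q.view(B, N, heads, C // heads).transpose(1, 2)
    k = k.view(B, N, heads, C // heads).transpose(1, 2)
    v = v.view(B, N, heads, C // heads).transpose(1, 2)
    out = F.scaled_dot_product_attention(q, k, v)          # fg and bg attend jointly
    return out.transpose(1, 2).reshape(B, N, C)

# Example wiring (all dimensions are placeholders):
B, C, heads = 1, 64, 8
fg_tokens, bg_tokens = torch.randn(B, 16 * 16, C), torch.randn(B, 32 * 32, C)
pos_proj, qkv = torch.nn.Linear(2, C), torch.nn.Linear(C, 3 * C)
fg_pos = affine_positions((16, 16), box=(0.6, 0.5, 0.9, 0.9))  # fg lands in this box
bg_pos = affine_positions((32, 32), box=(0.0, 0.0, 1.0, 1.0))  # bg covers the full frame
fused = shared_attention(fg_tokens, bg_tokens, fg_pos, bg_pos, pos_proj, qkv, heads)
```

Under this reading, placement information reaches the background stream only through attention over remapped coordinates rather than through pixel-space pasting, which matches the abstract's claim of effective foreground-background interaction through shared attention; the actual DreamFuse formulation may differ.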
Related papers
- Text-DiFuse: An Interactive Multi-Modal Image Fusion Framework based on Text-modulated Diffusion Model [30.739879255847946]
Existing multi-modal image fusion methods fail to address the compound degradations present in source images.
This study proposes a novel interactive multi-modal image fusion framework based on the text-modulated diffusion model, called Text-DiFuse.
arXiv Detail & Related papers (2024-10-31T13:10:50Z)
- Fusion from Decomposition: A Self-Supervised Approach for Image Fusion and Beyond [74.96466744512992]
The essence of image fusion is to integrate complementary information from source images.
DeFusion++ produces versatile fused representations that can enhance the quality of image fusion and the effectiveness of downstream high-level vision tasks.
arXiv Detail & Related papers (2024-10-16T06:28:49Z)
- DreamMover: Leveraging the Prior of Diffusion Models for Image Interpolation with Large Motion [35.60459492849359]
We study the problem of generating intermediate images from image pairs with large motion.
Due to the large motion, the intermediate semantic information may be absent in input images.
We propose DreamMover, a novel image interpolation framework with three main components.
arXiv Detail & Related papers (2024-09-15T04:09:12Z)
- From Text to Pixels: A Context-Aware Semantic Synergy Solution for Infrared and Visible Image Fusion [66.33467192279514]
We introduce a text-guided multi-modality image fusion method that leverages the high-level semantics from textual descriptions to integrate semantics from infrared and visible images.
Our method not only produces visually superior fusion results but also achieves a higher detection mAP over existing methods, achieving state-of-the-art results.
arXiv Detail & Related papers (2023-12-31T08:13:47Z)
- TextFusion: Unveiling the Power of Textual Semantics for Controllable Image Fusion [38.61215361212626]
We propose a text-guided fusion paradigm for advanced image fusion.
We release a text-annotated image fusion dataset IVT.
Our approach consistently outperforms traditional appearance-based fusion methods.
arXiv Detail & Related papers (2023-12-21T09:25:10Z)
- Deep Image Harmonization with Globally Guided Feature Transformation and Relation Distillation [20.302430505018]
We show that using global information to guide foreground feature transformation could achieve significant improvement.
We also propose to transfer the foreground-background relation from real images to composite images, which can provide intermediate supervision for the transformed encoder features.
arXiv Detail & Related papers (2023-08-01T07:53:25Z)
- DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing [94.24479528298252]
DragGAN is an interactive point-based image editing framework that achieves impressive editing results with pixel-level precision.
By harnessing large-scale pretrained diffusion models, we greatly enhance the applicability of interactive point-based editing on both real and diffusion-generated images.
We present a challenging benchmark dataset called DragBench to evaluate the performance of interactive point-based image editing methods.
arXiv Detail & Related papers (2023-06-26T06:04:09Z)
- A Task-guided, Implicitly-searched and Meta-initialized Deep Model for Image Fusion [69.10255211811007]
We present a Task-guided, Implicitly-searched and Meta-initialized (TIM) deep model to address the image fusion problem in a challenging real-world scenario.
Specifically, we propose a constrained strategy to incorporate information from downstream tasks to guide the unsupervised learning process of image fusion.
Within this framework, we then design an implicit search scheme to automatically discover compact architectures for our fusion model with high efficiency.
arXiv Detail & Related papers (2023-05-25T08:54:08Z)
- Voxel Field Fusion for 3D Object Detection [140.6941303279114]
We present a conceptually simple framework for cross-modality 3D object detection, named voxel field fusion.
The proposed approach aims to maintain cross-modality consistency by representing and fusing augmented image features as a ray in the voxel field.
The framework is demonstrated to achieve consistent gains in various benchmarks and outperforms previous fusion-based methods on KITTI and nuScenes datasets.
arXiv Detail & Related papers (2022-05-31T16:31:36Z)