Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion
- URL: http://arxiv.org/abs/2412.14462v1
- Date: Thu, 19 Dec 2024 02:23:13 GMT
- Title: Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion
- Authors: Jixuan He, Wanhua Li, Ye Liu, Junsik Kim, Donglai Wei, Hanspeter Pfister
- Abstract summary: We extend the concept of Affordance from human-centered image composition tasks to a more general object-scene composition framework. We propose a Mask-Aware Dual Diffusion (MADD) model, which utilizes a dual-stream architecture to simultaneously denoise the RGB image and the insertion mask. Our method outperforms the state-of-the-art methods and exhibits strong generalization performance on in-the-wild images.
- Score: 29.770096013143117
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As a common image editing operation, image composition involves integrating foreground objects into background scenes. In this paper, we extend the concept of Affordance from human-centered image composition tasks to a more general object-scene composition framework, addressing the complex interplay between foreground objects and background scenes. Following the principle of Affordance, we define the affordance-aware object insertion task, which aims to seamlessly insert any object into any scene given various position prompts. To address the limited data available for this task, we constructed the SAM-FB dataset, which contains over 3 million examples across more than 3,000 object categories. Furthermore, we propose the Mask-Aware Dual Diffusion (MADD) model, which utilizes a dual-stream architecture to simultaneously denoise the RGB image and the insertion mask. By explicitly modeling the insertion mask in the diffusion process, MADD effectively incorporates the notion of affordance. Extensive experiments show that our method outperforms state-of-the-art methods and exhibits strong generalization to in-the-wild images. Our code is available at https://github.com/KaKituken/affordance-aware-any.
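To make the dual-stream idea concrete, below is a minimal, hedged sketch of a joint denoising step: a shared backbone ingests the noisy RGB latent, the noisy insertion mask, and conditioning features, and two heads predict the noise for each stream so both are denoised simultaneously. All names (DualStreamDenoiser, cond, etc.) and shapes are illustrative assumptions rather than the authors' MADD implementation; timestep embedding and the real noise schedule are omitted for brevity.

```python
# Illustrative sketch only: a dual-stream denoiser that jointly predicts noise
# for the RGB image and the insertion mask, conditioned on (assumed) object and
# position-prompt features. Not the authors' MADD implementation.
import torch
import torch.nn as nn

class DualStreamDenoiser(nn.Module):
    def __init__(self, channels=64, cond_channels=4):
        super().__init__()
        # Shared trunk over concatenated noisy RGB (3ch), noisy mask (1ch),
        # and conditioning features; timestep embedding omitted for brevity.
        self.trunk = nn.Sequential(
            nn.Conv2d(3 + 1 + cond_channels, channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.SiLU(),
        )
        # Two output heads: noise prediction for the RGB stream and the mask stream.
        self.rgb_head = nn.Conv2d(channels, 3, 3, padding=1)
        self.mask_head = nn.Conv2d(channels, 1, 3, padding=1)

    def forward(self, rgb_noisy, mask_noisy, cond):
        h = self.trunk(torch.cat([rgb_noisy, mask_noisy, cond], dim=1))
        return self.rgb_head(h), self.mask_head(h)

# One simplified training step: corrupt both streams with the same schedule
# value and regress the added noise jointly.
model = DualStreamDenoiser()
rgb = torch.randn(2, 3, 64, 64)                      # target composite image
mask = torch.rand(2, 1, 64, 64)                      # target insertion mask
cond = torch.randn(2, 4, 64, 64)                     # object + position-prompt features (assumed)
eps_rgb, eps_mask = torch.randn_like(rgb), torch.randn_like(mask)
alpha_bar = 0.7                                      # stand-in for the schedule value at step t
rgb_noisy = alpha_bar**0.5 * rgb + (1 - alpha_bar)**0.5 * eps_rgb
mask_noisy = alpha_bar**0.5 * mask + (1 - alpha_bar)**0.5 * eps_mask
pred_rgb, pred_mask = model(rgb_noisy, mask_noisy, cond)
loss = nn.functional.mse_loss(pred_rgb, eps_rgb) + nn.functional.mse_loss(pred_mask, eps_mask)
loss.backward()
```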
Related papers
- Insert Anything: Image Insertion via In-Context Editing in DiT [19.733787045511775]
We present a unified framework for reference-based image insertion that seamlessly integrates objects from reference images into target scenes under flexible, user-specified control guidance.
Our approach is trained once on our new AnyInsertion dataset--comprising 120K prompt-image pairs covering diverse tasks such as person, object, and garment insertion--and effortlessly generalizes to a wide range of insertion scenarios.
arXiv Detail & Related papers (2025-04-21T10:19:12Z) - HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation [50.206100327643284]
HiScene is a novel hierarchical framework that bridges the gap between 2D image generation and 3D object generation.
We generate 3D content that aligns with 2D representations while maintaining compositional structure.
arXiv Detail & Related papers (2025-04-17T16:33:39Z) - A Diffusion-Based Framework for Occluded Object Movement [39.6345172890042]
We propose a Diffusion-based framework specifically designed for Occluded Object Movement, named DiffOOM.
The de-occlusion branch utilizes a background color-fill strategy and a continuously updated object mask to focus the diffusion process on completing the obscured portion of the target object.
Concurrently, the movement branch employs latent optimization to place the completed object in the target location and adopts local text-conditioned guidance to integrate the object into new surroundings appropriately.
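As a rough illustration of the de-occlusion branch described above, the loop below completes only the occluded region while keeping visible pixels pinned to the observed image, and refreshes the object mask every few steps. The callables `denoise_step` and `estimate_mask` are placeholders, not the DiffOOM API.

```python
# Illustrative only: mask-focused completion in the spirit of the de-occlusion
# branch. `denoise_step` and `estimate_mask` are placeholder callables;
# mask == 1 marks the occluded region to be completed.
def complete_occluded(latent, observed, mask, denoise_step, estimate_mask, steps=50):
    for t in reversed(range(steps)):
        denoised = denoise_step(latent, t)             # ordinary diffusion update
        # Generate only inside the occluded region; keep visible pixels fixed.
        latent = mask * denoised + (1 - mask) * observed
        if t % 10 == 0:                                # periodically refresh the object mask
            mask = estimate_mask(latent)
    return latent
```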
arXiv Detail & Related papers (2025-04-02T16:29:30Z) - ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation [33.91045409317844]
This paper introduces a tuning-free method for both object insertion and subject-driven generation. The task involves composing an object, given multiple views, into a scene specified by either an image or text. We compare our method with state-of-the-art methods for object insertion and subject-driven generation, using a single or multiple references.
arXiv Detail & Related papers (2024-12-11T18:59:53Z) - Improving Text-guided Object Inpainting with Semantic Pre-inpainting [95.17396565347936]
We decompose the typical single-stage object inpainting into two cascaded processes: semantic pre-inpainting and high-fidelity object generation.
To achieve this, we cascade a Transformer-based semantic inpainter and an object inpainting diffusion model, leading to a novel CAscaded Transformer-Diffusion framework.
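A hedged sketch of such a cascade is shown below; `semantic_inpainter` and `object_diffusion` are placeholder callables standing in for the Transformer-based semantic inpainter and the object inpainting diffusion model, not the paper's actual interface.

```python
# Illustrative two-stage cascade; the callables and their signatures are
# assumptions, not the authors' implementation.
def cascaded_inpaint(image, mask, prompt, semantic_inpainter, object_diffusion):
    # Stage 1: Transformer-based semantic pre-inpainting of the masked region.
    semantics = semantic_inpainter(image, mask, prompt)
    # Stage 2: diffusion model renders a high-fidelity object inside the mask,
    # conditioned on the predicted semantics.
    return object_diffusion(image, mask, prompt, semantics)
```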
arXiv Detail & Related papers (2024-09-12T17:55:37Z) - Pluralistic Salient Object Detection [108.74650817891984]
We introduce pluralistic salient object detection (PSOD), a novel task aimed at generating multiple plausible salient segmentation results for a given input image.
We present two new SOD datasets "DUTS-MM" and "DUTS-MQ", along with newly designed evaluation metrics.
arXiv Detail & Related papers (2024-09-04T01:38:37Z) - Completing Visual Objects via Bridging Generation and Segmentation [84.4552458720467]
MaskComp delineates the completion process through iterative stages of generation and segmentation.
In each iteration, the object mask is provided as an additional condition to boost image generation.
We demonstrate that the combination of one generation and one segmentation stage effectively functions as a mask denoiser.
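The alternating loop this describes can be sketched as follows, with `generate` and `segment` as placeholder callables rather than the MaskComp API: each round, the current mask conditions generation, and the generated image is re-segmented to refine the mask.

```python
# Illustrative sketch of the alternating generate/segment loop; `generate`
# and `segment` are placeholder callables, not the MaskComp interface.
def mask_comp(partial_object, mask, generate, segment, rounds=3):
    image = partial_object
    for _ in range(rounds):
        image = generate(partial_object, mask)   # mask-conditioned completion
        mask = segment(image)                    # re-estimate a cleaner object mask
    return image, mask
```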
arXiv Detail & Related papers (2023-10-01T22:25:40Z) - DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing [94.24479528298252]
DragGAN is an interactive point-based image editing framework that achieves impressive editing results with pixel-level precision.
By harnessing large-scale pretrained diffusion models, we greatly enhance the applicability of interactive point-based editing on both real and diffusion-generated images.
We present a challenging benchmark dataset called DragBench to evaluate the performance of interactive point-based image editing methods.
arXiv Detail & Related papers (2023-06-26T06:04:09Z) - AMICO: Amodal Instance Composition [40.03865667370814]
Image composition aims to blend multiple objects to form a harmonized image.
We present Amodal Instance Composition for blending imperfect objects onto a target image.
Our results show state-of-the-art performance on public COCOA and KINS benchmarks.
arXiv Detail & Related papers (2022-10-11T23:23:14Z) - Exploring the Interactive Guidance for Unified and Effective Image Matting [16.933897631478146]
We propose a Unified Interactive image Matting method, named UIM, which addresses the limitations of existing matting methods and achieves satisfactory matting results.
Specifically, UIM leverages multiple types of user interaction to avoid the ambiguity of multiple matting targets.
We show that UIM achieves state-of-the-art performance on the Composition-1K test set and a synthetic unified dataset.
arXiv Detail & Related papers (2022-05-17T13:20:30Z) - LayoutBERT: Masked Language Layout Model for Object Insertion [3.4806267677524896]
We propose layoutBERT for the object insertion task.
It uses a novel self-supervised masked language model objective and bidirectional multi-head self-attention.
We provide both qualitative and quantitative evaluations on datasets from diverse domains.
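A hedged sketch of such a masked-layout objective is shown below: a bidirectional Transformer encoder is trained to predict a masked layout token (e.g., a discretized object position or category) from the surrounding tokens. Vocabulary, tokenization, and model sizes are assumptions, not the LayoutBERT specification.

```python
# Illustrative masked-layout objective for object insertion; sizes and the
# token vocabulary are assumptions, not the LayoutBERT specification.
import torch
import torch.nn as nn

vocab_size, d_model, mask_id = 1024, 256, 0
embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=4)
head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(1, vocab_size, (2, 16))    # discretized layout sequence
targets = tokens.clone()
masked = tokens.clone()
masked[:, 5] = mask_id                            # hide one object's layout token

logits = head(encoder(embed(masked)))             # bidirectional context fills the blank
loss = nn.functional.cross_entropy(logits[:, 5], targets[:, 5])
loss.backward()
```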
arXiv Detail & Related papers (2022-04-30T21:35:38Z) - Generating Object Stamps [47.20601520671103]
We present an algorithm to generate diverse foreground objects and composite them into background images using a GAN architecture.
Our results on the challenging COCO dataset show improved overall quality and diversity compared to state-of-the-art object insertion approaches.
arXiv Detail & Related papers (2020-01-01T14:36:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.