MultiShadow: Multi-Object Shadow Generation for Image Compositing via Diffusion Model
- URL: http://arxiv.org/abs/2603.02743v3
- Date: Thu, 05 Mar 2026 02:53:15 GMT
- Title: MultiShadow: Multi-Object Shadow Generation for Image Compositing via Diffusion Model
- Authors: Waqas Ahmed, Dean Diepeveen, Ferdous Sohel
- Abstract summary: Multi-object shadow generation is crucial for seamless image compositing. In this paper, we aim to synthesize physically plausible shadows for multiple inserted objects. Our approach exploits the multimodal capabilities of a pre-trained text-to-image diffusion model. Experimental results demonstrate that our method achieves state-of-the-art performance in both single and multi-object shadow generation settings.
- Score: 8.660813873416933
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Realistic shadow generation is crucial for achieving seamless image compositing, yet existing methods primarily focus on single-object insertion and often fail to generalize when multiple foreground objects are composited into a background scene. In practice, however, modern compositing pipelines and real-world applications often insert multiple objects simultaneously, necessitating shadows that are jointly consistent in terms of geometry, attachment, and location. In this paper, we address the under-explored problem of multi-object shadow generation, aiming to synthesize physically plausible shadows for multiple inserted objects. Our approach exploits the multimodal capabilities of a pre-trained text-to-image diffusion model. An image pathway injects dense, multi-scale features to provide fine-grained spatial guidance, while a text-based pathway encodes per-object shadow bounding boxes as learned positional tokens and fuses them via cross-attention. An attention-alignment loss further grounds these tokens to their corresponding shadow regions. To support this task, we augment the DESOBAv2 dataset by constructing composite scenes with multiple inserted objects and automatically derive prompts combining object category and shadow positioning information. Experimental results demonstrate that our method achieves state-of-the-art performance in both single and multi-object shadow generation settings.
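The abstract does not give the form of the attention-alignment loss. As a rough illustration only, one common way to ground a token to a box is to penalize the share of its cross-attention that falls outside that box. The sketch below assumes this formulation; the function names, the half-open box convention, and the per-token H x W attention maps are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

# Hypothetical box convention: (x0, y0, x1, y1) in pixels, half-open on x1/y1.
Box = Tuple[int, int, int, int]
Map2D = Sequence[Sequence[float]]


def box_mask(box: Box, height: int, width: int) -> List[List[float]]:
    """Binary mask that is 1.0 inside the box and 0.0 elsewhere."""
    x0, y0, x1, y1 = box
    return [
        [1.0 if (x0 <= x < x1 and y0 <= y < y1) else 0.0 for x in range(width)]
        for y in range(height)
    ]


def attention_alignment_loss(attn_maps: Sequence[Map2D], boxes: Sequence[Box]) -> float:
    """Mean over objects of (1 - fraction of attention inside the object's box).

    attn_maps[k] is an assumed non-negative H x W cross-attention map for the
    positional token of object k; boxes[k] is that object's shadow bounding box.
    This is an illustrative stand-in, not the loss defined in the paper.
    """
    if len(attn_maps) != len(boxes) or not boxes:
        raise ValueError("need one attention map per box, and at least one box")
    total = 0.0
    for attn, box in zip(attn_maps, boxes):
        height, width = len(attn), len(attn[0])
        mask = box_mask(box, height, width)
        mass = sum(sum(row) for row in attn)
        if mass <= 0.0:
            raise ValueError("attention map must have positive total mass")
        inside = sum(
            attn[y][x] * mask[y][x] for y in range(height) for x in range(width)
        )
        total += 1.0 - inside / mass
    return total / len(boxes)


if __name__ == "__main__":
    uniform = [[0.25, 0.25], [0.25, 0.25]]  # attention spread over the whole map
    focused = [[0.0, 0.0], [0.0, 1.0]]      # attention concentrated in one cell
    print(attention_alignment_loss([uniform, focused], [(0, 0, 1, 2), (1, 1, 2, 2)]))
```

Under this assumed definition, the loss is 0 when every token attends only inside its own box and approaches 1 as attention leaks outside, so minimizing it pushes each positional token toward its shadow region.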
Related papers
- Joint Shadow Generation and Relighting via Light-Geometry Interaction Maps [51.82696819319878]
We propose Light-Geometry Interaction (LGI) maps, a novel representation that encodes light-aware occlusion from monocular depth. LGI captures essential light-shadow interactions reliably and accurately, computed from off-the-shelf 2.5D depth map predictions. By embedding LGI into a bridge-matching generative backbone, we reduce ambiguity and enforce physically consistent light-shadow reasoning.
arXiv Detail & Related papers (2026-02-25T11:47:26Z)
- PLACID: Identity-Preserving Multi-Object Compositing via Video Diffusion with Synthetic Trajectories [22.63777279327245]
PLACID is a framework that transforms a collection of object images into an appealing multi-object composite. First, we leverage a pretrained image-to-video (I2V) diffusion model with text control to preserve object consistency, identities, and background details. Second, we propose a novel data curation strategy that generates synthetic sequences where randomly placed objects smoothly move to their target positions.
arXiv Detail & Related papers (2026-01-30T19:42:54Z)
- Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition [73.43121650616804]
We propose Qwen-Image-Layered, an end-to-end diffusion model that decomposes a single RGB image into multiple semantically disentangled RGBA layers. Our method significantly surpasses existing approaches in decomposition quality and establishes a new paradigm for consistent image editing.
arXiv Detail & Related papers (2025-12-17T17:12:42Z)
- FROMAT: Multiview Material Appearance Transfer via Few-Shot Self-Attention Adaptation [49.74776147964999]
We present a lightweight adaptation technique for appearance transfer in multiview diffusion models. Our method learns to combine object identity from an input image with appearance cues rendered in a separate reference image, producing multi-view-consistent output.
arXiv Detail & Related papers (2025-12-10T13:06:40Z)
- Controllable Shadow Generation with Single-Step Diffusion Models from Synthetic Data [7.380444448047908]
We introduce a novel method for fast, controllable, and background-free shadow generation for 2D object images. We create a large synthetic dataset using a 3D rendering engine to train a diffusion model for controllable shadow generation. We find that the rectified flow objective achieves high-quality results with just a single sampling step, enabling real-time applications.
arXiv Detail & Related papers (2024-12-16T16:55:22Z)
- Generative Image Layer Decomposition with Visual Effects [49.75021036203426]
LayerDecomp is a generative framework for image layer decomposition. It produces clean backgrounds and high-quality transparent foregrounds with faithfully preserved visual effects. Our method achieves superior quality in layer decomposition, outperforming existing approaches in object removal and spatial editing tasks.
arXiv Detail & Related papers (2024-11-26T20:26:49Z)
- Soft-Hard Attention U-Net Model and Benchmark Dataset for Multiscale Image Shadow Removal [2.999888908665659]
This study proposes a novel deep learning architecture, named Soft-Hard Attention U-net (SHAU), focusing on multiscale shadow removal.
It provides a novel synthetic dataset, named Multiscale Shadow Removal dataset (MSRD), containing complex shadow patterns of multiple scales.
The results demonstrate the effectiveness of SHAU over the relevant state-of-the-art shadow removal methods across various benchmark datasets.
arXiv Detail & Related papers (2024-08-07T12:42:06Z)
- SwinShadow: Shifted Window for Ambiguous Adjacent Shadow Detection [90.4751446041017]
We present SwinShadow, a transformer-based architecture that fully utilizes the powerful shifted window mechanism for detecting adjacent shadows.
The whole process can be divided into three parts: encoder, decoder, and feature integration.
Experiments on three shadow detection benchmark datasets, SBU, UCF, and ISTD, demonstrate that our network achieves good performance in terms of balance error rate (BER).
arXiv Detail & Related papers (2024-08-07T03:16:33Z)
- DESOBAv2: Towards Large-scale Real-world Dataset for Shadow Generation [19.376935979734714]
In this work, we focus on generating a plausible shadow for the inserted foreground object to make the composite image more realistic.
To supplement the existing small-scale dataset DESOBA, we create a large-scale dataset called DESOBAv2.
arXiv Detail & Related papers (2023-08-19T10:21:23Z)
- ObjectStitch: Generative Object Compositing [43.206123360578665]
We propose a self-supervised framework for object compositing using conditional diffusion models.
Our framework can transform the viewpoint, geometry, color and shadow of the generated object while requiring no manual labeling.
Our method outperforms relevant baselines in both realism and faithfulness of the synthesized result images in a user study on various real-world images.
arXiv Detail & Related papers (2022-12-02T02:15:13Z)
- IRISformer: Dense Vision Transformers for Single-Image Inverse Rendering in Indoor Scenes [99.76677232870192]
We show how a dense vision transformer, IRISformer, excels at both single-task and multi-task reasoning required for inverse rendering.
Specifically, we propose a transformer architecture to simultaneously estimate depths, normals, spatially-varying albedo, roughness and lighting from a single image of an indoor scene.
Our evaluations on benchmark datasets demonstrate state-of-the-art results on each of the above tasks, enabling applications like object insertion and material editing in a single unconstrained real image.
arXiv Detail & Related papers (2022-06-16T19:50:55Z)
- Deep Image Compositing [93.75358242750752]
We propose a new method which can automatically generate high-quality image composites without any user input.
Inspired by Laplacian pyramid blending, a densely connected multi-stream fusion network is proposed to effectively fuse the information from the foreground and background images.
Experiments show that the proposed method can automatically generate high-quality composites and outperforms existing methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2020-11-04T06:12:24Z)