PLACID: Identity-Preserving Multi-Object Compositing via Video Diffusion with Synthetic Trajectories
- URL: http://arxiv.org/abs/2602.00267v1
- Date: Fri, 30 Jan 2026 19:42:54 GMT
- Title: PLACID: Identity-Preserving Multi-Object Compositing via Video Diffusion with Synthetic Trajectories
- Authors: Gemma Canet Tarrés, Manel Baradad, Francesc Moreno-Noguer, Yumeng Li,
- Abstract summary: PLACID is a framework that transforms a collection of object images into an appealing multi-object composite. First, we leverage a pretrained image-to-video (I2V) diffusion model with text control to preserve object consistency, identities, and background details. Second, we propose a novel data curation strategy that generates synthetic sequences where randomly placed objects smoothly move to their target positions.
- Score: 22.63777279327245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in generative AI have dramatically improved photorealistic image synthesis, yet they fall short for studio-level multi-object compositing. This task simultaneously demands (i) near-perfect preservation of each item's identity, (ii) precise background and color fidelity, (iii) control over layout and design elements, and (iv) complete, appealing displays showcasing all objects. However, current state-of-the-art models often alter object details, omit or duplicate objects, and produce layouts with incorrect relative sizing or inconsistent item presentations. To bridge this gap, we introduce PLACID, a framework that transforms a collection of object images into an appealing multi-object composite. Our approach makes two main contributions. First, we leverage a pretrained image-to-video (I2V) diffusion model with text control to preserve object consistency, identities, and background details by exploiting temporal priors from videos. Second, we propose a novel data curation strategy that generates synthetic sequences where randomly placed objects smoothly move to their target positions. This synthetic data aligns with the video model's temporal priors during training. At inference, objects initialized at random positions consistently converge into coherent layouts guided by text, with the final frame serving as the composite image. Extensive quantitative evaluations and user studies demonstrate that PLACID surpasses state-of-the-art methods in multi-object compositing, achieving superior identity, background, and color preservation, with fewer omitted objects and visually appealing results.
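The abstract describes the curation strategy only at a high level. As a rough illustration, the sketch below builds one synthetic training sequence under stated assumptions: linear motion from random start positions to the target positions and naive alpha compositing (the function name, frame count, and interpolation schedule are illustrative choices, not details from the paper).

```python
# Sketch of the synthetic-trajectory idea: randomly placed objects smoothly
# move to their target positions; the last frame is the composite layout.
# Assumed details: linear interpolation, naive alpha compositing.
import numpy as np
from PIL import Image

def make_trajectory_sequence(background, objects, targets, num_frames=16, seed=0):
    """background: RGBA PIL image; objects: list of RGBA PIL crops;
    targets: list of (x, y) top-left target positions."""
    rng = np.random.default_rng(seed)
    W, H = background.size
    # One random start position per object, clamped so the crop fits.
    starts = [(int(rng.integers(0, max(1, W - o.width))),
               int(rng.integers(0, max(1, H - o.height)))) for o in objects]
    frames = []
    for t in range(num_frames):
        a = t / (num_frames - 1)              # 0 at the start, 1 at the target
        frame = background.copy()
        for obj, (sx, sy), (tx, ty) in zip(objects, starts, targets):
            x = round((1 - a) * sx + a * tx)  # linear motion toward the target
            y = round((1 - a) * sy + a * ty)
            frame.alpha_composite(obj, dest=(x, y))
        frames.append(frame)
    return frames  # frames[-1] serves as the target composite
```

Training a text-controlled I2V model on such sequences is what, per the abstract, aligns the compositing task with the model's temporal priors; at inference the final generated frame is taken as the composite image.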
Related papers
- Learning Object-Centric Representations Based on Slots in Real World Scenarios [5.922488908114023]
This thesis introduces a framework that adapts powerful pretrained diffusion models for object-centric synthesis. We identify a core challenge: balancing global scene coherence with disentangled object control. Our method integrates lightweight, slot-based conditioning into pretrained models, preserving their visual priors while providing object-specific manipulation.
arXiv Detail & Related papers (2025-09-29T12:01:49Z)
- ComposeAnything: Composite Object Priors for Text-to-Image Generation [72.98469853839246]
ComposeAnything is a novel framework for improving compositional image generation without retraining existing T2I models. Our approach first leverages the chain-of-thought reasoning abilities of LLMs to produce 2.5D semantic layouts from text. Our model generates high-quality images with compositions that faithfully reflect the text.
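The entry does not spell out what a "2.5D semantic layout" contains. One plausible reading, shown here purely as an assumption rather than ComposeAnything's actual format, is a set of labeled 2D boxes augmented with a per-object depth so objects can be composed back to front:

```python
# Hypothetical 2.5D semantic layout: labeled 2D boxes plus a depth value.
# Illustrative data structure only, not ComposeAnything's actual format.
from dataclasses import dataclass

@dataclass
class LayoutObject:
    label: str                               # e.g. "a red ceramic mug"
    box: tuple[float, float, float, float]   # (x0, y0, x1, y1), normalized
    depth: float                             # smaller = closer to the camera

def paint_order(layout: list[LayoutObject]) -> list[LayoutObject]:
    # Draw far objects first so nearer objects correctly occlude them.
    return sorted(layout, key=lambda o: o.depth, reverse=True)
```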
arXiv Detail & Related papers (2025-05-30T00:13:36Z)
- HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation [50.206100327643284]
HiScene is a novel hierarchical framework that bridges the gap between 2D image generation and 3D object generation. We generate 3D content that aligns with 2D representations while maintaining compositional structure.
arXiv Detail & Related papers (2025-04-17T16:33:39Z)
- IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation [40.34581973675213]
IMPRINT is a novel diffusion-based generative model trained with a two-stage learning framework.
The first stage performs context-agnostic, identity-preserving pretraining of the object encoder.
The second stage leverages this representation to learn seamless harmonization of the object composited onto the background.
arXiv Detail & Related papers (2024-03-15T21:37:04Z)
- LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts [60.54912319612113]
Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts.
We present a novel approach leveraging Large Language Models (LLMs) to extract critical components from text prompts.
Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models.
arXiv Detail & Related papers (2023-10-16T17:57:37Z)
- Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z)
- ObjectStitch: Generative Object Compositing [43.206123360578665]
We propose a self-supervised framework for object compositing using conditional diffusion models.
Our framework can transform the viewpoint, geometry, color and shadow of the generated object while requiring no manual labeling.
Our method outperforms relevant baselines in both realism and faithfulness of the synthesized result images in a user study on various real-world images.
arXiv Detail & Related papers (2022-12-02T02:15:13Z)
- Object-aware Contrastive Learning for Debiased Scene Representation [74.30741492814327]
We develop a novel object-aware contrastive learning framework that localizes objects in a self-supervised manner.
We also introduce two data augmentations based on ContraCAM, object-aware random crop and background mixup, which reduce contextual and background biases during contrastive self-supervised learning.
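As a rough illustration of the background-mixup augmentation, the sketch below assumes a soft object mask in [0, 1] (e.g. derived from ContraCAM) and swaps in another image's pixels outside the object; the paper's exact mixing rule is not reproduced here.

```python
# Sketch of background mixup: keep the object's pixels, replace the background
# with pixels from another image. Assumes a soft object mask, e.g. ContraCAM's.
import numpy as np

def background_mixup(img: np.ndarray, other: np.ndarray, obj_mask: np.ndarray):
    """img, other: (H, W, 3) float arrays in [0, 1]; obj_mask: (H, W) in [0, 1]."""
    m = obj_mask[..., None]             # broadcast the mask over RGB channels
    return m * img + (1.0 - m) * other  # object kept, background swapped
```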
arXiv Detail & Related papers (2021-07-30T19:24:07Z)
- Object-Centric Image Generation from Layouts [93.10217725729468]
We develop a layout-to-image generation method to generate complex scenes with multiple objects.
Our method learns representations of the spatial relationships between objects in the scene, which lead to improved layout fidelity.
We introduce SceneFID, an object-centric adaptation of the popular Fréchet Inception Distance metric, which is better suited for multi-object images; a rough sketch follows this entry.
arXiv Detail & Related papers (2020-03-16T21:40:09Z)
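SceneFID's exact protocol is not given in this entry. A plausible reading, sketched below as an assumption, is that the standard Fréchet distance is computed over features of per-object crops (taken from layout boxes) rather than over whole-image features.

```python
# Sketch of an object-centric FID: apply the standard Frechet distance to
# features of object crops instead of whole images. The feature extractor
# (typically an Inception network) and the crop protocol are assumptions.
import numpy as np
from scipy import linalg

def crop_objects(image: np.ndarray, boxes) -> list:
    """image: (H, W, 3) array; boxes: iterable of (x0, y0, x1, y1) pixel boxes."""
    return [image[y0:y1, x0:x1] for (x0, y0, x1, y1) in boxes]

def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """feats_*: (N, D) feature arrays computed from the object crops."""
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(c1 @ c2)     # matrix square root of the product
    if np.iscomplexobj(covmean):        # drop tiny imaginary numerical residue
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(c1 + c2 - 2.0 * covmean))
```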