Composing Parts for Expressive Object Generation
- URL: http://arxiv.org/abs/2406.10197v2
- Date: Sun, 29 Jun 2025 17:42:12 GMT
- Title: Composing Parts for Expressive Object Generation
- Authors: Harsh Rangwani, Aishwarya Agarwal, Kuldeep Kulkarni, R. Venkatesh Babu, Srikrishna Karanam
- Abstract summary: We introduce PartComposer, a training-free method that enables image generation based on fine-grained part-level attributes. PartComposer localizes object parts by denoising the object region from a specific diffusion process. We run a localized diffusion process in each part region based on fine-grained part attributes and combine them to produce the final image.
- Score: 37.791770942390485
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image composition and generation are processes where the artists need control over various parts of the generated images. However, the current state-of-the-art generation models, like Stable Diffusion, cannot handle fine-grained part-level attributes in the text prompts. Specifically, when additional attribute details are added to the base text prompt, these text-to-image models either generate an image vastly different from the image generated from the base prompt or ignore the attribute details. To mitigate these issues, we introduce PartComposer, a training-free method that enables image generation based on fine-grained part-level attributes specified for objects in the base text prompt. This allows more control for artists and enables novel object compositions by combining distinctive object parts. PartComposer first localizes object parts by denoising the object region from a specific diffusion process. This enables each part token to be localized to the right region. After obtaining part masks, we run a localized diffusion process in each part region based on fine-grained part attributes and combine them to produce the final image. All stages of PartComposer are based on repurposing a pre-trained diffusion model, which enables it to generalize across domains. We demonstrate the effectiveness of part-level control provided by PartComposer through qualitative visual examples and quantitative comparisons with contemporary baselines.
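The abstract describes a two-stage, training-free pipeline: first localize each part token to a mask within the object region, then run a localized diffusion process per part and combine the results into one image. Below is a minimal, illustrative sketch of that control flow in PyTorch. The placeholder functions `denoise_step` and `attention_part_masks`, the mask-weighted latent composition, and all shapes and prompts are assumptions made for illustration, not the authors' released implementation.

```python
# Minimal sketch of a PartComposer-style pipeline (assumed structure, not the paper's code).
import torch

H = W = 64          # assumed latent resolution of a pre-trained diffusion model
NUM_STEPS = 50      # assumed number of denoising steps

def denoise_step(latent: torch.Tensor, prompt: str, t: int) -> torch.Tensor:
    """Placeholder for one reverse-diffusion step of a pre-trained text-to-image
    model conditioned on `prompt` (e.g., a Stable Diffusion UNet plus scheduler)."""
    return latent - 0.01 * torch.randn_like(latent)

def attention_part_masks(latent: torch.Tensor, part_tokens: list) -> dict:
    """Placeholder for stage 1: localize each part token to a binary mask over the
    object region (the paper derives these from the denoising process itself)."""
    return {tok: (torch.rand(1, 1, H, W) > 0.5).float() for tok in part_tokens}

base_prompt = "a photo of a bird"
part_prompts = {"head": "a red head", "wings": "blue wings"}

latent = torch.randn(1, 4, H, W)

# Stage 1: localize object parts under the base prompt.
masks = attention_part_masks(latent, list(part_prompts))

# Stage 2: run a localized diffusion process per part and compose the latents.
for t in reversed(range(NUM_STEPS)):
    base = denoise_step(latent, base_prompt, t)
    composed = base.clone()
    for part, attr_prompt in part_prompts.items():
        local = denoise_step(latent, f"{base_prompt}, {attr_prompt}", t)
        m = masks[part]
        composed = m * local + (1 - m) * composed   # keep each part's edit inside its mask
    latent = composed
```

The key design point sketched here is that the fine-grained attribute prompt only influences the latent inside its part mask, while the base prompt governs everything else, which is how the method can add part-level detail without drifting away from the base-prompt image.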
Related papers
- PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models [63.1432721793683]
We introduce PartGen, a novel approach that generates 3D objects composed of meaningful parts starting from text, an image, or an unstructured 3D object.
We evaluate our method on generated and real 3D assets and show that it outperforms segmentation and part-extraction baselines by a large margin.
arXiv Detail & Related papers (2024-12-24T18:59:43Z) - DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting [56.77074226109392]
We propose DreamMix, a diffusion-based framework adept at inserting target objects into user-specified regions. We show that DreamMix achieves a superior balance between identity preservation and attribute editability across diverse applications.
arXiv Detail & Related papers (2024-11-26T08:44:47Z) - Disentangling Regional Primitives for Image Generation [62.230722004629314]
This paper explains a neural network for image generation from a new perspective. We propose a set of desirable properties to define the representation structure of a neural network for image generation.
arXiv Detail & Related papers (2024-10-06T09:27:45Z) - PartCraft: Crafting Creative Objects by Parts [128.30514851911218]
This paper propels creative control in generative visual AI by allowing users to "select" visual concepts by parts.
For the first time, users can choose visual concepts at the part level for their creative endeavors.
This enables fine-grained generation that precisely captures the selected visual concepts.
arXiv Detail & Related papers (2024-07-05T15:53:04Z) - Compositional Image Decomposition with Diffusion Models [70.07406583580591]
In this paper, we present a method to decompose an image into such compositional components.
Our approach, Decomp Diffusion, is an unsupervised method which infers a set of different components in the image.
We demonstrate how components can capture different factors of the scene, ranging from global scene descriptors like shadows or facial expression to local scene descriptors like constituent objects.
arXiv Detail & Related papers (2024-06-27T16:13:34Z) - ViFu: Multiple 360$^\circ$ Objects Reconstruction with Clean Background via Visible Part Fusion [7.8788463395442045]
We propose a method to segment and recover a static, clean background and multiple 360$^\circ$ objects from observations of scenes at different timestamps.
Our basic idea is that, by observing the same set of objects in various arrangements, parts that are invisible in one scene may become visible in others.
arXiv Detail & Related papers (2024-04-15T02:44:23Z) - PrimeComposer: Faster Progressively Combined Diffusion for Image Composition with Attention Steering [13.785484396436367]
We formulate image composition as a subject-based local editing task, solely focusing on foreground generation.
We propose PrimeComposer, a faster training-free diffuser that composites the images by well-designed attention steering across different noise levels.
Our method exhibits the fastest inference efficiency, and extensive experiments demonstrate its superiority both qualitatively and quantitatively.
arXiv Detail & Related papers (2024-03-08T04:58:49Z) - Divide and Conquer: Language Models can Plan and Self-Correct for Compositional Text-to-Image Generation [72.6168579583414]
CompAgent is a training-free approach for compositional text-to-image generation with a large language model (LLM) agent as its core.
Our approach achieves more than 10% improvement on T2I-CompBench, a comprehensive benchmark for open-world compositional T2I generation.
arXiv Detail & Related papers (2024-01-28T16:18:39Z) - Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by an open vocabulary to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z) - LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts [60.54912319612113]
Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts.
We present a novel approach leveraging Large Language Models (LLMs) to extract critical components from text prompts.
Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models.
arXiv Detail & Related papers (2023-10-16T17:57:37Z) - Zero-shot spatial layout conditioning for text-to-image diffusion models [52.24744018240424]
Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling.
We consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content.
We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models.
arXiv Detail & Related papers (2023-06-23T19:24:48Z) - SIEDOB: Semantic Image Editing by Disentangling Object and Background [5.149242555705579]
We propose a novel paradigm for semantic image editing, SIEDOB, the core idea of which is to explicitly leverage several heterogeneous subnetworks for objects and backgrounds.
We conduct extensive experiments on Cityscapes and ADE20K-Room datasets and exhibit that our method remarkably outperforms the baselines.
arXiv Detail & Related papers (2023-03-23T06:17:23Z) - DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-aware Scene Synthesis [90.32352050266104]
DisCoScene is a 3D-aware generative model for high-quality and controllable scene synthesis.
It disentangles the whole scene into object-centric generative fields by learning on only 2D images with global-local discrimination.
We demonstrate state-of-the-art performance on many scene datasets, including the challenging outdoor dataset.
arXiv Detail & Related papers (2022-12-22T18:59:59Z) - SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z) - gCoRF: Generative Compositional Radiance Fields [80.45269080324677]
3D generative models of objects enable photorealistic image synthesis with 3D control.
Existing methods model the scene as a global scene representation, ignoring the compositional aspect of the scene.
We present a compositional generative model, where each semantic part of the object is represented as an independent 3D representation.
arXiv Detail & Related papers (2022-10-31T14:10:44Z) - Blended Diffusion for Text-driven Editing of Natural Images [18.664733153082146]
We introduce the first solution for performing local (region-based) edits in generic natural images.
We achieve our goal by leveraging and combining a pretrained language-image model (CLIP) with a denoising diffusion probabilistic model (DDPM).
To seamlessly fuse the edited region with the unchanged parts of the image, we spatially blend noised versions of the input image with the local text-guided diffusion latent (see the sketch after this list).
arXiv Detail & Related papers (2021-11-29T18:58:49Z) - GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields [45.21191307444531]
Deep generative models allow for photorealistic image synthesis at high resolutions.
But for many applications, this is not enough: content creation also needs to be controllable.
Our key hypothesis is that incorporating a compositional 3D scene representation into the generative model leads to more controllable image synthesis.
arXiv Detail & Related papers (2020-11-24T14:14:15Z) - Integrating Image Captioning with Rule-based Entity Masking [23.79124007406315]
We propose a novel framework for the image captioning with an explicit object (e.g., knowledge graph entity) selection process.
The model first explicitly selects which local entities to include in the caption according to a human-interpretable mask, then generates proper captions by attending to the selected entities.
arXiv Detail & Related papers (2020-07-22T21:27:12Z)
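Several entries above, most explicitly Blended Diffusion, rely on spatially blending a text-guided diffusion latent with a noised copy of the input image so that edits stay inside a mask while the background is preserved. A minimal sketch of that blend step follows; the placeholder `ddpm_step` and `add_noise` functions, the toy linear noising, and the hard-coded mask are assumptions for illustration, not the paper's actual CLIP-guided DDPM.

```python
# Sketch of mask-based spatial blending during denoising (assumed structure).
import torch

def ddpm_step(x_t: torch.Tensor, text: str, t: int) -> torch.Tensor:
    """Placeholder for one text-guided DDPM denoising step toward `text`."""
    return x_t - 0.01 * torch.randn_like(x_t)

def add_noise(x0: torch.Tensor, t: int, num_steps: int) -> torch.Tensor:
    """Placeholder forward process: noise the clean input to the level of step t."""
    alpha = t / num_steps
    return (1 - alpha) * x0 + alpha * torch.randn_like(x0)

num_steps = 100
image = torch.rand(1, 3, 256, 256)          # input image in [0, 1]
mask = torch.zeros(1, 1, 256, 256)
mask[..., 64:192, 64:192] = 1.0             # region to edit

x_t = torch.randn_like(image)
for t in reversed(range(num_steps)):
    x_fg = ddpm_step(x_t, "a red balloon", t)   # text-guided update for the edit region
    x_bg = add_noise(image, t, num_steps)       # background noised to the matching level
    x_t = mask * x_fg + (1 - mask) * x_bg       # spatial blend keeps the background intact
```

Because the background is re-noised to the same level as the edited region at every step, the final sample converges to the original pixels outside the mask while the text guidance only shapes the masked area.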