Compass Control: Multi Object Orientation Control for Text-to-Image Generation
- URL: http://arxiv.org/abs/2504.06752v2
- Date: Thu, 10 Apr 2025 04:59:11 GMT
- Title: Compass Control: Multi Object Orientation Control for Text-to-Image Generation
- Authors: Rishubh Parihar, Vaibhav Agrawal, Sachidanand VS, R. Venkatesh Babu
- Abstract summary: Existing approaches for controlling text-to-image diffusion models, while powerful, do not allow for explicit 3D object-centric control. We address the problem of multi-object orientation control in text-to-image diffusion models. This enables the generation of diverse multi-object scenes with precise orientation control for each object.
- Score: 24.4172525865806
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing approaches for controlling text-to-image diffusion models, while powerful, do not allow for explicit 3D object-centric control, such as precise control of object orientation. In this work, we address the problem of multi-object orientation control in text-to-image diffusion models. This enables the generation of diverse multi-object scenes with precise orientation control for each object. The key idea is to condition the diffusion model with a set of orientation-aware \textbf{compass} tokens, one for each object, along with text tokens. A lightweight encoder network predicts these compass tokens, taking object orientation as input. The model is trained on a synthetic dataset of procedurally generated scenes, each containing one or two 3D assets on a plain background. However, directly training this framework results in poor orientation control and leads to entanglement among objects. To mitigate this, we intervene in the generation process and constrain the cross-attention maps of each compass token to its corresponding object regions. The trained model achieves precise orientation control for a) complex objects not seen during training and b) multi-object scenes with more than two objects, indicating strong generalization capabilities. Further, when combined with personalization methods, our method precisely controls the orientation of the new object in diverse contexts. Our method achieves state-of-the-art orientation control and text alignment, quantified with extensive evaluations and a user study.
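The two mechanisms in the abstract can be sketched in a few lines: a lightweight encoder that maps an orientation to a compass token, and a cross-attention constraint that restricts each token to its object's region. This is a minimal numpy illustration under assumed toy shapes; the MLP, dimensions, and function names are hypothetical, not the paper's implementation.

```python
import numpy as np

def compass_token(theta, W1, b1, W2, b2):
    """Encode an orientation angle (radians) as a token embedding.
    The angle is lifted to (sin, cos) so the encoding is continuous
    across the 0/2pi boundary, then passed through a tiny MLP."""
    x = np.array([np.sin(theta), np.cos(theta)])
    h = np.maximum(W1 @ x + b1, 0.0)   # ReLU hidden layer
    return W2 @ h + b2                 # token embedding

def constrain_attention(attn, object_mask):
    """Zero a compass token's cross-attention weights outside its
    object's region and renormalize, so the token only influences
    that object (the paper's attention intervention, in spirit)."""
    attn = attn * object_mask
    return attn / (attn.sum() + 1e-8)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 2)), np.zeros(16)   # toy encoder weights
W2, b2 = rng.normal(size=(8, 16)), np.zeros(8)

tok = compass_token(np.pi / 4, W1, b1, W2, b2)    # token for a 45-degree pose

attn = rng.random((8, 8))                         # 8x8 spatial attention map
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0                              # object occupies a 4x4 patch
attn = constrain_attention(attn, mask)
```

Continuity of the (sin, cos) lift matters here: a raw angle input would make 0 and 2π look maximally different to the encoder even though they denote the same orientation.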
Related papers
- CTRL-O: Language-Controllable Object-Centric Visual Representation Learning [30.218743514199016]
Object-centric representation learning aims to decompose visual scenes into fixed-size vectors called "slots" or "object files". Current object-centric models learn representations based on their preconceived understanding of objects, without allowing user input to guide which objects are represented. We propose a novel approach for user-directed control over slot representations by conditioning slots on language descriptions.
arXiv Detail & Related papers (2025-03-27T17:53:50Z) - Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models [79.96917782423219]
Orient Anything is the first expert and foundational model designed to estimate object orientation in a single image. By developing a pipeline to annotate the front face of 3D objects, we collect 2M images with precise orientation annotations. Our model achieves state-of-the-art orientation estimation accuracy in both rendered and real images.
arXiv Detail & Related papers (2024-12-24T18:58:43Z) - Customizing Text-to-Image Diffusion with Object Viewpoint Control [53.621518249820745]
We introduce a new task -- enabling explicit control of the object viewpoint in the customization of text-to-image diffusion models. This allows us to modify the custom object's properties and generate it in various background scenes via text prompts. We propose to condition the diffusion process on the 3D object features rendered from the target viewpoint.
arXiv Detail & Related papers (2024-04-18T16:59:51Z) - GRA: Detecting Oriented Objects through Group-wise Rotating and Attention [64.21917568525764]
Group-wise Rotating and Attention (GRA) module is proposed to replace the convolution operations in backbone networks for oriented object detection.
GRA can adaptively capture fine-grained features of objects with diverse orientations, comprising two key components: Group-wise Rotating and Group-wise Attention.
GRA achieves a new state-of-the-art (SOTA) on the DOTA-v2.0 benchmark, while saving the parameters by nearly 50% compared to the previous SOTA method.
arXiv Detail & Related papers (2024-03-17T07:29:32Z) - Fine-grained Controllable Video Generation via Object Appearance and Context [74.23066823064575]
We propose fine-grained controllable video generation (FACTOR) to achieve detailed control.
FACTOR aims to control objects' appearances and context, including their location and category.
Our method achieves controllability of object appearances without finetuning, reducing the per-subject optimization effort for users.
arXiv Detail & Related papers (2023-12-05T17:47:33Z) - SOOD: Towards Semi-Supervised Oriented Object Detection [57.05141794402972]
This paper proposes a novel Semi-supervised Oriented Object Detection model, termed SOOD, built upon the mainstream pseudo-labeling framework.
Our experiments show that when trained with the two proposed losses, SOOD surpasses the state-of-the-art SSOD methods under various settings on the DOTA-v1.5 benchmark.
arXiv Detail & Related papers (2023-04-10T11:10:42Z) - Multi-Projection Fusion and Refinement Network for Salient Object Detection in 360° Omnidirectional Image [141.10227079090419]
We propose a Multi-Projection Fusion and Refinement Network (MPFR-Net) to detect the salient objects in 360° omnidirectional images.
MPFR-Net uses the equirectangular projection image and four corresponding cube-unfolding images as inputs.
Experimental results on two omnidirectional datasets demonstrate that the proposed approach outperforms the state-of-the-art methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2022-12-23T14:50:40Z) - A Simple Baseline for Multi-Camera 3D Object Detection [94.63944826540491]
3D object detection with surrounding cameras has been a promising direction for autonomous driving.
We present SimMOD, a Simple baseline for Multi-camera Object Detection.
We conduct extensive experiments on the 3D object detection benchmark of nuScenes to demonstrate the effectiveness of SimMOD.
arXiv Detail & Related papers (2022-08-22T03:38:01Z) - Orienting Novel 3D Objects Using Self-Supervised Learning of Rotation Transforms [22.91890127146324]
Orienting objects is a critical component in the automation of many packing and assembly tasks.
We train a deep neural network to estimate the 3D rotation as parameterized by a quaternion.
We then use the trained network in a proportional controller to re-orient objects based on the estimated rotation between the two depth images.
arXiv Detail & Related papers (2021-05-29T08:22:55Z)
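The proportional re-orientation step in the last entry can be sketched as follows. This is a minimal numpy example under assumed conventions (unit quaternions in (w, x, y, z) order, a hypothetical `gain` parameter), not that paper's implementation: the rotation from the current pose to the estimated target pose is expressed as an error quaternion, converted to axis-angle, and a proportional fraction of it is commanded.

```python
import numpy as np

def quat_conj(q):
    """Conjugate of a unit quaternion (w, x, y, z)."""
    w, x, y, z = q
    return np.array([w, -x, -y, -z])

def quat_mul(a, b):
    """Hamilton product of two quaternions."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw*bw - ax*bx - ay*by - az*bz,
        aw*bx + ax*bw + ay*bz - az*by,
        aw*by - ax*bz + ay*bw + az*bx,
        aw*bz + ax*by - ay*bx + az*bw,
    ])

def proportional_reorient(q_current, q_target, gain=0.5):
    """One step of a proportional controller on orientations: compute
    the error rotation taking q_current to q_target, convert it to
    axis-angle, and command a fraction `gain` of that rotation."""
    q_err = quat_mul(q_target, quat_conj(q_current))
    if q_err[0] < 0:                       # pick the shorter rotation
        q_err = -q_err
    angle = 2.0 * np.arccos(np.clip(q_err[0], -1.0, 1.0))
    s = np.sqrt(max(1.0 - q_err[0] ** 2, 1e-12))
    axis = q_err[1:] / s if angle > 1e-8 else np.array([1.0, 0.0, 0.0])
    return axis, gain * angle              # commanded axis and angle
```

With `gain < 1`, repeating this step shrinks the remaining error geometrically, which is the usual trade-off between convergence speed and overshoot in a proportional loop.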
This list is automatically generated from the titles and abstracts of the papers in this site.