Proportion and Perspective Control for Flow-Based Image Generation
- URL: http://arxiv.org/abs/2510.21763v1
- Date: Mon, 13 Oct 2025 13:34:28 GMT
- Title: Proportion and Perspective Control for Flow-Based Image Generation
- Authors: Julien Boudier, Hugo Caselles-Dupré
- Abstract summary: We introduce and evaluate two ControlNets specialized for artistic control. A proportion ControlNet uses bounding boxes to dictate the position and scale of objects, and a perspective ControlNet employs vanishing lines to control the 3D geometry of the scene.
- Score: 1.9336815376402718
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While modern text-to-image diffusion models generate high-fidelity images, they offer limited control over the spatial and geometric structure of the output. To address this, we introduce and evaluate two ControlNets specialized for artistic control: (1) a proportion ControlNet that uses bounding boxes to dictate the position and scale of objects, and (2) a perspective ControlNet that employs vanishing lines to control the 3D geometry of the scene. We support the training of these modules with data pipelines that leverage vision-language models for annotation and specialized algorithms for conditioning image synthesis. Our experiments demonstrate that both modules provide effective control but exhibit limitations with complex constraints. Both models are released on HuggingFace: https://huggingface.co/obvious-research
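The abstract describes ControlNets conditioned on rendered hints (bounding boxes for proportion, vanishing lines for perspective) on top of a flow-based text-to-image backbone. The sketch below shows how such a checkpoint might be used with the Hugging Face diffusers ControlNet pipeline for flow models. It is a minimal, hypothetical example: the repository name obvious-research/proportion-controlnet, the FLUX.1-dev backbone, and the conditioning format (white box outlines on a black canvas) are assumptions for illustration, not details confirmed by the abstract; the actual model cards at https://huggingface.co/obvious-research should be checked.

```python
import torch
from PIL import Image, ImageDraw
from diffusers import FluxControlNetModel, FluxControlNetPipeline


def proportion_condition(boxes, size=(1024, 1024)):
    """Render bounding boxes as white outlines on a black canvas.

    `boxes` holds (x0, y0, x1, y1) pixel coordinates; the exact conditioning
    format expected by the released ControlNet is an assumption here.
    """
    canvas = Image.new("RGB", size, "black")
    draw = ImageDraw.Draw(canvas)
    for box in boxes:
        draw.rectangle(box, outline="white", width=4)
    return canvas


# Hypothetical checkpoint id; see https://huggingface.co/obvious-research
controlnet = FluxControlNetModel.from_pretrained(
    "obvious-research/proportion-controlnet", torch_dtype=torch.bfloat16
)
# Assumed flow-based backbone for the ControlNet.
pipe = FluxControlNetPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", controlnet=controlnet, torch_dtype=torch.bfloat16
).to("cuda")

# Two boxes: a large foreground subject and a smaller background object.
control_image = proportion_condition([(100, 600, 500, 1000), (620, 200, 980, 560)])
image = pipe(
    "a cat sitting in a meadow with a hot-air balloon in the sky",
    control_image=control_image,
    controlnet_conditioning_scale=0.7,
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("proportion_controlled.png")
```

The perspective ControlNet would presumably follow the same pattern, with the conditioning image showing vanishing lines (e.g., drawn with ImageDraw.line toward the chosen vanishing points) instead of box outlines.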
Related papers
- SpaceControl: Introducing Test-Time Spatial Control to 3D Generative Modeling [62.89824987879374]
We introduce SpaceControl, a training-free test-time method for explicit spatial control of 3D generation.
SpaceControl integrates seamlessly with modern pre-trained generative models without requiring any additional training.
We present an interactive user interface that enables online editing of superquadrics for direct conversion into textured 3D assets.
arXiv Detail & Related papers (2025-12-05T00:54:48Z) - A Two-Stage System for Layout-Controlled Image Generation using Large Language Models and Diffusion Models [0.0]
Text-to-image diffusion models exhibit remarkable generative capabilities, but lack precise control over object counts and spatial arrangements.
This work introduces a two-stage system to address these compositional limitations.
The first stage employs a Large Language Model (LLM) to generate a structured layout from a list of objects.
The second stage uses a layout-conditioned diffusion model to synthesize a photorealistic image adhering to this layout.
arXiv Detail & Related papers (2025-11-10T09:40:48Z) - DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation [63.781450025764904]
We propose DynamiCtrl, a novel framework for human animation in video DiT architecture.
We use a shared VAE encoder for human images and driving poses, unifying them into a common latent space.
We also introduce the "Joint-text" paradigm, which preserves the role of text embeddings to provide global semantic context.
arXiv Detail & Related papers (2025-03-27T08:07:45Z) - Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance [36.50036055679903]
Recent controllable generation approaches bring fine-grained spatial and appearance control to text-to-image (T2I) diffusion models without training auxiliary modules.
This work presents Ctrl-X, a simple framework for T2I diffusion controlling structure and appearance without additional training or guidance.
arXiv Detail & Related papers (2024-06-11T17:59:01Z) - FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition [41.92032568474062]
FreeControl is a training-free approach for controllable T2I generation.
It supports multiple conditions, architectures, and checkpoints simultaneously.
It achieves competitive synthesis quality with training-based approaches.
arXiv Detail & Related papers (2023-12-12T18:59:14Z) - ControlNet-XS: Rethinking the Control of Text-to-Image Diffusion Models as Feedback-Control Systems [19.02295657801464]
In this work, we take an existing controlling network (ControlNet) and change the communication between the controlling network and the generation process to be of high-frequency and with large-bandwidth.
We outperform state-of-the-art approaches for pixel-level guidance, such as depth, Canny edges, and semantic segmentation, and are on par for loose keypoint guidance of human poses.
All code and pre-trained models will be made publicly available.
arXiv Detail & Related papers (2023-12-11T17:58:06Z) - Fine-grained Controllable Video Generation via Object Appearance and Context [74.23066823064575]
We propose fine-grained controllable video generation (FACTOR) to achieve detailed control.
FACTOR aims to control objects' appearances and context, including their location and category.
Our method achieves controllability of object appearances without finetuning, which reduces per-subject optimization effort for users.
arXiv Detail & Related papers (2023-12-05T17:47:33Z) - Text2Control3D: Controllable 3D Avatar Generation in Neural Radiance Fields using Geometry-Guided Text-to-Image Diffusion Model [39.64952340472541]
We propose a text-to-3D avatar generation method with controllable facial expressions.
Our main strategy is to construct the 3D avatar in Neural Radiance Fields (NeRF) optimized with a set of controlled viewpoint-aware images.
We demonstrate the empirical results and discuss the effectiveness of our method.
arXiv Detail & Related papers (2023-09-07T08:14:46Z) - DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory [126.4597063554213]
DragNUWA is an open-domain diffusion-based video generation model.
It provides fine-grained control over video content from semantic, spatial, and temporal perspectives.
Our experiments validate the effectiveness of DragNUWA, demonstrating its superior performance in fine-grained control in video generation.
arXiv Detail & Related papers (2023-08-16T01:43:41Z) - Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models [82.19740045010435]
We introduce Uni-ControlNet, a unified framework that allows for the simultaneous utilization of different local controls and global controls.
Unlike existing methods, Uni-ControlNet only requires the fine-tuning of two additional adapters upon frozen pre-trained text-to-image diffusion models.
Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality and composability.
arXiv Detail & Related papers (2023-05-25T17:59:58Z) - UniControl: A Unified Diffusion Model for Controllable Visual Generation in the Wild [166.25327094261038]
We introduce UniControl, a new generative foundation model for controllable condition-to-image (C2I) tasks.
UniControl consolidates a wide array of C2I tasks within a singular framework, while still allowing for arbitrary language prompts.
Trained on nine unique C2I tasks, UniControl demonstrates impressive zero-shot generation abilities.
arXiv Detail & Related papers (2023-05-18T17:41:34Z) - Towards a Neural Graphics Pipeline for Controllable Image Generation [96.11791992084551]
We present Neural Graphics Pipeline (NGP), a hybrid generative model that brings together neural and traditional image formation models.
NGP decomposes the image into a set of interpretable appearance feature maps, uncovering direct control handles for controllable image generation.
We demonstrate the effectiveness of our approach on controllable image generation of single-object scenes.
arXiv Detail & Related papers (2020-06-18T14:22:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.