Stitch: Training-Free Position Control in Multimodal Diffusion Transformers
- URL: http://arxiv.org/abs/2509.26644v1
- Date: Tue, 30 Sep 2025 17:59:51 GMT
- Title: Stitch: Training-Free Position Control in Multimodal Diffusion Transformers
- Authors: Jessica Bader, Mateusz Pach, Maria A. Bravo, Serge Belongie, Zeynep Akata
- Abstract summary: Text-to-Image (T2I) generation models have advanced rapidly in recent years, but capturing spatial relationships poses a persistent challenge. We propose Stitch, a training-free method for incorporating external position control into Multi-Modal Diffusion Transformers (MMDiT) via automatically-generated bounding boxes. We find that targeted attention heads capture the information necessary to isolate and cut out individual objects mid-generation, without needing to fully complete the image.
- Score: 42.17131488826851
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-Image (T2I) generation models have advanced rapidly in recent years, but accurately capturing spatial relationships like "above" or "to the right of" poses a persistent challenge. Earlier methods improved spatial relationship following with external position control. However, as architectures evolved to enhance image quality, these techniques became incompatible with modern models. We propose Stitch, a training-free method for incorporating external position control into Multi-Modal Diffusion Transformers (MMDiT) via automatically-generated bounding boxes. Stitch produces images that are both spatially accurate and visually appealing by generating individual objects within designated bounding boxes and seamlessly stitching them together. We find that targeted attention heads capture the information necessary to isolate and cut out individual objects mid-generation, without needing to fully complete the image. We evaluate Stitch on PosEval, our benchmark for position-based T2I generation. Featuring five new tasks that extend the concept of Position beyond the basic GenEval task, PosEval demonstrates that even top models still have significant room for improvement in position-based generation. Tested on Qwen-Image, FLUX, and SD3.5, Stitch consistently enhances base models, even improving FLUX by 218% on GenEval's Position task and by 206% on PosEval. Stitch achieves state-of-the-art results with Qwen-Image on PosEval, improving over previous models by 54%, all accomplished while integrating position control into leading models training-free. Code is available at https://github.com/ExplainableML/Stitch.
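The "generate in boxes, then stitch" idea from the abstract can be illustrated with a small, hedged sketch. This is not the authors' implementation: the denoiser below is a random stand-in for an MMDiT model, and the function names (`stitch_generate`, `bbox_mask`, `fake_denoise_step`), latent sizes, and step counts are illustrative assumptions. Each object is partially denoised on its own, cut out with a mask restricted to its bounding box (in Stitch, that mask comes from targeted attention heads), pasted into a shared latent, and then denoised jointly to blend the composition.

```python
# Hedged conceptual sketch of box-guided generation followed by latent stitching.
# NOT the authors' implementation: the denoiser is a random stand-in for an
# MMDiT model, and all names and step counts are illustrative assumptions.
import torch

def bbox_mask(h, w, box):
    """Binary spatial mask for a box given as (x0, y0, x1, y1) in [0, 1]."""
    m = torch.zeros(1, 1, h, w)
    x0, y0, x1, y1 = box
    m[..., int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = 1.0
    return m

def fake_denoise_step(latent, prompt):
    """Stand-in for one prompt-conditioned MMDiT denoising step."""
    return latent - 0.1 * torch.randn_like(latent)  # prompt unused in this toy

def stitch_generate(object_prompts, boxes, scene_prompt,
                    h=64, w=64, c=16, object_steps=10, joint_steps=20):
    object_latents, masks = [], []
    # 1) Partially denoise each object on its own -- no need to finish it.
    for prompt, box in zip(object_prompts, boxes):
        latent = torch.randn(1, c, h, w)
        for _ in range(object_steps):
            latent = fake_denoise_step(latent, prompt)
        # In Stitch, the cut-out mask comes from targeted attention heads;
        # here a plain bounding-box mask stands in for it.
        masks.append(bbox_mask(h, w, box))
        object_latents.append(latent)
    # 2) Stitch: paste each partially generated object into a shared latent.
    scene = torch.randn(1, c, h, w)
    for latent, m in zip(object_latents, masks):
        scene = m * latent + (1 - m) * scene
    # 3) Finish denoising jointly so the composition blends seamlessly.
    for _ in range(joint_steps):
        scene = fake_denoise_step(scene, scene_prompt)
    return scene

latents = stitch_generate(
    ["a red cube", "a blue ball"],
    [(0.05, 0.30, 0.45, 0.70), (0.55, 0.30, 0.95, 0.70)],
    "a red cube to the left of a blue ball",
)
print(latents.shape)  # torch.Size([1, 16, 64, 64])
```

In the actual method the bounding boxes are produced automatically rather than supplied by hand, and because the object cut-outs rely on attention maps the base model already computes, no additional training is required.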
Related papers
- GPA-VGGT:Adapting VGGT to Large Scale Localization by Self-Supervised Learning with Geometry and Physics Aware Loss [15.633839321933385]
Recent advancements in Visual Geometry Grounded Transformer (VGGT) models have shown great promise in camera pose estimation and 3D reconstruction. These models typically rely on ground truth labels for training, posing challenges when adapting to unlabeled and unseen scenes. We propose a self-supervised framework to train VGGT with unlabeled data, thereby enhancing its localization capability in large-scale environments.
arXiv Detail & Related papers (2026-01-23T16:46:59Z) - DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training [76.82789568988557]
DiT360 is a DiT-based framework that performs hybrid training on perspective and panoramic data for panoramic image generation. Our method achieves better boundary consistency and image fidelity across eleven quantitative metrics.
arXiv Detail & Related papers (2025-10-13T17:59:15Z) - UP2You: Fast Reconstruction of Yourself from Unconstrained Photo Collections [21.55668740343458]
UP2You is a tuning-free solution for reconstructing high-fidelity 3D portraits from unconstrained in-the-wild 2D photos. Central to UP2You is a pose-correlated feature aggregation module. Experiments on 4D-Dress, PuzzleIOI, and in-the-wild captures demonstrate that UP2You consistently surpasses previous methods in both geometric accuracy and texture fidelity.
arXiv Detail & Related papers (2025-09-29T14:06:00Z) - PoseDiff: A Unified Diffusion Model Bridging Robot Pose Estimation and Video-to-Action Control [67.17998939712326]
We present PoseDiff, a conditional diffusion model that unifies robot state estimation and control within a single framework. At its core, PoseDiff maps raw visual observations into structured robot states, such as 3D keypoints or joint angles, from a single RGB image. Building upon this foundation, PoseDiff extends naturally to video-to-action inverse dynamics.
arXiv Detail & Related papers (2025-09-29T10:55:48Z) - SPFSplatV2: Efficient Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views [18.814209805277503]
We present SPFSplatV2, an efficient feed-forward framework for 3D Gaussian splatting from sparse multi-view images. The method achieves state-of-the-art performance in both in-domain and out-of-domain novel view synthesis.
arXiv Detail & Related papers (2025-09-21T21:37:56Z) - Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation [32.190055780969466]
Stable-Pose is a novel adapter model that introduces a coarse-to-fine attention masking strategy into a vision Transformer.
We leverage the query-key self-attention mechanism of ViTs to explore the interconnections among different anatomical parts in human pose skeletons.
Stable-Pose achieved an AP score of 57.1 on the LAION-Human dataset, around a 13% improvement over the established ControlNet technique.
arXiv Detail & Related papers (2024-06-04T16:54:28Z) - DivCon: Divide and Conquer for Complex Numerical and Spatial Reasoning in Text-to-Image Generation [0.0]
Diffusion-driven text-to-image (T2I) generation has achieved remarkable advancements in recent years. Layout is employed as an intermediary to bridge large language models and layout-based diffusion models. We introduce a divide-and-conquer approach which decouples the generation task into multiple subtasks.
arXiv Detail & Related papers (2024-03-11T03:24:44Z) - Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation [147.81509219686419]
We propose a diagnostic benchmark for layout-guided image generation that examines four categories of spatial control skills: number, position, size, and shape.
Next, we propose IterInpaint, a new baseline that generates foreground and background regions step-by-step via inpainting.
We show comprehensive ablation studies on IterInpaint, including training task ratio, crop&paste vs. repaint, and generation order.
arXiv Detail & Related papers (2023-04-13T16:58:33Z) - PoseMatcher: One-shot 6D Object Pose Estimation by Deep Feature Matching [51.142988196855484]
We propose PoseMatcher, an accurate, model-free, one-shot object pose estimator.
We create a new training pipeline for object to image matching based on a three-view system.
To enable PoseMatcher to attend to distinct input modalities, an image and a point cloud, we introduce IO-Layer.
arXiv Detail & Related papers (2023-04-03T21:14:59Z) - CheckerPose: Progressive Dense Keypoint Localization for Object Pose Estimation with Graph Neural Network [66.24726878647543]
Estimating the 6-DoF pose of a rigid object from a single RGB image is a crucial yet challenging task.
Recent studies have shown the great potential of dense correspondence-based solutions.
We propose a novel pose estimation algorithm named CheckerPose, which improves on three main aspects.
arXiv Detail & Related papers (2023-03-29T17:30:53Z) - Modeling Image Composition for Complex Scene Generation [77.10533862854706]
We present a method that achieves state-of-the-art results on layout-to-image generation tasks.
After compressing RGB images into patch tokens, we propose the Transformer with Focal Attention (TwFA) to explore object-to-object, object-to-patch, and patch-to-patch dependencies.
arXiv Detail & Related papers (2022-06-02T08:34:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.