Ctrl&Shift: High-Quality Geometry-Aware Object Manipulation in Visual Generation
- URL: http://arxiv.org/abs/2602.11440v1
- Date: Wed, 11 Feb 2026 23:36:30 GMT
- Title: Ctrl&Shift: High-Quality Geometry-Aware Object Manipulation in Visual Generation
- Authors: Penghui Ruan, Bojia Zi, Xianbiao Qi, Youze Huang, Rong Xiao, Pichao Wang, Jiannong Cao, Yuhui Shi
- Abstract summary: We present Ctrl&Shift, an end-to-end diffusion framework to achieve geometry-consistent object manipulation without explicit 3D representations. Our key insight is to decompose manipulation into two stages, object removal and reference-guided inpainting under explicit camera pose control, and encode both within a unified diffusion process. To our knowledge, this is the first framework to unify fine-grained geometric control and real-world generalization for object manipulation, without relying on any explicit 3D modeling.
- Score: 34.92056161129864
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Object-level manipulation, relocating or reorienting objects in images or videos while preserving scene realism, is central to film post-production, AR, and creative editing. Yet existing methods struggle to jointly achieve three core goals: background preservation, geometric consistency under viewpoint shifts, and user-controllable transformations. Geometry-based approaches offer precise control but require explicit 3D reconstruction and generalize poorly; diffusion-based methods generalize better but lack fine-grained geometric control. We present Ctrl&Shift, an end-to-end diffusion framework to achieve geometry-consistent object manipulation without explicit 3D representations. Our key insight is to decompose manipulation into two stages, object removal and reference-guided inpainting under explicit camera pose control, and encode both within a unified diffusion process. To enable precise, disentangled control, we design a multi-task, multi-stage training strategy that separates background, identity, and pose signals across tasks. To improve generalization, we introduce a scalable real-world dataset construction pipeline that generates paired image and video samples with estimated relative camera poses. Extensive experiments demonstrate that Ctrl&Shift achieves state-of-the-art results in fidelity, viewpoint consistency, and controllability. To our knowledge, this is the first framework to unify fine-grained geometric control and real-world generalization for object manipulation, without relying on any explicit 3D modeling.
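To make the two-stage decomposition concrete, here is a minimal Python sketch of the removal-then-inpainting flow. All function names and the placeholder compositing are hypothetical illustrations of the idea, not the paper's actual API or model.

```python
# Hypothetical sketch of the two-stage decomposition described in the
# abstract: (1) object removal, (2) reference-guided inpainting under
# explicit camera pose control. Names and stub bodies are illustrative only.
import numpy as np

def relative_pose(T_src: np.ndarray, T_tgt: np.ndarray) -> np.ndarray:
    """Relative camera pose taking the source view to the target view.
    Both inputs are 4x4 world-to-camera extrinsics."""
    return T_tgt @ np.linalg.inv(T_src)

def remove_object(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Stage 1 stub: erase the masked object, leaving a hole to inpaint."""
    return image * (1.0 - mask[..., None])

def inpaint_with_pose(background: np.ndarray, reference: np.ndarray,
                      mask: np.ndarray, T_rel: np.ndarray) -> np.ndarray:
    """Stage 2 stub: a real system would run a diffusion model conditioned
    on the background, the object reference, and T_rel (unused here)."""
    return background + reference * mask[..., None]  # placeholder composite

# Toy usage with random data standing in for real inputs.
img = np.random.rand(64, 64, 3)
obj_mask = np.zeros((64, 64))
obj_mask[20:40, 20:40] = 1.0
T_src, T_tgt = np.eye(4), np.eye(4)
edited = inpaint_with_pose(remove_object(img, obj_mask), img, obj_mask,
                           relative_pose(T_src, T_tgt))
```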
Related papers
- GPA-VGGT: Adapting VGGT to Large Scale Localization by Self-Supervised Learning with Geometry and Physics Aware Loss [15.633839321933385]
Recent advancements in Visual Geometry Grounded Transformer (VGGT) models have shown great promise in camera pose estimation and 3D reconstruction. These models typically rely on ground-truth labels for training, posing challenges when adapting to unlabeled and unseen scenes. We propose a self-supervised framework to train VGGT with unlabeled data, thereby enhancing its localization capability in large-scale environments.
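The summary leaves the loss unspecified; one standard label-free geometric consistency signal, shown below purely as an assumption and not as GPA-VGGT's actual objective, is that relative poses predicted around a loop of views should compose to the identity.

```python
# Hypothetical label-free geometric consistency signal: relative poses
# predicted around a loop of views should compose to the identity.
import numpy as np

def loop_consistency_loss(T_ab: np.ndarray, T_bc: np.ndarray,
                          T_ca: np.ndarray) -> float:
    """All inputs are 4x4 relative camera poses; a perfect estimator
    satisfies T_ca @ T_bc @ T_ab == I."""
    loop = T_ca @ T_bc @ T_ab
    return float(np.linalg.norm(loop - np.eye(4)))

# Toy check: consistent poses give (near-)zero loss.
T_ab = np.eye(4); T_ab[:3, 3] = [1.0, 0.0, 0.0]
T_bc = np.eye(4); T_bc[:3, 3] = [0.0, 1.0, 0.0]
T_ca = np.linalg.inv(T_bc @ T_ab)
print(loop_consistency_loss(T_ab, T_bc, T_ca))  # ~0.0
```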
arXiv Detail & Related papers (2026-01-23T16:46:59Z)
- POCI-Diff: Position Objects Consistently and Interactively with 3D-Layout Guided Diffusion [46.97254555348757]
We propose a diffusion-based approach for Text-to-Image (T2I) generation with consistent and interactive 3D layout control and editing. We introduce a framework for Positioning Objects Consistently and Interactively (POCI-Diff). Our method enables explicit per-object semantic control by binding individual text descriptions to specific 3D bounding boxes.
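One way to picture the binding of text to boxes is a small per-object record; the fields below are hypothetical and only illustrate what such a conditioning unit might carry.

```python
# Minimal data-structure sketch of binding a text description to a 3D
# bounding box for per-object conditioning; field names are hypothetical.
from dataclasses import dataclass
import numpy as np

@dataclass
class ObjectCondition:
    prompt: str        # per-object text description
    center: np.ndarray # (3,) box center in scene coordinates
    size: np.ndarray   # (3,) box extents (w, h, d)
    yaw: float         # rotation about the up axis, in radians

layout = [
    ObjectCondition("a red armchair", np.array([0.5, 0.0, 2.0]),
                    np.array([0.8, 1.0, 0.8]), yaw=0.3),
    ObjectCondition("a floor lamp", np.array([-1.0, 0.0, 2.5]),
                    np.array([0.3, 1.6, 0.3]), yaw=0.0),
]
# A layout-conditioned diffusion model would embed each (prompt, box)
# pair and restrict that prompt's influence to its projected box region.
```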
arXiv Detail & Related papers (2026-01-20T15:13:43Z)
- Free-Form Scene Editor: Enabling Multi-Round Object Manipulation like in a 3D Engine [83.0145525456509]
We present FFSE, a 3D-aware framework designed to enable intuitive, physically consistent object editing on real-world images. Unlike previous approaches that either operate in image space or require slow and error-prone 3D reconstruction, FFSE models editing as a sequence of learned 3D transformations. To support learning of multi-round 3D-aware object manipulation, we introduce 3DObjectEditor.
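A sequence of 3D transformations composes by matrix product, which is what makes multi-round editing well-defined. A minimal numpy sketch of that composition (FFSE's transforms are learned; this only shows the algebra):

```python
# Sketch of multi-round editing as composed rigid transforms: each round
# contributes an SE(3) matrix, and the object's cumulative placement is
# their product. Illustrative only.
import numpy as np

def se3(R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Assemble a 4x4 rigid transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def rot_y(theta: float) -> np.ndarray:
    """Rotation about the vertical (y) axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

rounds = [
    se3(rot_y(0.5), np.array([0.2, 0.0, 0.0])),  # round 1: rotate + slide
    se3(np.eye(3), np.array([0.0, 0.0, -0.3])),  # round 2: pull closer
]
cumulative = np.eye(4)
for T in rounds:
    cumulative = T @ cumulative  # later edits act on the already-edited pose
```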
arXiv Detail & Related papers (2025-11-17T18:57:39Z)
- IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction [82.53307702809606]
Humans naturally perceive the geometric structure and semantic content of a 3D world as intertwined dimensions. We propose the Instance-Grounded Geometry Transformer (IGGT) to unify the knowledge for both spatial reconstruction and instance-level contextual understanding.
arXiv Detail & Related papers (2025-10-26T14:57:44Z)
- FreeInsert: Personalized Object Insertion with Geometric and Style Control [26.088650452374726]
We propose a training-free framework that customizes object insertion into arbitrary scenes by leveraging 3D geometric information. The rendered image, serving as geometric control, is combined with style and content control achieved through diffusion adapters.
arXiv Detail & Related papers (2025-09-25T05:26:10Z)
- A Controllable 3D Deepfake Generation Framework with Gaussian Splatting [6.969908558294805]
We propose a novel 3D deepfake generation framework based on 3D Gaussian Splatting. It enables realistic, identity-preserving face swapping and reenactment in a fully controllable 3D space. Our approach bridges the gap between 3D modeling and deepfake synthesis, enabling new directions for scene-aware, controllable, and immersive visual forgeries.
arXiv Detail & Related papers (2025-09-15T06:34:17Z)
- LACONIC: A 3D Layout Adapter for Controllable Image Creation [22.96293773013579]
Existing generative approaches for guided image synthesis rely on 2D controls in the image or text space. We propose a novel conditioning approach, training method, and adapter network that can be plugged into pretrained text-to-image diffusion models. Our method supports camera control, conditions generation on explicit 3D geometry, and, for the first time, accounts for the entire context of a scene.
arXiv Detail & Related papers (2025-07-04T02:25:36Z)
- FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views [100.45129752375658]
We present FLARE, a feed-forward model designed to infer high-quality camera poses and 3D geometry from uncalibrated sparse-view images. Our solution features a cascaded learning paradigm with camera pose serving as the critical bridge, recognizing its essential role in mapping 3D structures onto 2D image planes.
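The "bridge" role of camera pose can be stated as the standard pinhole projection, taking world points through the camera frame to pixels. This is textbook geometry rather than FLARE-specific code.

```python
# Standard pinhole projection: world points are mapped to pixels via
# extrinsics (world-to-camera pose) followed by intrinsics.
import numpy as np

def project(points_w: np.ndarray, T_wc: np.ndarray, K: np.ndarray):
    """points_w: (N,3) world points; T_wc: 4x4 world-to-camera pose;
    K: 3x3 intrinsics. Returns (N,2) pixel coordinates."""
    pts_h = np.concatenate([points_w, np.ones((len(points_w), 1))], axis=1)
    cam = (T_wc @ pts_h.T)[:3]  # camera-frame coordinates
    uv = K @ cam                # perspective projection
    return (uv[:2] / uv[2]).T

K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
pts = np.array([[0.0, 0.0, 2.0], [0.5, -0.2, 3.0]])
print(project(pts, np.eye(4), K))  # point on the axis lands at (320, 240)
```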
arXiv Detail & Related papers (2025-02-17T18:54:05Z)
- MagicDrive: Street View Generation with Diverse 3D Geometry Control [82.69871576797166]
We introduce MagicDrive, a novel street view generation framework, offering diverse 3D geometry controls.
Our design incorporates a cross-view attention module, ensuring consistency across multiple camera views.
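The cross-view attention idea reduces to tokens of one camera view attending over tokens of a neighboring view. The schematic single-head version below has no learned projections, so it shows only the mechanism, not MagicDrive's actual module.

```python
# Toy single-head cross-view attention between two camera views' tokens.
import numpy as np

def cross_view_attention(q_view: np.ndarray, kv_view: np.ndarray) -> np.ndarray:
    """q_view: (Nq, d) tokens of the current view; kv_view: (Nk, d) tokens
    of a neighboring view. Returns (Nq, d) attended features."""
    d = q_view.shape[-1]
    scores = q_view @ kv_view.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ kv_view

front = np.random.randn(16, 32)  # tokens from the front camera
left = np.random.randn(16, 32)   # tokens from the front-left camera
fused = cross_view_attention(front, left)
```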
arXiv Detail & Related papers (2023-10-04T06:14:06Z)
- Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold [79.94300820221996]
DragGAN is a new way of controlling generative adversarial networks (GANs).
DragGAN allows anyone to deform an image with precise control over where pixels go, thus manipulating the pose, shape, expression, and layout of diverse categories such as animals, cars, humans, landscapes, etc.
Both qualitative and quantitative comparisons demonstrate the advantage of DragGAN over prior approaches in the tasks of image manipulation and point tracking.
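The point-tracking step can be sketched as a nearest-neighbor search for the handle point's initial feature within a small window around its current position; this is a simplified rendering of the idea, not the paper's exact implementation.

```python
# Simplified DragGAN-style point tracking: re-localize the handle as the
# nearest neighbor of its initial feature inside a small search window.
import numpy as np

def track_point(feat: np.ndarray, ref: np.ndarray, p: tuple, r: int = 3):
    """feat: (H, W, C) feature map; ref: (C,) the handle's initial feature;
    p: current (row, col). Returns the updated (row, col)."""
    H, W, _ = feat.shape
    best, best_d = p, np.inf
    for i in range(max(0, p[0] - r), min(H, p[0] + r + 1)):
        for j in range(max(0, p[1] - r), min(W, p[1] + r + 1)):
            d = np.linalg.norm(feat[i, j] - ref)
            if d < best_d:
                best, best_d = (i, j), d
    return best

fmap = np.random.randn(32, 32, 64)
handle = (10, 12)
new_pos = track_point(fmap, fmap[10, 12].copy(), handle)
```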
arXiv Detail & Related papers (2023-05-18T13:41:25Z)
- GDRNPP: A Geometry-guided and Fully Learning-based Object Pose Estimator [51.89441403642665]
6D pose estimation of rigid objects is a long-standing and challenging task in computer vision. Recently, the emergence of deep learning has revealed the potential of Convolutional Neural Networks (CNNs) to predict reliable 6D poses. This paper introduces a fully learning-based object pose estimator.
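Fully learning-based rotation regression commonly uses the continuous 6D parameterization, two raw 3-vectors orthonormalized by Gram-Schmidt; the sketch below shows that mapping as a representative technique for this line of work, not as code from GDRNPP.

```python
# Continuous 6D rotation parameterization: map two raw 3-vectors
# (e.g. network outputs) to a valid rotation matrix via Gram-Schmidt.
import numpy as np

def rotation_from_6d(a1: np.ndarray, a2: np.ndarray) -> np.ndarray:
    """Return R in SO(3) whose first two columns are built from a1, a2."""
    b1 = a1 / np.linalg.norm(a1)
    b2 = a2 - (b1 @ a2) * b1
    b2 /= np.linalg.norm(b2)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=1)  # columns form an orthonormal basis

R = rotation_from_6d(np.array([1., 0.1, 0.]), np.array([0., 1., 0.2]))
print(np.allclose(R.T @ R, np.eye(3)))  # True: R is orthonormal
```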
arXiv Detail & Related papers (2021-02-24T09:11:31Z)