Copy-Transform-Paste: Zero-Shot Object-Object Alignment Guided by Vision-Language and Geometric Constraints
- URL: http://arxiv.org/abs/2601.14207v1
- Date: Tue, 20 Jan 2026 18:12:55 GMT
- Title: Copy-Transform-Paste: Zero-Shot Object-Object Alignment Guided by Vision-Language and Geometric Constraints
- Authors: Rotem Gatenyo, Ohad Fried
- Abstract summary: We study zero-shot 3D alignment of two given meshes, using a text prompt describing their relation. We optimize the relative pose at test time, updating translation, rotation, and isotropic scale with CLIP-driven gradients. Our method outperforms all alternatives, yielding semantically faithful and physically plausible alignments.
- Score: 12.704390013489054
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We study zero-shot 3D alignment of two given meshes, using a text prompt describing their spatial relation -- an essential capability for content creation and scene assembly. Earlier approaches primarily rely on geometric alignment procedures, while recent work leverages pretrained 2D diffusion models to model language-conditioned object-object spatial relationships. In contrast, we directly optimize the relative pose at test time, updating translation, rotation, and isotropic scale with CLIP-driven gradients via a differentiable renderer, without training a new model. Our framework augments language supervision with geometry-aware objectives: a variant of soft-Iterative Closest Point (ICP) term to encourage surface attachment and a penetration loss to discourage interpenetration. A phased schedule strengthens contact constraints over time, and camera control concentrates the optimization on the interaction region. To enable evaluation, we curate a benchmark containing diverse categories and relations, and compare against baselines. Our method outperforms all alternatives, yielding semantically faithful and physically plausible alignments.
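The geometry-aware part of the objective described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the CLIP term and the differentiable renderer are omitted (represented only by an optional `semantic_term` callable), the rotation is restricted to a single yaw angle, the target object is approximated by a sphere for the penetration check, and the linear ramp in `contact_weight` is just one possible phased schedule. All function and parameter names are hypothetical.

```python
import math

def apply_pose(points, scale, yaw, translation):
    """Apply isotropic scale, a rotation about the z-axis, then a translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    return [(scale * (c * x - s * y) + tx,
             scale * (s * x + c * y) + ty,
             scale * z + tz) for x, y, z in points]

def soft_attachment_loss(src, dst, temperature=0.05):
    """Soft-ICP-style term: softmin-weighted squared distance from each
    source point to the target points, averaged over the source points."""
    total = 0.0
    for p in src:
        d2 = [sum((a - b) ** 2 for a, b in zip(p, q)) for q in dst]
        m = min(d2)  # subtract the minimum for numerical stability
        w = [math.exp(-(d - m) / temperature) for d in d2]
        total += sum(wi * di for wi, di in zip(w, d2)) / sum(w)
    return total / len(src)

def penetration_loss(src, center, radius):
    """Mean squared penetration depth of source points inside a sphere
    that stands in for the target object (an assumption of this sketch)."""
    total = 0.0
    for p in src:
        depth = radius - math.dist(p, center)
        total += max(depth, 0.0) ** 2
    return total / len(src)

def contact_weight(step, total_steps, w_max=1.0):
    """Assumed phased schedule: ramp the contact weight linearly to w_max."""
    return w_max * min(1.0, step / max(1, total_steps - 1))

def total_loss(params, src, dst, center, radius, step, total_steps,
               semantic_term=None):
    """Combine the geometric terms; semantic_term is a placeholder for a
    language-driven loss evaluated on the transformed points."""
    scale, yaw, tx, ty, tz = params
    moved = apply_pose(src, scale, yaw, (tx, ty, tz))
    w = contact_weight(step, total_steps)
    geometric = soft_attachment_loss(moved, dst) + penetration_loss(moved, center, radius)
    semantic = semantic_term(moved) if semantic_term is not None else 0.0
    return semantic + w * geometric
```

In a real pipeline these terms would be written in an autodiff framework so that gradients flow back to the pose parameters; the pure-Python version here only shows how the attachment and penetration terms pull in opposite directions and how a schedule can strengthen them over time.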
Related papers
- Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes [69.4534914304302]
We introduce Talk2Move, a framework for text-instructed spatial transformation of objects within scenes. Talk2Move employs Group Relative Policy Optimization to explore geometric actions through diverse rollouts. Experiments on curated benchmarks demonstrate that Talk2Move achieves precise, consistent, and semantically faithful object transformations.
arXiv Detail & Related papers (2026-01-05T18:55:32Z)
- Video Spatial Reasoning with Object-Centric 3D Rollout [58.12446467377404]
We propose Object-Centric 3D Rollout (OCR) to enable robust video spatial reasoning. OCR introduces structured perturbations to the 3D geometry of selected objects during training. OCR compels the model to reason holistically across the entire scene.
arXiv Detail & Related papers (2025-11-17T09:53:41Z)
- Follow My Hold: Hand-Object Interaction Reconstruction through Geometric Guidance [61.41904916189093]
We propose a novel diffusion-based framework for reconstructing 3D geometry of hand-held objects from monocular RGB images. We use hand-object interaction as geometric guidance to ensure plausible hand-object interactions.
arXiv Detail & Related papers (2025-08-25T17:11:53Z)
- Why Settle for Mid: A Probabilistic Viewpoint to Spatial Relationship Alignment in Text-to-image Models [3.5999252362400993]
A prevalent issue in compositional generation is the misalignment of spatial relationships. We introduce a novel evaluation metric designed to assess the alignment of 2D and 3D spatial relationships between text and image. We also propose PoS-based Generation, an inference-time method that improves the alignment of 2D and 3D spatial relationships in T2I models without requiring fine-tuning.
arXiv Detail & Related papers (2025-06-29T22:41:27Z)
- Hierarchical Context Alignment with Disentangled Geometric and Temporal Modeling for Semantic Occupancy Prediction [61.484280369655536]
Camera-based 3D Semantic Occupancy Prediction (SOP) is crucial for understanding complex 3D scenes from limited 2D image observations. Existing SOP methods typically aggregate contextual features to assist the occupancy representation learning. We introduce a new Hierarchical context alignment paradigm for a more accurate SOP (Hi-SOP).
arXiv Detail & Related papers (2024-12-11T09:53:10Z)
- SeMv-3D: Towards Concurrency of Semantic and Multi-view Consistency in General Text-to-3D Generation [122.47961178994456]
SeMv-3D is a novel framework that jointly enhances semantic alignment and multi-view consistency in GT23D generation. At its core, we introduce Triplane Prior Learning (TPL), which effectively learns triplane priors. We also present Prior-based Semantic Aligning in Triplanes (SAT), which enables consistent any-view synthesis.
arXiv Detail & Related papers (2024-10-10T07:02:06Z)
- Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion. Our method ranks 1st on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.