Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes
- URL: http://arxiv.org/abs/2601.02356v2
- Date: Thu, 08 Jan 2026 03:56:27 GMT
- Title: Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes
- Authors: Jing Tan, Zhaoyang Zhang, Yantao Shen, Jiarui Cai, Shuo Yang, Jiajun Wu, Wei Xia, Zhuowen Tu, Stefano Soatto
- Abstract summary: We introduce Talk2Move, a framework for text-instructed spatial transformation of objects within scenes. Talk2Move employs Group Relative Policy Optimization to explore geometric actions through diverse rollouts. Experiments on curated benchmarks demonstrate that Talk2Move achieves precise, consistent, and semantically faithful object transformations.
- Score: 69.4534914304302
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Talk2Move, a reinforcement learning (RL) based diffusion framework for text-instructed spatial transformation of objects within scenes. Spatially manipulating objects in a scene through natural language poses a challenge for multimodal generation systems. While existing text-based manipulation methods can adjust appearance or style, they struggle to perform object-level geometric transformations (such as translating, rotating, or resizing objects) due to scarce paired supervision and pixel-level optimization limits. Talk2Move employs Group Relative Policy Optimization (GRPO) to explore geometric actions through diverse rollouts generated from input images and lightweight textual variations, removing the need for costly paired data. A spatial-reward-guided model aligns geometric transformations with linguistic description, while off-policy step evaluation and active step sampling improve learning efficiency by focusing on informative transformation stages. Furthermore, we design object-centric spatial rewards that evaluate displacement, rotation, and scaling behaviors directly, enabling interpretable and coherent transformations. Experiments on curated benchmarks demonstrate that Talk2Move achieves precise, consistent, and semantically faithful object transformations, outperforming existing text-guided editing approaches in both spatial accuracy and scene coherence.
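As a rough illustration of the GRPO idea named in the abstract, the sketch below scores a group of rollouts for one instruction and normalizes each reward against the group's mean and standard deviation. The `displacement_reward` function, the bounding-box centers, and the target direction are hypothetical stand-ins; the abstract does not say how Talk2Move's spatial rewards are actually computed.

```python
import math
from statistics import mean, pstdev


def displacement_reward(center_before, center_after, target_direction):
    """Hypothetical reward: cosine between the object's 2D displacement
    and the instructed direction (1 = aligned, -1 = opposite)."""
    dx = center_after[0] - center_before[0]
    dy = center_after[1] - center_before[1]
    norm_d = math.hypot(dx, dy)
    norm_t = math.hypot(target_direction[0], target_direction[1])
    if norm_d == 0.0 or norm_t == 0.0:
        return 0.0  # object did not move, or no direction was given
    dot = dx * target_direction[0] + dy * target_direction[1]
    return dot / (norm_d * norm_t)


def group_relative_advantages(rewards, eps=1e-8):
    """Group-relative advantage: each reward standardized within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed example: four rollouts for "move the cup to the right",
# in image coordinates where +x points right.
before = (100.0, 200.0)
after_centers = [(160.0, 200.0), (100.0, 200.0), (60.0, 200.0), (130.0, 230.0)]
rewards = [displacement_reward(before, c, (1.0, 0.0)) for c in after_centers]
advantages = group_relative_advantages(rewards)
```

Because the advantages are computed relative to the group, rollouts that move the object in the instructed direction receive positive values and the rest negative ones, without any paired ground-truth image.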
Related papers
- Two-Stream Interactive Joint Learning of Scene Parsing and Geometric Vision Tasks [24.19752468668527]
Two Interactive Streams (TwInS) is a novel bio-inspired joint learning framework capable of simultaneously performing scene parsing and geometric vision tasks. To eliminate the dependence on costly human-annotated correspondence ground truth, TwInS is equipped with a tailored semi-supervised training strategy.
arXiv Detail & Related papers (2026-02-14T04:11:19Z)
- Copy-Trasform-Paste: Zero-Shot Object-Object Alignment Guided by Vision-Language and Geometric Constraints [12.704390013489054]
We study zero-shot 3D alignment of two given meshes, using a text prompt describing their relation. We optimize the relative pose at test time, updating translation, rotation, and isotropic scale with CLIP-driven gradients. Our method outperforms all alternatives, yielding semantically faithful and physically plausible alignments.
arXiv Detail & Related papers (2026-01-20T18:12:55Z)
- Mash, Spread, Slice! Learning to Manipulate Object States via Visual Spatial Progress [53.723881111373736]
We present SPARTA, the first unified framework for the family of object state change manipulation tasks. SPARTA integrates spatially progressing object change segmentation maps, a visual skill to perceive actionable vs. transformed regions, and dense rewards that capture incremental progress over time. We validate SPARTA on a real robot for three challenging tasks across 10 diverse real-world objects.
arXiv Detail & Related papers (2025-09-28T23:56:07Z)
- DanceText: A Training-Free Layered Framework for Controllable Multilingual Text Transformation in Images [28.48453375674059]
DanceText is a training-free framework for multilingual text editing in images. It supports complex geometric transformations and achieves seamless foreground-background integration.
arXiv Detail & Related papers (2025-04-18T23:46:32Z)
- IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments [56.85804719947]
We present IAAO, a framework that builds an explicit 3D model for intelligent agents to gain understanding of articulated objects in their environment through interaction. We first build hierarchical features and label fields for each object state using 3D Gaussian Splatting (3DGS) by distilling mask features and view-consistent labels from multi-view images. We then perform object- and part-level queries on the 3D Gaussian primitives to identify static and articulated elements, estimating global transformations and local articulation parameters along with affordances.
arXiv Detail & Related papers (2025-04-09T12:36:48Z)
- Dynamic Scene Understanding through Object-Centric Voxelization and Neural Rendering [57.895846642868904]
We present a 3D generative model named DynaVol-S for dynamic scenes that enables object-centric learning. Voxelization infers per-object occupancy probabilities at individual spatial locations. Our approach integrates 2D semantic features to create 3D semantic grids, representing the scene through multiple disentangled voxel grids.
arXiv Detail & Related papers (2024-07-30T15:33:58Z)
- Equivariant Spatio-Temporal Self-Supervision for LiDAR Object Detection [37.142470149311904]
We propose a spatio-temporal equivariant learning framework by considering both spatial and temporal augmentations jointly.
We show that our pre-training method for 3D object detection outperforms existing equivariant and invariant approaches in many settings.
arXiv Detail & Related papers (2024-04-17T20:41:49Z)
- O2V-Mapping: Online Open-Vocabulary Mapping with Neural Implicit Representation [9.431926560072412]
We propose O2V-mapping, which utilizes voxel-based language and geometric features to create an open-vocabulary field. Experiments on open-vocabulary object localization and semantic segmentation demonstrate that O2V-mapping achieves online construction of language scenes.
arXiv Detail & Related papers (2024-04-10T08:54:43Z)
- SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes [59.23385953161328]
Novel view synthesis for dynamic scenes is still a challenging problem in computer vision and graphics.
We propose a new representation that explicitly decomposes the motion and appearance of dynamic scenes into sparse control points and dense Gaussians.
Our method can enable user-controlled motion editing while retaining high-fidelity appearances.
arXiv Detail & Related papers (2023-12-04T11:57:14Z)
- SemanticBoost: Elevating Motion Generation with Augmented Textual Cues [73.83255805408126]
Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD).
The CAMD approach provides an all-encompassing solution for generating high-quality, semantically consistent motion sequences.
Our experimental results demonstrate that SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based techniques.
arXiv Detail & Related papers (2023-10-31T09:58:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.