Articulate3D: Zero-Shot Text-Driven 3D Object Posing
- URL: http://arxiv.org/abs/2508.19244v1
- Date: Tue, 26 Aug 2025 17:59:17 GMT
- Title: Articulate3D: Zero-Shot Text-Driven 3D Object Posing
- Authors: Oishi Deb, Anjun Hu, Ashkan Khakzar, Philip Torr, Christian Rupprecht,
- Abstract summary: We propose a training-free method, Articulate3D, to pose a 3D asset through language control. We modify a powerful image-generator to create target images conditioned on the input image and a text instruction. We then align the mesh to the target images through a multi-view pose optimisation step.
- Score: 38.75075284385844
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a training-free method, Articulate3D, to pose a 3D asset through language control. Despite advances in vision and language models, this task remains surprisingly challenging. To achieve this goal, we decompose the problem into two steps. We modify a powerful image-generator to create target images conditioned on the input image and a text instruction. We then align the mesh to the target images through a multi-view pose optimisation step. In detail, we introduce a self-attention rewiring mechanism (RSActrl) that decouples the source structure from pose within an image generative model, allowing it to maintain a consistent structure across varying poses. We observed that differentiable rendering is an unreliable signal for articulation optimisation; instead, we use keypoints to establish correspondences between input and target images. The effectiveness of Articulate3D is demonstrated across a diverse range of 3D objects and free-form text prompts, successfully manipulating poses while maintaining the original identity of the mesh. Quantitative evaluations and a comparative user study, in which our method was preferred over 85% of the time, confirm its superiority over existing approaches. Project page: https://odeb1.github.io/articulate3d_page_deb/
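The abstract's second step, aligning a mesh to generated target images via keypoint correspondences rather than differentiable rendering, can be illustrated with a minimal sketch. This is not the authors' code: it assumes a toy planar two-link chain (a hypothetical stand-in for an articulated part), known keypoint correspondences, and finite-difference gradient descent on reprojection error.

```python
import numpy as np

def forward_kinematics(angles, link_len=1.0):
    """Keypoints (base, elbow, tip) of a planar 2-link chain for given joint angles."""
    a1, a2 = angles
    base = np.zeros(2)
    elbow = base + link_len * np.array([np.cos(a1), np.sin(a1)])
    tip = elbow + link_len * np.array([np.cos(a1 + a2), np.sin(a1 + a2)])
    return np.stack([base, elbow, tip])

def keypoint_loss(angles, target_kpts):
    """Sum of squared distances between posed and target keypoints."""
    return np.sum((forward_kinematics(angles) - target_kpts) ** 2)

def optimise_pose(target_kpts, init=(0.0, 0.0), lr=0.1, steps=500, eps=1e-4):
    """Gradient descent on joint angles via central finite differences."""
    angles = np.array(init, dtype=float)
    for _ in range(steps):
        grad = np.zeros(2)
        for i in range(2):
            d = np.zeros(2)
            d[i] = eps
            grad[i] = (keypoint_loss(angles + d, target_kpts)
                       - keypoint_loss(angles - d, target_kpts)) / (2 * eps)
        angles -= lr * grad
    return angles

# Target pose: first joint at 45 degrees, second bent a further 30 degrees.
target = forward_kinematics(np.array([np.pi / 4, np.pi / 6]))
recovered = optimise_pose(target)
```

In the paper's setting, the target keypoints would come from the generated multi-view images and the articulation model would be the mesh's own kinematic structure; the sketch only shows why keypoint reprojection error gives a smooth, well-behaved optimisation signal.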
Related papers
- Controllable 3D Object Generation with Single Image Prompt [2.4622211579286133]
3D object generation tasks are one of the fastest-growing segments in computer vision. Text-to-image generative models employ textual inversion to learn concepts or styles of target object in the pseudo text prompt embedding space. We propose two innovative methods: (1) using an off-the-shelf image adapter that generates 3D objects without textual inversion, offering enhanced control over conditions such as depth, pose, and text; and (2) a depth conditioned warmup strategy to enhance 3D consistency.
arXiv Detail & Related papers (2025-11-27T08:03:56Z) - NVSMask3D: Hard Visual Prompting with Camera Pose Interpolation for 3D Open Vocabulary Instance Segmentation [14.046423852723615]
We introduce a novel 3D Gaussian Splatting based hard visual prompting approach to generate diverse viewpoints around target objects. Our method simulates realistic 3D perspectives, effectively augmenting existing hard visual prompts. This training-free strategy integrates seamlessly with prior hard visual prompts, enriching object-descriptive features.
arXiv Detail & Related papers (2025-04-20T14:39:27Z) - ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models [65.22994156658918]
We present a method that learns to generate multi-view images in a single denoising process from real-world data.
We design an autoregressive generation that renders more 3D-consistent images at any viewpoint.
arXiv Detail & Related papers (2024-03-04T07:57:05Z) - TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes [67.5351491691866]
We present a novel framework, dubbed TeMO, to parse multi-object 3D scenes and edit their styles.
Our method can synthesize high-quality stylized content and outperform the existing methods over a wide range of multi-object 3D meshes.
arXiv Detail & Related papers (2023-12-07T12:10:05Z) - InstructPix2NeRF: Instructed 3D Portrait Editing from a Single Image [25.076270175205593]
InstructPix2NeRF enables instructed 3D-aware portrait editing from a single open-world image with human instructions.
At its core lies a conditional latent 3D diffusion process that lifts 2D editing to 3D space by learning the correlation between the paired images' difference and the instructions via triplet data.
arXiv Detail & Related papers (2023-11-06T02:21:11Z) - Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training [51.632418297156605]
We introduce MixCon3D, a method aiming to sculpt holistic 3D representation in contrastive language-image-3D pre-training.
We develop the 3D object-level representation from complementary perspectives, e.g., multi-view rendered images with the point cloud.
Then, MixCon3D performs language-3D contrastive learning, comprehensively depicting real-world 3D objects and bolstering text alignment.
arXiv Detail & Related papers (2023-11-03T06:05:36Z) - Guide3D: Create 3D Avatars from Text and Image Guidance [55.71306021041785]
Guide3D is a text-and-image-guided generative model for 3D avatar generation based on diffusion models.
Our framework produces topologically and structurally correct geometry and high-resolution textures.
arXiv Detail & Related papers (2023-08-18T17:55:47Z) - Bridging the Domain Gap: Self-Supervised 3D Scene Understanding with Foundation Models [18.315856283440386]
Foundation models have achieved remarkable results in 2D and language tasks like image segmentation, object detection, and visual-language understanding.
Their potential to enrich 3D scene representation learning is largely untapped due to the existence of the domain gap.
We propose an innovative methodology called Bridge3D to address this gap by pre-training 3D models using features, semantic masks, and sourced captions from foundation models.
arXiv Detail & Related papers (2023-05-15T16:36:56Z) - End-to-End Learning of Multi-category 3D Pose and Shape Estimation [128.881857704338]
We propose an end-to-end method that simultaneously detects 2D keypoints from an image and lifts them to 3D.
The proposed method learns both 2D detection and 3D lifting only from 2D keypoints annotations.
In addition to being end-to-end in image to 3D learning, our method also handles objects from multiple categories using a single neural network.
arXiv Detail & Related papers (2021-12-19T17:10:40Z) - Unsupervised Learning of Visual 3D Keypoints for Control [104.92063943162896]
Learning sensorimotor control policies from high-dimensional images crucially relies on the quality of the underlying visual representations.
We propose a framework to learn such a 3D geometric structure directly from images in an end-to-end unsupervised manner.
These discovered 3D keypoints tend to meaningfully capture robot joints as well as object movements in a consistent manner across both time and 3D space.
arXiv Detail & Related papers (2021-06-14T17:59:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.