Ref-SAM3D: Bridging SAM3D with Text for Reference 3D Reconstruction
- URL: http://arxiv.org/abs/2511.19426v1
- Date: Mon, 24 Nov 2025 18:58:22 GMT
- Title: Ref-SAM3D: Bridging SAM3D with Text for Reference 3D Reconstruction
- Authors: Yun Zhou, Yaoting Wang, Guangquan Jie, Jinyu Liu, Henghui Ding
- Abstract summary: Ref-SAM3D is a simple yet effective extension to SAM3D that incorporates textual descriptions as a high-level prior. We show that Ref-SAM3D, guided only by natural language and a single 2D view, delivers competitive and high-fidelity zero-shot reconstruction performance.
- Score: 45.27825308128629
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: SAM3D has garnered widespread attention for its strong 3D object reconstruction capabilities. However, a key limitation remains: SAM3D cannot reconstruct specific objects referred to by textual descriptions, a capability that is essential for practical applications such as 3D editing, game development, and virtual environments. To address this gap, we introduce Ref-SAM3D, a simple yet effective extension to SAM3D that incorporates textual descriptions as a high-level prior, enabling text-guided 3D reconstruction from a single RGB image. Through extensive qualitative experiments, we show that Ref-SAM3D, guided only by natural language and a single 2D view, delivers competitive and high-fidelity zero-shot reconstruction performance. Our results demonstrate that Ref-SAM3D effectively bridges the gap between 2D visual cues and 3D geometric understanding, offering a more flexible and accessible paradigm for reference-guided 3D reconstruction. Code is available at: https://github.com/FudanCVL/Ref-SAM3D.
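The abstract gives no implementation details, but the flow it describes (use the text to pick out the referred object, then reconstruct it from the single view) can be sketched at a purely schematic level. Everything below is hypothetical: `ground_text_to_mask`, `reconstruct_3d`, and `ref_sam3d` are placeholder names standing in for a text-grounded segmenter and SAM3D's reconstruction stage, not the authors' released code.

```python
import numpy as np

def ground_text_to_mask(image: np.ndarray, text: str) -> np.ndarray:
    """Hypothetical text-grounding stage: in a real system this would be
    an open-vocabulary segmenter returning a binary mask for the object
    the text refers to. Here it is a fixed stub region."""
    h, w = image.shape[:2]
    mask = np.zeros((h, w), dtype=bool)
    mask[h // 4 : 3 * h // 4, w // 4 : 3 * w // 4] = True
    return mask

def reconstruct_3d(image: np.ndarray, mask: np.ndarray) -> dict:
    """Hypothetical stand-in for SAM3D's single-image reconstruction,
    consuming the RGB view plus the selected object's mask."""
    return {"masked_rgb": image * mask[..., None]}  # placeholder "asset"

def ref_sam3d(image: np.ndarray, referring_text: str) -> dict:
    mask = ground_text_to_mask(image, referring_text)  # text as a prior
    return reconstruct_3d(image, mask)                 # reconstruct it

rgb = np.random.rand(256, 256, 3).astype(np.float32)
asset = ref_sam3d(rgb, "the red mug on the table")
print(asset["masked_rgb"].shape)  # (256, 256, 3)
```

The point of the sketch is the interface: text enters only as a high-level prior that selects the object, while the reconstruction stage itself is left unchanged, which matches the paper's framing of Ref-SAM3D as a simple extension.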
Related papers
- MV-SAM: Multi-view Promptable Segmentation using Pointmap Guidance [79.57732829495843]
We introduce MV-SAM, a framework for multi-view segmentation that achieves 3D consistency using pointmaps. MV-SAM lifts images and prompts into 3D space, eliminating the need for explicit 3D networks or annotated 3D data.
arXiv Detail & Related papers (2026-01-25T15:00:37Z)
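MV-SAM's central device, per the summary above, is the pointmap: a per-pixel 3D coordinate map that lets 2D prompts be lifted into a shared 3D space. The generic sketch below (not MV-SAM's actual interface; the camera convention and all names are illustrative) shows why this removes the need for explicit 3D networks: lifting a click is an array lookup, and reusing it in another view is an ordinary pinhole projection.

```python
import numpy as np

def lift_click_to_3d(pointmap: np.ndarray, u: int, v: int) -> np.ndarray:
    """A pointmap is an (H, W, 3) array of per-pixel 3D points, so
    lifting a 2D click (u, v) into 3D is just an indexing operation."""
    return pointmap[v, u]

def project_to_view(p3d: np.ndarray, K: np.ndarray,
                    R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Pinhole re-projection of a world point into another view,
    assuming the convention x_cam = R @ x_world + t."""
    x_img = K @ (R @ p3d + t)
    return x_img[:2] / x_img[2]  # pixel coordinates (u, v)

# Lift a click in view A, then reuse it as a prompt in view B.
pointmap_a = np.random.rand(480, 640, 3)             # toy pointmap
p3d = lift_click_to_3d(pointmap_a, u=320, v=240)
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])                      # toy intrinsics
R, t = np.eye(3), np.array([0.1, 0.0, 0.5])          # toy relative pose
print(project_to_view(p3d, K, R, t))
```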
- SAM 3D for 3D Object Reconstruction from Remote Sensing Images [3.893451853752809]
This paper presents the first systematic evaluation of SAM 3D, a general-purpose image-to-3D foundation model. Experimental results demonstrate that SAM 3D produces more coherent roof geometry and sharper boundaries compared to TRELLIS.
arXiv Detail & Related papers (2025-12-27T03:47:39Z)
- SAM 3D: 3Dfy Anything in Images [99.1053358868456]
We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.
arXiv Detail & Related papers (2025-11-20T18:31:46Z)
- SR3D: Unleashing Single-view 3D Reconstruction for Transparent and Specular Object Grasping [7.222966501323922]
We propose a training-free framework, SR3D, that enables robotic grasping of transparent and specular objects from a single-view observation. Specifically, given single-view RGB and depth images, SR3D first uses external visual models to generate a reconstructed 3D object mesh. The key idea is then to determine the 3D object's pose and scale in order to accurately localize the reconstructed object back into its original depth-corrupted 3D scene.
arXiv Detail & Related papers (2025-05-30T07:38:46Z)
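SR3D's key step per the summary, recovering the pose and scale that place the reconstructed mesh back into the observed depth scene, is in spirit a similarity-transform alignment between corresponding 3D points. The closed-form Umeyama-style solver below is a generic sketch of that sub-problem under the assumption of known correspondences; it is not SR3D's actual algorithm.

```python
import numpy as np

def similarity_align(src: np.ndarray, dst: np.ndarray):
    """Closed-form (Umeyama) estimate of scale s, rotation R, and
    translation t minimizing ||s * R @ src_i + t - dst_i||^2 over
    corresponding 3D points src, dst of shape (N, 3)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)        # cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                      # enforce a proper rotation
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_d - s * R @ mu_s
    return s, R, t

# Synthetic check: recover a known scale/rotation/translation.
rng = np.random.default_rng(0)
src = rng.random((100, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
dst = 2.0 * src @ R_true.T + np.array([1.0, -2.0, 0.5])
s, R, t = similarity_align(src, dst)
print(round(s, 3))  # ~2.0
```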
- Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images [66.77399370856462]
Amodal3R is a conditional 3D generative model designed to reconstruct 3D objects from partial observations. It learns to recover full 3D objects even in the presence of occlusions in real scenes. It substantially outperforms existing methods that independently perform 2D amodal completion followed by 3D reconstruction.
arXiv Detail & Related papers (2025-03-17T17:59:01Z)
- ShapeLLM: Universal 3D Object Understanding for Embodied Interaction [37.0434133128805]
This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction.
ShapeLLM is built upon an improved 3D encoder by extending ReCon to ReCon++.
ShapeLLM is trained on constructed instruction-following data and tested on our newly human-curated benchmark, 3D MM-Vet.
arXiv Detail & Related papers (2024-02-27T18:57:12Z)
- Anything-3D: Towards Single-view Anything Reconstruction in the Wild [61.090129285205805]
We introduce Anything-3D, a methodical framework that ingeniously combines a series of visual-language models and the Segment-Anything object segmentation model.
Our approach employs a BLIP model to generate textual descriptions, utilizes the Segment-Anything model for the effective extraction of objects of interest, and leverages a text-to-image diffusion model to lift the object into a neural radiance field.
arXiv Detail & Related papers (2023-04-19T16:39:51Z)
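The three-stage composition described above (caption, segment, lift) is easy to express as a pipeline skeleton. The stubs below only mark where BLIP, Segment-Anything, and the diffusion-guided NeRF optimization would plug in; their signatures and return values are assumptions for illustration, not the paper's API.

```python
import numpy as np

def caption(image: np.ndarray) -> str:
    """Stand-in for the BLIP captioning stage."""
    return "a wooden chair"          # hypothetical caption

def segment(image: np.ndarray) -> np.ndarray:
    """Stand-in for Segment-Anything's object extraction."""
    return np.ones(image.shape[:2], dtype=bool)

def lift_to_nerf(image: np.ndarray, mask: np.ndarray, text: str) -> dict:
    """Stand-in for the diffusion-guided lifting stage; a real system
    would optimize a radiance field under text-conditioned guidance."""
    return {"text": text, "object_pixels": int(mask.sum())}

def anything_3d(image: np.ndarray) -> dict:
    text = caption(image)       # 1) describe the image
    mask = segment(image)       # 2) extract the object of interest
    return lift_to_nerf(image, mask, text)  # 3) lift to a NeRF

print(anything_3d(np.zeros((64, 64, 3))))
```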
- MobileBrick: Building LEGO for 3D Reconstruction on Mobile Devices [78.20154723650333]
High-quality 3D ground-truth shapes are critical for 3D object reconstruction evaluation.
We introduce a novel multi-view RGBD dataset captured using a mobile device.
We obtain precise 3D ground-truth shapes without relying on high-end 3D scanners.
arXiv Detail & Related papers (2023-03-03T14:02:50Z)
- Monocular 3D Object Reconstruction with GAN Inversion [122.96094885939146]
MeshInversion is a novel framework to improve the reconstruction of textured 3D meshes.
It exploits the generative prior of a 3D GAN pre-trained for 3D textured mesh synthesis.
Our framework obtains faithful 3D reconstructions with consistent geometry and texture across both observed and unobserved parts.
arXiv Detail & Related papers (2022-07-20T17:47:22Z)
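GAN inversion, the technique this entry names, means freezing a pre-trained generator and optimizing only a latent code so the (differentiably rendered) output matches the observed image. The loop below is a minimal sketch of that idea with toy stand-ins for the generator and renderer; MeshInversion's actual pipeline uses a textured-mesh GAN, a real differentiable renderer, and richer losses.

```python
import torch

# Toy stand-ins: a real setup uses a pre-trained textured-mesh GAN and a
# differentiable renderer. Only the latent code z is optimized.
generator = torch.nn.Linear(64, 3 * 32 * 32)    # latent -> "mesh params"
for p in generator.parameters():
    p.requires_grad_(False)                     # generator stays frozen

def render(mesh_params: torch.Tensor) -> torch.Tensor:
    return mesh_params.view(3, 32, 32)          # fake differentiable render

target = torch.rand(3, 32, 32)                  # the observed 2D image
z = torch.zeros(64, requires_grad=True)         # latent code to invert
opt = torch.optim.Adam([z], lr=1e-2)

for step in range(200):
    opt.zero_grad()
    image = render(generator(z))
    # Image-space reconstruction loss; real methods add perceptual,
    # silhouette, and latent-prior regularization terms on top.
    loss = torch.nn.functional.mse_loss(image, target)
    loss.backward()
    opt.step()

print(float(loss))  # z now approximately reproduces the target view
```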
This list is automatically generated from the titles and abstracts of the papers on this site.