Ref-SAM3D: Bridging SAM3D with Text for Reference 3D Reconstruction
- URL: http://arxiv.org/abs/2511.19426v1
- Date: Mon, 24 Nov 2025 18:58:22 GMT
- Title: Ref-SAM3D: Bridging SAM3D with Text for Reference 3D Reconstruction
- Authors: Yun Zhou, Yaoting Wang, Guangquan Jie, Jinyu Liu, Henghui Ding
- Abstract summary: Ref-SAM3D is a simple yet effective extension to SAM3D that incorporates textual descriptions as a high-level prior. We show that Ref-SAM3D, guided only by natural language and a single 2D view, delivers competitive and high-fidelity zero-shot reconstruction performance.
- Score: 45.27825308128629
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: SAM3D has garnered widespread attention for its strong 3D object reconstruction capabilities. However, a key limitation remains: SAM3D cannot reconstruct specific objects referred to by textual descriptions, a capability that is essential for practical applications such as 3D editing, game development, and virtual environments. To address this gap, we introduce Ref-SAM3D, a simple yet effective extension to SAM3D that incorporates textual descriptions as a high-level prior, enabling text-guided 3D reconstruction from a single RGB image. Through extensive qualitative experiments, we show that Ref-SAM3D, guided only by natural language and a single 2D view, delivers competitive and high-fidelity zero-shot reconstruction performance. Our results demonstrate that Ref-SAM3D effectively bridges the gap between 2D visual cues and 3D geometric understanding, offering a more flexible and accessible paradigm for reference-guided 3D reconstruction. Code is available at: https://github.com/FudanCVL/Ref-SAM3D.
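The abstract gives no implementation details, but the flow it describes (use the text to pick out the referred object, then reconstruct it from the single view) can be sketched at a purely schematic level. Everything below is hypothetical: `ground_text_to_mask`, `reconstruct_3d`, and `ref_sam3d` are placeholder names standing in for a text-grounded segmenter and SAM3D's reconstruction stage, not the authors' released code.

```python
import numpy as np

def ground_text_to_mask(image: np.ndarray, text: str) -> np.ndarray:
    """Hypothetical text-grounding stage: in a real system this would be
    an open-vocabulary segmenter returning a binary mask for the object
    the text refers to. Here it is a fixed stub region."""
    h, w = image.shape[:2]
    mask = np.zeros((h, w), dtype=bool)
    mask[h // 4 : 3 * h // 4, w // 4 : 3 * w // 4] = True
    return mask

def reconstruct_3d(image: np.ndarray, mask: np.ndarray) -> dict:
    """Hypothetical stand-in for SAM3D's single-image reconstruction,
    consuming the RGB view plus the selected object's mask."""
    return {"masked_rgb": image * mask[..., None]}  # placeholder "asset"

def ref_sam3d(image: np.ndarray, referring_text: str) -> dict:
    mask = ground_text_to_mask(image, referring_text)  # text as a prior
    return reconstruct_3d(image, mask)                 # reconstruct it

rgb = np.random.rand(256, 256, 3).astype(np.float32)
asset = ref_sam3d(rgb, "the red mug on the table")
print(asset["masked_rgb"].shape)  # (256, 256, 3)
```

The point of the sketch is the interface: text enters only as a high-level prior that selects the object, while the reconstruction stage itself is left unchanged, which matches the paper's framing of Ref-SAM3D as a simple extension.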
Related papers
- MV-SAM: Multi-view Promptable Segmentation using Pointmap Guidance [79.57732829495843]
We introduce MV-SAM, a framework for multi-view segmentation that achieves 3D consistency using pointmaps. MV-SAM lifts images and prompts into 3D space, eliminating the need for explicit 3D networks or annotated 3D data.
arXiv Detail & Related papers (2026-01-25T15:00:37Z)
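MV-SAM's central device, per the summary above, is the pointmap: a per-pixel 3D coordinate map that lets 2D prompts be lifted into a shared 3D space. The generic sketch below (not MV-SAM's actual interface; the camera convention and all names are illustrative) shows why this removes the need for explicit 3D networks: lifting a click is an array lookup, and reusing it in another view is an ordinary pinhole projection.

```python
import numpy as np

def lift_click_to_3d(pointmap: np.ndarray, u: int, v: int) -> np.ndarray:
    """A pointmap is an (H, W, 3) array of per-pixel 3D points, so
    lifting a 2D click (u, v) into 3D is just an indexing operation."""
    return pointmap[v, u]

def project_to_view(p3d: np.ndarray, K: np.ndarray,
                    R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Pinhole re-projection of a world point into another view,
    assuming the convention x_cam = R @ x_world + t."""
    x_img = K @ (R @ p3d + t)
    return x_img[:2] / x_img[2]  # pixel coordinates (u, v)

# Lift a click in view A, then reuse it as a prompt in view B.
pointmap_a = np.random.rand(480, 640, 3)             # toy pointmap
p3d = lift_click_to_3d(pointmap_a, u=320, v=240)
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])                      # toy intrinsics
R, t = np.eye(3), np.array([0.1, 0.0, 0.5])          # toy relative pose
print(project_to_view(p3d, K, R, t))
```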
- SAM 3D for 3D Object Reconstruction from Remote Sensing Images [3.893451853752809]
This paper presents the first systematic evaluation of SAM 3D, a general-purpose image-to-3D foundation model. Experimental results demonstrate that SAM 3D produces more coherent roof geometry and sharper boundaries compared to TRELLIS.
arXiv Detail & Related papers (2025-12-27T03:47:39Z)
- SAM 3D: 3Dfy Anything in Images [99.1053358868456]
We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.
arXiv Detail & Related papers (2025-11-20T18:31:46Z)
- SR3D: Unleashing Single-view 3D Reconstruction for Transparent and Specular Object Grasping [7.222966501323922]
We propose a training-free framework, SR3D, that enables robotic grasping of transparent and specular objects from a single-view observation. Specifically, given single-view RGB and depth images, SR3D first uses external visual models to generate a reconstructed 3D object mesh. The key idea is then to determine the 3D object's pose and scale in order to accurately localize the reconstructed object back into its original depth-corrupted 3D scene.
arXiv Detail & Related papers (2025-05-30T07:38:46Z)
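SR3D's key step per the summary, recovering the pose and scale that place the reconstructed mesh back into the observed depth scene, is in spirit a similarity-transform alignment between corresponding 3D points. The closed-form Umeyama-style solver below is a generic sketch of that sub-problem under the assumption of known correspondences; it is not SR3D's actual algorithm.

```python
import numpy as np

def similarity_align(src: np.ndarray, dst: np.ndarray):
    """Closed-form (Umeyama) estimate of scale s, rotation R, and
    translation t minimizing ||s * R @ src_i + t - dst_i||^2 over
    corresponding 3D points src, dst of shape (N, 3)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)        # cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                      # enforce a proper rotation
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_d - s * R @ mu_s
    return s, R, t

# Synthetic check: recover a known scale/rotation/translation.
rng = np.random.default_rng(0)
src = rng.random((100, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
dst = 2.0 * src @ R_true.T + np.array([1.0, -2.0, 0.5])
s, R, t = similarity_align(src, dst)
print(round(s, 3))  # ~2.0
```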
- Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images [66.77399370856462]
Amodal3R is a conditional 3D generative model designed to reconstruct 3D objects from partial observations. It learns to recover full 3D objects even in the presence of occlusions in real scenes. It substantially outperforms existing methods that independently perform 2D amodal completion followed by 3D reconstruction.
arXiv Detail & Related papers (2025-03-17T17:59:01Z)
- ShapeLLM: Universal 3D Object Understanding for Embodied Interaction [37.0434133128805]
This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction.
ShapeLLM is built upon an improved 3D encoder by extending ReCon to ReCon++.
ShapeLLM is trained on constructed instruction-following data and tested on our newly human-curated benchmark, 3D MM-Vet.
arXiv Detail & Related papers (2024-02-27T18:57:12Z)
- Anything-3D: Towards Single-view Anything Reconstruction in the Wild [61.090129285205805]
We introduce Anything-3D, a methodical framework that ingeniously combines a series of visual-language models and the Segment-Anything object segmentation model.
Our approach employs a BLIP model to generate textual descriptions, utilizes the Segment-Anything model for the effective extraction of objects of interest, and leverages a text-to-image diffusion model to lift the object into a neural radiance field.
arXiv Detail & Related papers (2023-04-19T16:39:51Z)
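The three-stage composition described above (caption, segment, lift) is easy to express as a pipeline skeleton. The stubs below only mark where BLIP, Segment-Anything, and the diffusion-guided NeRF optimization would plug in; their signatures and return values are assumptions for illustration, not the paper's API.

```python
import numpy as np

def caption(image: np.ndarray) -> str:
    """Stand-in for the BLIP captioning stage."""
    return "a wooden chair"          # hypothetical caption

def segment(image: np.ndarray) -> np.ndarray:
    """Stand-in for Segment-Anything's object extraction."""
    return np.ones(image.shape[:2], dtype=bool)

def lift_to_nerf(image: np.ndarray, mask: np.ndarray, text: str) -> dict:
    """Stand-in for the diffusion-guided lifting stage; a real system
    would optimize a radiance field under text-conditioned guidance."""
    return {"text": text, "object_pixels": int(mask.sum())}

def anything_3d(image: np.ndarray) -> dict:
    text = caption(image)       # 1) describe the image
    mask = segment(image)       # 2) extract the object of interest
    return lift_to_nerf(image, mask, text)  # 3) lift to a NeRF

print(anything_3d(np.zeros((64, 64, 3))))
```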
- MobileBrick: Building LEGO for 3D Reconstruction on Mobile Devices [78.20154723650333]
High-quality 3D ground-truth shapes are critical for 3D object reconstruction evaluation.
We introduce a novel multi-view RGBD dataset captured using a mobile device.
We obtain precise 3D ground-truth shapes without relying on high-end 3D scanners.
arXiv Detail & Related papers (2023-03-03T14:02:50Z)
- Monocular 3D Object Reconstruction with GAN Inversion [122.96094885939146]
MeshInversion is a novel framework to improve the reconstruction of textured 3D meshes.
It exploits the generative prior of a 3D GAN pre-trained for 3D textured mesh synthesis.
Our framework obtains faithful 3D reconstructions with consistent geometry and texture across both observed and unobserved parts.
arXiv Detail & Related papers (2022-07-20T17:47:22Z)
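GAN inversion, the technique this entry names, means freezing a pre-trained generator and optimizing only a latent code so the (differentiably rendered) output matches the observed image. The loop below is a minimal sketch of that idea with toy stand-ins for the generator and renderer; MeshInversion's actual pipeline uses a textured-mesh GAN, a real differentiable renderer, and richer losses.

```python
import torch

# Toy stand-ins: a real setup uses a pre-trained textured-mesh GAN and a
# differentiable renderer. Only the latent code z is optimized.
generator = torch.nn.Linear(64, 3 * 32 * 32)    # latent -> "mesh params"
for p in generator.parameters():
    p.requires_grad_(False)                     # generator stays frozen

def render(mesh_params: torch.Tensor) -> torch.Tensor:
    return mesh_params.view(3, 32, 32)          # fake differentiable render

target = torch.rand(3, 32, 32)                  # the observed 2D image
z = torch.zeros(64, requires_grad=True)         # latent code to invert
opt = torch.optim.Adam([z], lr=1e-2)

for step in range(200):
    opt.zero_grad()
    image = render(generator(z))
    # Image-space reconstruction loss; real methods add perceptual,
    # silhouette, and latent-prior regularization terms on top.
    loss = torch.nn.functional.mse_loss(image, target)
    loss.backward()
    opt.step()

print(float(loss))  # z now approximately reproduces the target view
```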
This list is automatically generated from the titles and abstracts of the papers on this site.