ShapeR: Robust Conditional 3D Shape Generation from Casual Captures
- URL: http://arxiv.org/abs/2601.11514v1
- Date: Fri, 16 Jan 2026 18:51:24 GMT
- Title: ShapeR: Robust Conditional 3D Shape Generation from Casual Captures
- Authors: Yawar Siddiqui, Duncan Frost, Samir Aroudj, Armen Avetisyan, Henry Howard-Jenkins, Daniel DeTone, Pierre Moulon, Qirui Wu, Zhengqin Li, Julian Straub, Richard Newcombe, Jakob Engel,
- Abstract summary: We present ShapeR, a novel approach for conditional 3D object shape generation from casually captured sequences.<n>We leverage off-the-shelf visual-inertial SLAM, 3D detection algorithms, and vision-language models to extract, for each object, a set of sparse SLAM points, posed multi-view images, and machine-generated captions.<n>A rectified flow transformer trained to effectively condition on these modalities then generates high-fidelity metric 3D shapes.
- Score: 16.658667897650506
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in 3D shape generation have achieved impressive results, but most existing methods rely on clean, unoccluded, and well-segmented inputs. Such conditions are rarely met in real-world scenarios. We present ShapeR, a novel approach for conditional 3D object shape generation from casually captured sequences. Given an image sequence, we leverage off-the-shelf visual-inertial SLAM, 3D detection algorithms, and vision-language models to extract, for each object, a set of sparse SLAM points, posed multi-view images, and machine-generated captions. A rectified flow transformer trained to effectively condition on these modalities then generates high-fidelity metric 3D shapes. To ensure robustness to the challenges of casually captured data, we employ a range of techniques including on-the-fly compositional augmentations, a curriculum training scheme spanning object- and scene-level datasets, and strategies to handle background clutter. Additionally, we introduce a new evaluation benchmark comprising 178 in-the-wild objects across 7 real-world scenes with geometry annotations. Experiments show that ShapeR significantly outperforms existing approaches in this challenging setting, achieving an improvement of 2.7x in Chamfer distance compared to state of the art.
Related papers
- HORT: Monocular Hand-held Objects Reconstruction with Transformers [61.36376511119355]
Reconstructing hand-held objects in 3D from monocular images is a significant challenge in computer vision.<n>We propose a transformer-based model to efficiently reconstruct dense 3D point clouds of hand-held objects.<n>Our method achieves state-of-the-art accuracy with much faster inference speed, while generalizing well to in-the-wild images.
arXiv Detail & Related papers (2025-03-27T09:45:09Z) - Robust 3D Shape Reconstruction in Zero-Shot from a Single Image in the Wild [22.82439286651921]
We propose a unified regression model that integrates segmentation and reconstruction, specifically designed for 3D shape reconstruction.<n>We also introduce a scalable data synthesis pipeline that simulates a wide range of variations in objects, occluders, and backgrounds.<n>Our training on our synthetic data enables the proposed model to achieve state-of-the-art zero-shot results on real-world images.
arXiv Detail & Related papers (2024-03-21T16:40:10Z) - ComboVerse: Compositional 3D Assets Creation Using Spatially-Aware Diffusion Guidance [76.7746870349809]
We present ComboVerse, a 3D generation framework that produces high-quality 3D assets with complex compositions by learning to combine multiple models.
Our proposed framework emphasizes spatial alignment of objects, compared with standard score distillation sampling.
arXiv Detail & Related papers (2024-03-19T03:39:43Z) - 3D-SceneDreamer: Text-Driven 3D-Consistent Scene Generation [51.64796781728106]
We propose a generative refinement network to synthesize new contents with higher quality by exploiting the natural image prior to 2D diffusion model and the global 3D information of the current scene.
Our approach supports wide variety of scene generation and arbitrary camera trajectories with improved visual quality and 3D consistency.
arXiv Detail & Related papers (2024-03-14T14:31:22Z) - LASA: Instance Reconstruction from Real Scans using A Large-scale
Aligned Shape Annotation Dataset [17.530432165466507]
We present a novel Cross-Modal Shape Reconstruction (DisCo) method and an Occupancy-Guided 3D Object Detection (OccGOD) method.
Our methods achieve state-of-the-art performance in both instance-level scene reconstruction and 3D object detection tasks.
arXiv Detail & Related papers (2023-12-19T18:50:10Z) - Robust 3D Tracking with Quality-Aware Shape Completion [67.9748164949519]
We propose a synthetic target representation composed of dense and complete point clouds depicting the target shape precisely by shape completion for robust 3D tracking.
Specifically, we design a voxelized 3D tracking framework with shape completion, in which we propose a quality-aware shape completion mechanism to alleviate the adverse effect of noisy historical predictions.
arXiv Detail & Related papers (2023-12-17T04:50:24Z) - FrozenRecon: Pose-free 3D Scene Reconstruction with Frozen Depth Models [67.96827539201071]
We propose a novel test-time optimization approach for 3D scene reconstruction.
Our method achieves state-of-the-art cross-dataset reconstruction on five zero-shot testing datasets.
arXiv Detail & Related papers (2023-08-10T17:55:02Z) - 3D Surface Reconstruction in the Wild by Deforming Shape Priors from
Synthetic Data [24.97027425606138]
Reconstructing the underlying 3D surface of an object from a single image is a challenging problem.
We present a new method for joint category-specific 3D reconstruction and object pose estimation from a single image.
Our approach achieves state-of-the-art reconstruction performance across several real-world datasets.
arXiv Detail & Related papers (2023-02-24T20:37:27Z) - Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled
Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2d detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.