Related papers: WALT3D: Generating Realistic Training Data from Time-Lapse Imagery for Reconstructing Dynamic Objects under Occlusion

WALT3D: Generating Realistic Training Data from Time-Lapse Imagery for Reconstructing Dynamic Objects under Occlusion

URL: http://arxiv.org/abs/2403.19022v2
Date: Mon, 1 Apr 2024 18:09:49 GMT
Title: WALT3D: Generating Realistic Training Data from Time-Lapse Imagery for Reconstructing Dynamic Objects under Occlusion
Authors: Khiem Vuong, N. Dinesh Reddy, Robert Tamburo, Srinivasa G. Narasimhan,
Abstract summary: We introduce a novel framework for automatically generating a large, realistic dataset of dynamic objects under occlusions using time-lapse imagery. By leveraging off-the-shelf 2D (bounding box, segmentation, keypoint) and 3D (pose, shape) predictions as pseudo-groundtruth, unoccluded 3D objects are identified automatically and composited into the background in a clip-art style. Our method demonstrates significant improvements in both 2D and 3D reconstruction, particularly in scenarios with heavily occluded objects like vehicles and people in urban scenes.
Score: 20.014258835647716
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Current methods for 2D and 3D object understanding struggle with severe occlusions in busy urban environments, partly due to the lack of large-scale labeled ground-truth annotations for learning occlusion. In this work, we introduce a novel framework for automatically generating a large, realistic dataset of dynamic objects under occlusions using freely available time-lapse imagery. By leveraging off-the-shelf 2D (bounding box, segmentation, keypoint) and 3D (pose, shape) predictions as pseudo-groundtruth, unoccluded 3D objects are identified automatically and composited into the background in a clip-art style, ensuring realistic appearances and physically accurate occlusion configurations. The resulting clip-art image with pseudo-groundtruth enables efficient training of object reconstruction methods that are robust to occlusions. Our method demonstrates significant improvements in both 2D and 3D reconstruction, particularly in scenarios with heavily occluded objects like vehicles and people in urban scenes.

Related papers

3D-RE-GEN: 3D Reconstruction of Indoor Scenes with a Generative Framework [7.4570191712029965]
3D-RE-GEN is a compositional framework that reconstructs a single image into textured 3D objects and a background.<n>Our pipeline integrates models for asset detection, reconstruction, and placement, pushing certain models beyond their originally intended domains.<n>Unlike current methods, 3D-RE-GEN generates a comprehensive background that spatially constrains objects during optimization.
arXiv Detail & Related papers (2025-12-19T11:20:52Z)
R3D2: Realistic 3D Asset Insertion via Diffusion for Autonomous Driving Simulation [78.26308457952636]
This paper introduces R3D2, a lightweight, one-step diffusion model designed to overcome limitations in autonomous driving simulation.<n>It enables realistic insertion of complete 3D assets into existing scenes by generating plausible rendering effects-such as shadows and consistent lighting-in real time.<n>We show that R3D2 significantly enhances the realism of inserted assets, enabling use-cases like text-to-3D asset insertion and cross-scene/dataset object transfer.
arXiv Detail & Related papers (2025-06-09T14:50:19Z)
HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation [50.206100327643284]
HiScene is a novel hierarchical framework that bridges the gap between 2D image generation and 3D object generation. We generate 3D content that aligns with 2D representations while maintaining compositional structure.
arXiv Detail & Related papers (2025-04-17T16:33:39Z)
Architect: Generating Vivid and Interactive 3D Scenes with Hierarchical 2D Inpainting [47.014044892025346]
Architect is a generative framework that creates complex and realistic 3D embodied environments leveraging diffusion-based 2D image inpainting. Our pipeline is further extended to a hierarchical and iterative inpainting process to continuously generate placement of large furniture and small objects to enrich the scene.
arXiv Detail & Related papers (2024-11-14T22:15:48Z)
Dynamic Scene Understanding through Object-Centric Voxelization and Neural Rendering [57.895846642868904]
We present a 3D generative model named DynaVol-S for dynamic scenes that enables object-centric learning. voxelization infers per-object occupancy probabilities at individual spatial locations. Our approach integrates 2D semantic features to create 3D semantic grids, representing the scene through multiple disentangled voxel grids.
arXiv Detail & Related papers (2024-07-30T15:33:58Z)
ARTIC3D: Learning Robust Articulated 3D Shapes from Noisy Web Image Collections [71.46546520120162]
Estimating 3D articulated shapes like animal bodies from monocular images is inherently challenging. We propose ARTIC3D, a self-supervised framework to reconstruct per-instance 3D shapes from a sparse image collection in-the-wild. We produce realistic animations by fine-tuning the rendered shape and texture under rigid part transformations.
arXiv Detail & Related papers (2023-06-07T17:47:50Z)
SSR-2D: Semantic 3D Scene Reconstruction from 2D Images [54.46126685716471]
In this work, we explore a central 3D scene modeling task, namely, semantic scene reconstruction without using any 3D annotations. The key idea of our approach is to design a trainable model that employs both incomplete 3D reconstructions and their corresponding source RGB-D images. Our method achieves the state-of-the-art performance of semantic scene completion on two large-scale benchmark datasets MatterPort3D and ScanNet.
arXiv Detail & Related papers (2023-02-07T17:47:52Z)
RiCS: A 2D Self-Occlusion Map for Harmonizing Volumetric Objects [68.85305626324694]
Ray-marching in Camera Space (RiCS) is a new method to represent the self-occlusions of foreground objects in 3D into a 2D self-occlusion map. We show that our representation map not only allows us to enhance the image quality but also to model temporally coherent complex shadow effects.
arXiv Detail & Related papers (2022-05-14T05:35:35Z)
DensePose 3D: Lifting Canonical Surface Maps of Articulated Objects to the Third Dimension [71.71234436165255]
We contribute DensePose 3D, a method that can learn such reconstructions in a weakly supervised fashion from 2D image annotations only. Because it does not require 3D scans, DensePose 3D can be used for learning a wide range of articulated categories such as different animal species. We show significant improvements compared to state-of-the-art non-rigid structure-from-motion baselines on both synthetic and real data on categories of humans and animals.
arXiv Detail & Related papers (2021-08-31T18:33:55Z)
Learning Canonical 3D Object Representation for Fine-Grained Recognition [77.33501114409036]
We propose a novel framework for fine-grained object recognition that learns to recover object variation in 3D space from a single image. We represent an object as a composition of 3D shape and its appearance, while eliminating the effect of camera viewpoint. By incorporating 3D shape and appearance jointly in a deep representation, our method learns the discriminative representation of the object.
arXiv Detail & Related papers (2021-08-10T12:19:34Z)
Fully Understanding Generic Objects: Modeling, Segmentation, and Reconstruction [33.95791350070165]
Inferring 3D structure of a generic object from a 2D image is a long-standing objective of computer vision. We take an alternative approach with semi-supervised learning. That is, for a 2D image of a generic object, we decompose it into latent representations of category, shape and albedo. We show that the complete shape and albedo modeling enables us to leverage real 2D images in both modeling and model fitting.
arXiv Detail & Related papers (2021-04-02T02:39:29Z)
SDF-SRN: Learning Signed Distance 3D Object Reconstruction from Static Images [44.78174845839193]
Recent efforts have turned to learning 3D reconstruction without 3D supervision from RGB images with annotated 2D silhouettes. These techniques still require multi-view annotations of the same object instance during training. We propose SDF-SRN, an approach that requires only a single view of objects at training time.
arXiv Detail & Related papers (2020-10-20T17:59:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.