Related papers: Object-Scene-Camera Decomposition and Recomposition for Data-Efficient Monocular 3D Object Detection

Object-Scene-Camera Decomposition and Recomposition for Data-Efficient Monocular 3D Object Detection

URL: http://arxiv.org/abs/2602.20627v1
Date: Tue, 24 Feb 2026 07:22:58 GMT
Title: Object-Scene-Camera Decomposition and Recomposition for Data-Efficient Monocular 3D Object Detection
Authors: Zhaonian Kuang, Rui Ding, Meng Yang, Xinhu Zheng, Gang Hua,
Abstract summary: Monocular 3D object detection (M3OD) is intrinsically ill-posed, hence training a high-performance deep learning based M3OD model requires a humongous amount of labeled data.<n>We propose an online object-scene-camera decomposition and recomposition data manipulation scheme to more efficiently exploit the training data.
Score: 20.94974587800414
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Monocular 3D object detection (M3OD) is intrinsically ill-posed, hence training a high-performance deep learning based M3OD model requires a humongous amount of labeled data with complicated visual variation from diverse scenes, variety of objects and camera poses.However, we observe that, due to strong human bias, the three independent entities, i.e., object, scene, and camera pose, are always tightly entangled when an image is captured to construct training data. More specifically, specific 3D objects are always captured in particular scenes with fixed camera poses, and hence lacks necessary diversity. Such tight entanglement induces the challenging issues of insufficient utilization and overfitting to uniform training data. To mitigate this, we propose an online object-scene-camera decomposition and recomposition data manipulation scheme to more efficiently exploit the training data. We first fully decompose training images into textured 3D object point models and background scenes in an efficient computation and storage manner. We then continuously recompose new training images in each epoch by inserting the 3D objects into the freespace of the background scenes, and rendering them with perturbed camera poses from textured 3D point representation. In this way, the refreshed training data in all epochs can cover the full spectrum of independent object, scene, and camera pose combinations. This scheme can serve as a plug-and-play component to boost M3OD models, working flexibly with both fully and sparsely supervised settings. In the sparsely-supervised setting, objects closest to the ego-camera for all instances are sparsely annotated. We then can flexibly increase the annotated objects to control annotation cost. For validation, our method is widely applied to five representative M3OD models and evaluated on both the KITTI and the more complicated Waymo datasets.

Related papers

MonoSOWA: Scalable monocular 3D Object detector Without human Annotations [0.0]
We present a novel method to train a 3D object detector from a single RGB camera without domain-specific human annotations.<n>The method is evaluated on three public datasets, where despite using no human labels, it outperforms prior work by a significant margin.
arXiv Detail & Related papers (2025-01-16T11:35:22Z)
SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views [36.02533658048349]
We propose a novel method, SpaRP, to reconstruct a 3D textured mesh and estimate the relative camera poses for sparse-view images. SpaRP distills knowledge from 2D diffusion models and finetunes them to implicitly deduce the 3D spatial relationships between the sparse views. It requires only about 20 seconds to produce a textured mesh and camera poses for the input views.
arXiv Detail & Related papers (2024-08-19T17:53:10Z)
Unsupervised Learning of Category-Level 3D Pose from Object-Centric Videos [15.532504015622159]
Category-level 3D pose estimation is a fundamentally important problem in computer vision and robotics. We tackle the problem of learning to estimate the category-level 3D pose only from casually taken object-centric videos.
arXiv Detail & Related papers (2024-07-05T09:43:05Z)
DO3D: Self-supervised Learning of Decomposed Object-aware 3D Motion and Depth from Monocular Videos [76.01906393673897]
We propose a self-supervised method to jointly learn 3D motion and depth from monocular videos. Our system contains a depth estimation module to predict depth, and a new decomposed object-wise 3D motion (DO3D) estimation module to predict ego-motion and 3D object motion. Our model delivers superior performance in all evaluated settings.
arXiv Detail & Related papers (2024-03-09T12:22:46Z)
PF-LRM: Pose-Free Large Reconstruction Model for Joint Pose and Shape Prediction [77.89935657608926]
We propose a Pose-Free Large Reconstruction Model (PF-LRM) for reconstructing a 3D object from a few unposed images. PF-LRM simultaneously estimates the relative camera poses in 1.3 seconds on a single A100 GPU.
arXiv Detail & Related papers (2023-11-20T18:57:55Z)
AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core. The 3D autodecoder framework embeds properties learned from the target dataset in the latent space. We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
FlowCam: Training Generalizable 3D Radiance Fields without Camera Poses via Pixel-Aligned Scene Flow [26.528667940013598]
Reconstruction of 3D neural fields from posed images has emerged as a promising method for self-supervised representation learning. Key challenge preventing the deployment of these 3D scene learners on large-scale video data is their dependence on precise camera poses from structure-from-motion. We propose a method that jointly reconstructs camera poses and 3D neural scene representations online and in a single forward pass.
arXiv Detail & Related papers (2023-05-31T20:58:46Z)
3D Reconstruction of Objects in Hands without Real World 3D Supervision [12.70221786947807]
We propose modules to leverage 3D supervision to scale up the learning of models for reconstructing hand-held objects. Specifically, we extract multiview 2D mask supervision from videos and 3D shape priors from shape collections. We use these indirect 3D cues to train occupancy networks that predict the 3D shape of objects from a single RGB image.
arXiv Detail & Related papers (2023-05-04T17:56:48Z)
D3D-HOI: Dynamic 3D Human-Object Interactions from Videos [49.38319295373466]
We introduce D3D-HOI: a dataset of monocular videos with ground truth annotations of 3D object pose, shape and part motion during human-object interactions. Our dataset consists of several common articulated objects captured from diverse real-world scenes and camera viewpoints. We leverage the estimated 3D human pose for more accurate inference of the object spatial layout and dynamics.
arXiv Detail & Related papers (2021-08-19T00:49:01Z)
MetaPose: Fast 3D Pose from Multiple Views without 3D Supervision [72.5863451123577]
We show how to train a neural model that can perform accurate 3D pose and camera estimation. Our method outperforms both classical bundle adjustment and weakly-supervised monocular 3D baselines.
arXiv Detail & Related papers (2021-08-10T18:39:56Z)
CanonPose: Self-Supervised Monocular 3D Human Pose Estimation in the Wild [31.334715988245748]
We propose a self-supervised approach that learns a single image 3D pose estimator from unlabeled multi-view data. In contrast to most existing methods, we do not require calibrated cameras and can therefore learn from moving cameras. Key to the success are new, unbiased reconstruction objectives that mix information across views and training samples.
arXiv Detail & Related papers (2020-11-30T10:42:27Z)
Weakly Supervised Learning of Multi-Object 3D Scene Decompositions Using Deep Shape Priors [69.02332607843569]
PriSMONet is a novel approach for learning Multi-Object 3D scene decomposition and representations from single images. A recurrent encoder regresses a latent representation of 3D shape, pose and texture of each object from an input RGB image. We evaluate the accuracy of our model in inferring 3D scene layout, demonstrate its generative capabilities, assess its generalization to real images, and point out benefits of the learned representation.
arXiv Detail & Related papers (2020-10-08T14:49:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.