MonoCInIS: Camera Independent Monocular 3D Object Detection using
Instance Segmentation
- URL: http://arxiv.org/abs/2110.00464v1
- Date: Fri, 1 Oct 2021 14:56:37 GMT
- Title: MonoCInIS: Camera Independent Monocular 3D Object Detection using
Instance Segmentation
- Authors: Jonas Heylen, Mark De Wolf, Bruno Dawagne, Marc Proesmans, Luc Van
Gool, Wim Abbeloos, Hazem Abdelkawy, Daniel Olmeda Reino
- Abstract summary: We show that more data does not automatically guarantee better performance; rather, methods need a degree of 'camera independence' in order to benefit from large and heterogeneous training data.
- Score: 55.96577490779591
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Monocular 3D object detection has recently shown promising results;
however, challenging problems remain. One of those is the lack of invariance to
different camera intrinsic parameters, which can be observed across different
3D object datasets. Little effort has been made to exploit the combination of
heterogeneous 3D object datasets. In contrast to general intuition, we show
that more data does not automatically guarantee a better performance, but
rather, methods need to have a degree of 'camera independence' in order to
benefit from large and heterogeneous training data. In this paper we propose a
category-level pose estimation method based on instance segmentation, using
camera independent geometric reasoning to cope with the varying camera
viewpoints and intrinsics of different datasets. Every pixel of an instance
predicts the object dimensions, the 3D object reference points projected in 2D
image space and, optionally, the local viewing angle. Camera intrinsics are
only used outside of the learned network to lift the predicted 2D reference
points to 3D. We surpass camera independent methods on the challenging KITTI3D
benchmark and show the key benefits compared to camera dependent methods.
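
To make the geometry step concrete, here is a minimal sketch (Python with NumPy and OpenCV) of one plausible realization of the lifting described above: per-pixel votes are first aggregated over the instance mask, then a PnP solve using the intrinsics K recovers the 3D pose. The nine-point layout (eight box corners plus the center), the median aggregation, and all function names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
import cv2

def lift_instance_to_3d(mask, kp2d_map, dims_map, K):
    """Lift one instance's per-pixel predictions to a 3D pose (sketch).

    mask:     (H, W) boolean instance mask
    kp2d_map: (H, W, 9, 2) per-pixel 2D projections of 9 reference
              points (assumed: 8 box corners + center)
    dims_map: (H, W, 3) per-pixel object dimensions (h, w, l)
    K:        (3, 3) camera intrinsics -- used only here, never by the net
    """
    # Aggregate the dense per-pixel votes over the instance mask.
    kp2d = np.median(kp2d_map[mask], axis=0)        # (9, 2)
    h, w, l = np.median(dims_map[mask], axis=0)

    # Matching 3D reference points in the object frame: the 8 corners of
    # the dimension-sized box plus its center (the corner ordering must
    # match the network's output convention -- an assumption here).
    corners = np.array([[sx * w / 2, sy * h / 2, sz * l / 2]
                        for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    obj_pts = np.vstack([corners, np.zeros((1, 3))])

    # Camera intrinsics enter only in this geometric step: a PnP solve
    # recovers rotation and translation in camera coordinates.
    ok, rvec, tvec = cv2.solvePnP(obj_pts.astype(np.float64),
                                  kp2d.astype(np.float64), K, None)
    assert ok, "PnP failed"
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec.ravel(), (h, w, l)
```

Since the network itself never consumes K, the same weights can in principle be reused across cameras with different intrinsics; only this post-hoc lifting step changes.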
Related papers
- Unsupervised Learning of Category-Level 3D Pose from Object-Centric Videos [15.532504015622159]
Category-level 3D pose estimation is a fundamentally important problem in computer vision and robotics.
We tackle the problem of learning to estimate the category-level 3D pose only from casually taken object-centric videos.
arXiv Detail & Related papers (2024-07-05T09:43:05Z)
- SimPB: A Single Model for 2D and 3D Object Detection from Multiple Cameras [3.648972014796591]
We present a single model termed SimPB, which simultaneously detects 2D objects in the perspective view and 3D objects in the BEV space from multiple cameras.
Its hybrid decoder consists of several multi-view 2D decoder layers and several 3D decoder layers, each designed for its respective detection task.
arXiv Detail & Related papers (2024-03-15T14:39:39Z)
- Explicit3D: Graph Network with Spatial Inference for Single Image 3D Object Detection [35.85544715234846]
We propose a dynamic sparse graph pipeline named Explicit3D based on object geometry and semantics features.
Our experimental results on the SUN RGB-D dataset demonstrate that our Explicit3D achieves a better performance balance than the state of the art.
arXiv Detail & Related papers (2023-02-13T16:19:54Z)
- Monocular 3D Object Detection with Depth from Motion [74.29588921594853]
We take advantage of camera ego-motion for accurate object depth estimation and detection.
Our framework, named Depth from Motion (DfM), then uses the established geometry to lift 2D image features to the 3D space and detects 3D objects thereon.
Our framework outperforms state-of-the-art methods by a large margin on the KITTI benchmark.
arXiv Detail & Related papers (2022-07-26T15:48:46Z)
- Towards Model Generalization for Monocular 3D Object Detection [57.25828870799331]
We present an effective unified camera-generalized paradigm (CGP) for Mono3D object detection.
We also propose the 2D-3D geometry-consistent object scaling strategy (GCOS) to bridge the gap via instance-level augmentation.
Our method called DGMono3D achieves remarkable performance on all evaluated datasets and surpasses the SoTA unsupervised domain adaptation scheme.
arXiv Detail & Related papers (2022-05-23T23:05:07Z)
- Image-to-Lidar Self-Supervised Distillation for Autonomous Driving Data [80.14669385741202]
We propose a self-supervised pre-training method for 3D perception models tailored to autonomous driving data.
We leverage the availability of synchronized and calibrated image and Lidar sensors in autonomous driving setups.
Our method requires neither point cloud nor image annotations.
arXiv Detail & Related papers (2022-03-30T12:40:30Z)
- DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries [43.02373021724797]
We introduce a framework for multi-camera 3D object detection.
Our method manipulates predictions directly in 3D space, linking them to the images through 3D-to-2D queries (see the sketch after this list).
We achieve state-of-the-art performance on the nuScenes autonomous driving benchmark.
arXiv Detail & Related papers (2021-10-13T17:59:35Z)
- MetaPose: Fast 3D Pose from Multiple Views without 3D Supervision [72.5863451123577]
We show how to train a neural model that can perform accurate 3D pose and camera estimation.
Our method outperforms both classical bundle adjustment and weakly-supervised monocular 3D baselines.
arXiv Detail & Related papers (2021-08-10T18:39:56Z)
- YCB-M: A Multi-Camera RGB-D Dataset for Object Recognition and 6DoF Pose Estimation [2.9972063833424216]
We present a dataset of 32 scenes that have been captured by 7 different 3D cameras, totaling 49,294 frames.
This allows evaluating the sensitivity of pose estimation algorithms to the specifics of the camera used.
arXiv Detail & Related papers (2020-04-24T11:14:04Z)
- Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2D detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)
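
As referenced in the DETR3D entry above, here is a rough, hypothetical sketch of the 3D-to-2D query idea in Python/PyTorch: each object query carries a 3D reference point, which is projected into every camera view with that camera's projection matrix; image features are then bilinearly sampled at the projected locations and averaged back into the query. Shapes, names, and the averaging rule are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn.functional as F

def sample_multiview_features(ref_pts, feats, proj_mats):
    """Gather per-query features via 3D-to-2D projection (sketch).

    ref_pts:   (Q, 3) one 3D reference point per object query
    feats:     (N, C, H, W) feature maps from N cameras; proj_mats are
               assumed already rescaled to feature-map resolution
    proj_mats: (N, 3, 4) camera projection matrices (K @ [R|t])
    Returns:   (Q, C) features averaged over the cameras seeing each point
    """
    N, C, H, W = feats.shape
    Q = ref_pts.shape[0]

    # Project every 3D reference point into every camera view.
    homo = torch.cat([ref_pts, ref_pts.new_ones(Q, 1)], dim=1)     # (Q, 4)
    cam = torch.einsum('nij,qj->nqi', proj_mats, homo)             # (N, Q, 3)
    z = cam[..., 2:3].clamp(min=1e-5)
    uv = cam[..., :2] / z                                          # (N, Q, 2)

    # Normalize to [-1, 1] for grid_sample; mark in-frustum projections.
    grid = torch.stack([uv[..., 0] / (W - 1),
                        uv[..., 1] / (H - 1)], dim=-1) * 2 - 1
    valid = ((grid.abs() <= 1).all(-1) & (cam[..., 2] > 0)).float()  # (N, Q)

    # Bilinearly sample each camera's features at the projected points.
    sampled = F.grid_sample(feats, grid.unsqueeze(1),
                            align_corners=True)                    # (N, C, 1, Q)
    sampled = sampled.squeeze(2).permute(0, 2, 1)                  # (N, Q, C)

    # Average only the views in which the point lands inside the image.
    w = valid.unsqueeze(-1)
    return (sampled * w).sum(0) / w.sum(0).clamp(min=1.0)
```

Keeping the object hypotheses in 3D and projecting outward to the images is one way to aggregate evidence consistently across cameras with different extrinsics.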
This list is automatically generated from the titles and abstracts of the papers in this site.