CAPE: Camera View Position Embedding for Multi-View 3D Object Detection
- URL: http://arxiv.org/abs/2303.10209v1
- Date: Fri, 17 Mar 2023 18:59:54 GMT
- Title: CAPE: Camera View Position Embedding for Multi-View 3D Object Detection
- Authors: Kaixin Xiong, Shi Gong, Xiaoqing Ye, Xiao Tan, Ji Wan, Errui Ding,
Jingdong Wang, Xiang Bai
- Abstract summary: Current query-based methods rely on global 3D position embeddings to learn the geometric correspondence between images and 3D space.
We propose a novel method based on CAmera view Position Embedding, called CAPE.
CAPE achieves state-of-the-art performance (61.0% NDS and 52.5% mAP) among all LiDAR-free methods on the nuScenes dataset.
- Score: 100.02565745233247
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we address the problem of detecting 3D objects from multi-view
images. Current query-based methods rely on global 3D position embeddings (PE)
to learn the geometric correspondence between images and 3D space. We claim
that directly interacting 2D image features with global 3D PE could increase
the difficulty of learning view transformation due to the variation of camera
extrinsics. Thus we propose a novel method based on CAmera view Position
Embedding, called CAPE. We form the 3D position embeddings under the local
camera-view coordinate system instead of the global coordinate system, such
that 3D position embedding is free of encoding camera extrinsic parameters.
Furthermore, we extend our CAPE to temporal modeling by exploiting the object
queries of previous frames and encoding the ego-motion for boosting 3D object
detection. CAPE achieves state-of-the-art performance (61.0% NDS and 52.5% mAP)
among all LiDAR-free methods on the nuScenes dataset. Code and models are
available on Paddle3D (https://github.com/PaddlePaddle/Paddle3D) and as a
PyTorch implementation (https://github.com/kaixinbear/CAPE).
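To make the abstract's central idea concrete, here is a minimal NumPy sketch of the difference between camera-view and global 3D points. It is not taken from the CAPE codebase: the function names (`camera_view_points`, `to_global`), the intrinsics values, and the grid stride are all hypothetical, and the learned embedding network that would consume these points is omitted. Under those assumptions, the sketch only shows that points built in the camera's own frame depend on the intrinsics alone, while global points also depend on the extrinsics.

```python
import numpy as np

def camera_view_points(K, image_hw, depths, stride=16):
    """Back-project a coarse pixel grid at several depths into the camera frame.

    Only the intrinsic matrix K is needed; no extrinsics are involved.
    Returns an array of shape (H', W', D, 3).
    """
    h, w = image_hw
    us = np.arange(stride / 2, w, stride)   # pixel centers along x
    vs = np.arange(stride / 2, h, stride)   # pixel centers along y
    u, v, d = np.meshgrid(us, vs, depths, indexing="xy")
    # Homogeneous pixel coordinates scaled by depth: (u*d, v*d, d).
    pix = np.stack([u * d, v * d, d], axis=-1)
    return pix @ np.linalg.inv(K).T

def to_global(pts_cam, cam_to_global):
    """Map camera-frame points to the global frame with a 4x4 extrinsic matrix."""
    ones = np.ones_like(pts_cam[..., :1])
    homo = np.concatenate([pts_cam, ones], axis=-1)
    return (homo @ cam_to_global.T)[..., :3]

# Hypothetical intrinsics shared by two cameras (assumed values).
K = np.array([[1000.0, 0.0, 800.0],
              [0.0, 1000.0, 450.0],
              [0.0, 0.0, 1.0]])
depths = np.array([5.0, 10.0, 20.0])
pts_cam = camera_view_points(K, (900, 1600), depths, stride=100)

# Two different camera poses (assumed): identity, and a 1.5 m shift along x.
pose_a = np.eye(4)
pose_b = np.eye(4)
pose_b[0, 3] = 1.5

# The camera-view points are identical for both cameras, whereas the global
# points differ with the pose, so a global PE must absorb that variation.
pts_global_a = to_global(pts_cam, pose_a)
pts_global_b = to_global(pts_cam, pose_b)
```

In this toy setup the same `pts_cam` array serves both cameras, which is the property the abstract describes as being "free of encoding camera extrinsic parameters".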
Related papers
- Generating 3D-Consistent Videos from Unposed Internet Photos [68.944029293283]
We train a scalable, 3D-aware video model without any 3D annotations such as camera parameters.
Our results suggest that we can scale up scene-level 3D learning using only 2D data such as videos and multiview internet photos.
arXiv Detail & Related papers (2024-11-20T18:58:31Z)
- Mitigating Perspective Distortion-induced Shape Ambiguity in Image Crops [17.074716363691294]
Models for predicting 3D from a single image often work with crops around the object of interest and ignore the location of the object in the camera's field of view.
We propose Intrinsics-Aware Positional Encoding (KPE), which incorporates information about the location of crops in the image and camera intrinsics.
Experiments on three popular 3D-from-a-single-image benchmarks: depth prediction on NYU, 3D object detection on KITTI & nuScenes, and predicting 3D of articulated objects on ARCTIC, show the benefits of KPE.
arXiv Detail & Related papers (2023-12-11T18:28:55Z)
- EP2P-Loc: End-to-End 3D Point to 2D Pixel Localization for Large-Scale Visual Localization [44.05930316729542]
We propose EP2P-Loc, a novel large-scale visual localization method for 3D point clouds.
To increase the number of inliers, we propose a simple algorithm to remove invisible 3D points in the image.
For the first time in this task, we employ a differentiable PnP for end-to-end training.
arXiv Detail & Related papers (2023-09-14T07:06:36Z)
- Viewpoint Equivariance for Multi-View 3D Object Detection [35.4090127133834]
State-of-the-art methods focus on reasoning and decoding object bounding boxes from multi-view camera input.
We introduce VEDet, a novel 3D object detection framework that exploits 3D multi-view geometry.
arXiv Detail & Related papers (2023-03-25T19:56:41Z)
- SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving [98.74706005223685]
3D scene understanding plays a vital role in vision-based autonomous driving.
We propose SurroundOcc, a method to predict 3D occupancy from multi-camera images.
arXiv Detail & Related papers (2023-03-16T17:59:08Z)
- Neural Correspondence Field for Object Pose Estimation [67.96767010122633]
We propose a method for estimating the 6DoF pose of a rigid object with an available 3D model from a single RGB image.
Unlike classical correspondence-based methods which predict 3D object coordinates at pixels of the input image, the proposed method predicts 3D object coordinates at 3D query points sampled in the camera frustum.
arXiv Detail & Related papers (2022-07-30T01:48:23Z)
- DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries [43.02373021724797]
We introduce a framework for multi-camera 3D object detection.
Our method manipulates predictions directly in 3D space.
We achieve state-of-the-art performance on the nuScenes autonomous driving benchmark.
arXiv Detail & Related papers (2021-10-13T17:59:35Z)
- Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2d detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)
- ZoomNet: Part-Aware Adaptive Zooming Neural Network for 3D Object Detection [69.68263074432224]
We present a novel framework named ZoomNet for stereo imagery-based 3D detection.
The pipeline of ZoomNet begins with an ordinary 2D object detection model which is used to obtain pairs of left-right bounding boxes.
To further exploit the abundant texture cues in RGB images for more accurate disparity estimation, we introduce a conceptually straight-forward module -- adaptive zooming.
arXiv Detail & Related papers (2020-03-01T17:18:08Z)