3DPPE: 3D Point Positional Encoding for Multi-Camera 3D Object Detection Transformers
- URL: http://arxiv.org/abs/2211.14710v3
- Date: Fri, 28 Jul 2023 02:31:31 GMT
- Title: 3DPPE: 3D Point Positional Encoding for Multi-Camera 3D Object Detection Transformers
- Authors: Changyong Shu, Jiajun Deng, Fisher Yu and Yifan Liu
- Abstract summary: We introduce 3D point positional encoding, 3DPPE, to the 3D detection Transformer decoder.
Despite the approximation, 3DPPE achieves 46.0 mAP and 51.4 NDS on the competitive nuScenes dataset.
- Score: 35.14784758217257
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based methods have swept the benchmarks on 2D and 3D detection on
images. Because tokenization before the attention mechanism drops the spatial
information, positional encoding becomes critical for those methods. Recent
works found that encodings based on samples of the 3D viewing rays can
significantly improve the quality of multi-camera 3D object detection. We
hypothesize that 3D point locations can provide more information than rays.
Therefore, we introduce 3D point positional encoding, 3DPPE, to the 3D
detection Transformer decoder. Although 3D measurements are not available at
the inference time of monocular 3D object detection, 3DPPE uses predicted depth
to approximate the real point positions. Our hybrid depth module combines direct
and categorical depth to estimate the refined depth of each pixel. Despite the
approximation, 3DPPE achieves 46.0 mAP and 51.4 NDS on the competitive nuScenes
dataset, significantly outperforming encodings based on ray samples. We make
the codes available at https://github.com/drilistbox/3DPPE.
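The abstract's core idea — lift each pixel to a 3D point using predicted depth, then encode that point for the Transformer decoder — can be sketched as follows. This is a minimal illustration, not the paper's code: the function names, the fusion rule in `hybrid_depth`, and the sinusoidal encoding details are all assumptions for illustration.

```python
# Hypothetical sketch of 3D point positional encoding with a hybrid depth
# estimate, loosely following the abstract. All names and formulas here are
# illustrative assumptions, not the released 3DPPE implementation.
import numpy as np

def hybrid_depth(direct_depth, bin_logits, bin_centers):
    """Fuse a directly regressed depth with the expectation of a categorical
    (binned) depth distribution. The equal-weight fusion is an assumption."""
    probs = np.exp(bin_logits - bin_logits.max())
    probs /= probs.sum()
    categorical_depth = float((probs * bin_centers).sum())
    return 0.5 * (direct_depth + categorical_depth)

def unproject(u, v, depth, K):
    """Lift pixel (u, v) with estimated depth to a 3D camera-frame point."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

def sine_encode(point, num_feats=128, temperature=10000.0):
    """Sinusoidal encoding of (x, y, z), concatenated axis by axis."""
    dim_t = temperature ** (2 * (np.arange(num_feats) // 2) / num_feats)
    enc = point[:, None] / dim_t            # shape (3, num_feats)
    enc[:, 0::2] = np.sin(enc[:, 0::2])
    enc[:, 1::2] = np.cos(enc[:, 1::2])
    return enc.reshape(-1)                  # shape (3 * num_feats,)

# Toy usage: one pixel, a 4-bin depth head, a synthetic intrinsic matrix.
K = np.array([[1000.0, 0.0, 800.0],
              [0.0, 1000.0, 450.0],
              [0.0, 0.0, 1.0]])
d = hybrid_depth(12.0, np.array([0.1, 2.0, 0.5, -1.0]),
                 np.array([5.0, 10.0, 20.0, 40.0]))
pe = sine_encode(unproject(640.0, 360.0, d, K))
print(pe.shape)  # (384,)
```

In practice this encoding would be computed per pixel and added to (or fused with) image features before the decoder's cross-attention, replacing ray-sample encodings with point encodings.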
Related papers
- Pixel-Aligned Recurrent Queries for Multi-View 3D Object Detection [16.677107631803327]
PARQ is a multi-view 3D object detector with transformer and pixel-aligned recurrent queries.
It can leverage additional input views without retraining, and can adapt inference compute by changing the number of recurrent iterations.
arXiv Detail & Related papers (2023-10-02T17:58:51Z) - V-DETR: DETR with Vertex Relative Position Encoding for 3D Object Detection [73.37781484123536]
We introduce a highly performant 3D object detector for point clouds using the DETR framework.
To address the limitation, we introduce a novel 3D Vertex Relative Position Encoding (3DV-RPE) method.
We show exceptional results on the challenging ScanNetV2 benchmark.
arXiv Detail & Related papers (2023-08-08T17:14:14Z) - Transformer-based stereo-aware 3D object detection from binocular images [88.8899428219077]
We explore the use of Transformers for binocular 3D object detection.
To achieve this goal, we present TS3D, a Transformer-based 3D object detector.
Our proposed TS3D achieves a 41.29% Moderate Car detection average precision on the KITTI test set and takes 88 ms to detect objects from each binocular image pair.
arXiv Detail & Related papers (2023-04-24T08:29:45Z) - Viewpoint Equivariance for Multi-View 3D Object Detection [35.4090127133834]
State-of-the-art methods focus on reasoning and decoding object bounding boxes from multi-view camera input.
We introduce VEDet, a novel 3D object detection framework that exploits 3D multi-view geometry.
arXiv Detail & Related papers (2023-03-25T19:56:41Z) - CAPE: Camera View Position Embedding for Multi-View 3D Object Detection [100.02565745233247]
Current query-based methods rely on global 3D position embeddings to learn the geometric correspondence between images and 3D space.
We propose a novel method based on CAmera view Position Embedding, called CAPE.
CAPE achieves state-of-the-art performance (61.0% NDS and 52.5% mAP) among all LiDAR-free methods on nuScenes dataset.
arXiv Detail & Related papers (2023-03-17T18:59:54Z) - SparseDet: Towards End-to-End 3D Object Detection [12.3069609175534]
We propose SparseDet for end-to-end 3D object detection from point cloud.
As a new detection paradigm, SparseDet maintains a fixed set of learnable proposals to represent latent candidates.
SparseDet achieves highly competitive detection accuracy while running at an efficient 34.5 FPS.
arXiv Detail & Related papers (2022-06-02T09:49:53Z) - PETR: Position Embedding Transformation for Multi-View 3D Object Detection [80.93664973321168]
PETR encodes the position information of 3D coordinates into image features, producing the 3D position-aware features.
PETR achieves state-of-the-art performance on standard nuScenes dataset and ranks 1st place on the benchmark.
arXiv Detail & Related papers (2022-03-10T20:33:28Z) - DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries [43.02373021724797]
We introduce a framework for multi-camera 3D object detection.
Our method manipulates predictions directly in 3D space.
We achieve state-of-the-art performance on the nuScenes autonomous driving benchmark.
arXiv Detail & Related papers (2021-10-13T17:59:35Z) - End-to-End Pseudo-LiDAR for Image-Based 3D Object Detection [62.34374949726333]
Pseudo-LiDAR (PL) has led to a drastic reduction in the accuracy gap between methods based on LiDAR sensors and those based on cheap stereo cameras.
PL combines state-of-the-art deep neural networks for 3D depth estimation with those for 3D object detection by converting 2D depth map outputs to 3D point cloud inputs.
We introduce a new framework based on differentiable Change of Representation (CoR) modules that allow the entire PL pipeline to be trained end-to-end.
arXiv Detail & Related papers (2020-04-07T02:18:38Z) - DSGN: Deep Stereo Geometry Network for 3D Object Detection [79.16397166985706]
There is a large performance gap between image-based and LiDAR-based 3D object detectors.
Our method, called Deep Stereo Geometry Network (DSGN), significantly reduces this gap.
For the first time, we provide a simple and effective one-stage stereo-based 3D detection pipeline.
arXiv Detail & Related papers (2020-01-10T11:44:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.