Pixel-Aligned Recurrent Queries for Multi-View 3D Object Detection
- URL: http://arxiv.org/abs/2310.01401v1
- Date: Mon, 2 Oct 2023 17:58:51 GMT
- Title: Pixel-Aligned Recurrent Queries for Multi-View 3D Object Detection
- Authors: Yiming Xie, Huaizu Jiang, Georgia Gkioxari, Julian Straub
- Abstract summary: PARQ is a multi-view 3D object detector with transformer and pixel-aligned recurrent queries.
It can leverage additional input views without retraining, and can adapt inference compute by changing the number of recurrent iterations.
- Score: 16.677107631803327
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We present PARQ - a multi-view 3D object detector with transformer and
pixel-aligned recurrent queries. Unlike previous works that use learnable
features or only encode 3D point positions as queries in the decoder, PARQ
leverages appearance-enhanced queries initialized from reference points in 3D
space and updates their 3D location with recurrent cross-attention operations.
Incorporating pixel-aligned features and cross attention enables the model to
encode the necessary 3D-to-2D correspondences and capture global contextual
information of the input images. PARQ outperforms prior best methods on the
ScanNet and ARKitScenes datasets, learns and detects faster, is more robust to
distribution shifts in reference points, can leverage additional input views
without retraining, and can adapt inference compute by changing the number of
recurrent iterations.
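The recurrent update loop the abstract describes can be sketched roughly as follows. This is a minimal NumPy illustration under simplifying assumptions, not the authors' implementation: single view, nearest-neighbor feature sampling, single-head attention, and a hypothetical learned regression head `W_delta` stand in for the paper's actual components.

```python
import numpy as np

def project(points, K):
    """Pinhole projection of Nx3 camera-space points to pixel coords (assumed intrinsics K)."""
    uv = (K @ points.T).T          # N x 3 homogeneous pixel coords
    return uv[:, :2] / uv[:, 2:3]  # N x 2

def sample_features(feat_map, uv):
    """Nearest-neighbor sampling of pixel-aligned features from an H x W x C map."""
    h, w, _ = feat_map.shape
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, w - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, h - 1)
    return feat_map[v, u]          # N x C

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention (queries: N x C)."""
    scores = queries @ keys.T / np.sqrt(queries.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ values

def parq_style_loop(points, feat_map, K, W_delta, n_iters=3):
    """Recurrently refine 3D reference points with pixel-aligned queries.

    Each iteration: project the points into the image, gather appearance
    features at those pixels (appearance-enhanced queries), cross-attend
    over all image tokens, and regress a 3D offset. W_delta is a
    hypothetical learned C x 3 head standing in for the decoder's box head.
    """
    tokens = feat_map.reshape(-1, feat_map.shape[-1])  # flattened image tokens
    for _ in range(n_iters):
        uv = project(points, K)
        queries = sample_features(feat_map, uv)
        updated = cross_attention(queries, tokens, tokens)
        points = points + updated @ W_delta            # update 3D locations
    return points
```

Because the queries are re-derived from the current point projections at every step, running more iterations at inference simply repeats this loop, which is consistent with the claim that compute can be adapted without retraining.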
Related papers
- 3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features [70.50665869806188]
3DiffTection is a state-of-the-art method for 3D object detection from single images.
We fine-tune a diffusion model to perform novel view synthesis conditioned on a single image.
We further train the model on target data with detection supervision.
arXiv Detail & Related papers (2023-11-07T23:46:41Z)
- V-DETR: DETR with Vertex Relative Position Encoding for 3D Object Detection [73.37781484123536]
We introduce a highly performant 3D object detector for point clouds using the DETR framework.
To address the limitations of existing position encodings, we introduce a novel 3D Vertex Relative Position Encoding (3DV-RPE) method.
We show exceptional results on the challenging ScanNetV2 benchmark.
arXiv Detail & Related papers (2023-08-08T17:14:14Z)
- Viewpoint Equivariance for Multi-View 3D Object Detection [35.4090127133834]
State-of-the-art methods focus on reasoning and decoding object bounding boxes from multi-view camera input.
We introduce VEDet, a novel 3D object detection framework that exploits 3D multi-view geometry.
arXiv Detail & Related papers (2023-03-25T19:56:41Z)
- CAPE: Camera View Position Embedding for Multi-View 3D Object Detection [100.02565745233247]
Current query-based methods rely on global 3D position embeddings to learn the geometric correspondence between images and 3D space.
We propose a novel method based on CAmera view Position Embedding, called CAPE.
CAPE achieves state-of-the-art performance (61.0% NDS and 52.5% mAP) among all LiDAR-free methods on nuScenes dataset.
arXiv Detail & Related papers (2023-03-17T18:59:54Z)
- 3DPPE: 3D Point Positional Encoding for Multi-Camera 3D Object Detection Transformers [35.14784758217257]
We introduce 3D point positional encoding, 3DPPE, to the 3D detection Transformer decoder.
Despite the approximation involved in estimating the encoded point positions, 3DPPE achieves 46.0 mAP and 51.4 NDS on the competitive nuScenes dataset.
arXiv Detail & Related papers (2022-11-27T03:36:32Z)
- Bridged Transformer for Vision and Point Cloud 3D Object Detection [92.86856146086316]
Bridged Transformer (BrT) is an end-to-end architecture for 3D object detection.
BrT learns to identify 3D and 2D object bounding boxes from both points and image patches.
We experimentally show that BrT surpasses state-of-the-art methods on SUN RGB-D and ScanNetV2 datasets.
arXiv Detail & Related papers (2022-10-04T05:44:22Z)
- CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
arXiv Detail & Related papers (2022-09-13T05:26:09Z)
- SRCN3D: Sparse R-CNN 3D for Compact Convolutional Multi-View 3D Object Detection and Tracking [12.285423418301683]
This paper proposes Sparse R-CNN 3D (SRCN3D), a novel two-stage fully-sparse detector that incorporates sparse queries, sparse attention with box-wise sampling, and sparse prediction.
Experiments on nuScenes dataset demonstrate that SRCN3D achieves competitive performance in both 3D object detection and multi-object tracking tasks.
arXiv Detail & Related papers (2022-06-29T07:58:39Z)
- Point2Seq: Detecting 3D Objects as Sequences [58.63662049729309]
We present a simple and effective framework, named Point2Seq, for 3D object detection from point clouds.
We view each 3D object as a sequence of words and reformulate the 3D object detection task as decoding words from 3D scenes in an auto-regressive manner.
arXiv Detail & Related papers (2022-03-25T00:20:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.