DFA3D: 3D Deformable Attention For 2D-to-3D Feature Lifting
- URL: http://arxiv.org/abs/2307.12972v1
- Date: Mon, 24 Jul 2023 17:49:11 GMT
- Title: DFA3D: 3D Deformable Attention For 2D-to-3D Feature Lifting
- Authors: Hongyang Li, Hao Zhang, Zhaoyang Zeng, Shilong Liu, Feng Li, Tianhe
Ren, and Lei Zhang
- Abstract summary: We propose a new operator, called 3D DeFormable Attention (DFA3D), for 2D-to-3D feature lifting.
DFA3D transforms multi-view 2D image features into a unified 3D space for 3D object detection.
- Score: 28.709044035867596
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a new operator, called 3D DeFormable Attention
(DFA3D), for 2D-to-3D feature lifting, which transforms multi-view 2D image
features into a unified 3D space for 3D object detection. Existing feature
lifting approaches, such as Lift-Splat-based and 2D attention-based, either use
estimated depth to get pseudo LiDAR features and then splat them to a 3D space,
which is a one-pass operation without feature refinement, or ignore depth and
lift features by 2D attention mechanisms, which achieve finer semantics while
suffering from a depth ambiguity problem. In contrast, our DFA3D-based method
first leverages the estimated depth to expand each view's 2D feature map to 3D
and then utilizes DFA3D to aggregate features from the expanded 3D feature
maps. With the help of DFA3D, the depth ambiguity problem can be effectively
alleviated from the root, and the lifted features can be progressively refined
layer by layer, thanks to the Transformer-like architecture. In addition, we
propose a mathematically equivalent implementation of DFA3D which can
significantly improve its memory efficiency and computational speed. We
integrate DFA3D into several methods that use 2D attention-based feature
lifting with only a few modifications in code and evaluate them on the nuScenes
dataset. The experimental results show a consistent improvement of +1.41% mAP on
average, and up to +15.1% mAP improvement when high-quality depth information
is available, demonstrating the superiority, applicability, and huge potential
of DFA3D. The code is available at
https://github.com/IDEA-Research/3D-deformable-attention.git.
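To make the lifting idea concrete, the sketch below (a minimal PyTorch illustration with assumed tensor names and shapes, not the code from the linked repository) expands one view's 2D feature map into a 3D volume by weighting it with an estimated depth distribution, and then samples that volume two equivalent ways. Because the volume is a per-pixel product of the 2D feature and the depth distribution, trilinear sampling factorizes into a depth-interpolated weight multiplied into the 2D features followed by an ordinary bilinear sample, so the expanded volume never has to be materialized.

```python
# Minimal, illustrative sketch (assumed names/shapes, not the official implementation):
# lift one view's 2D features to 3D with an estimated depth distribution,
# then sample the lifted volume two mathematically equivalent ways.
import math

import torch
import torch.nn.functional as F


def expand_to_3d(feat2d, depth_dist):
    """Naive lifting: per-pixel product of 2D features (C, H, W) and a
    softmax-normalized depth distribution (D, H, W) -> (C, D, H, W) volume."""
    return feat2d.unsqueeze(1) * depth_dist.unsqueeze(0)


def sample_naive(feat2d, depth_dist, u, v, d):
    """Trilinearly sample the materialized volume at normalized (u, v, d) in [-1, 1]."""
    vol = expand_to_3d(feat2d, depth_dist).unsqueeze(0)            # (1, C, D, H, W)
    grid = torch.tensor([[[[[u, v, d]]]]], dtype=vol.dtype)        # (1, 1, 1, 1, 3), order (x, y, z)
    return F.grid_sample(vol, grid, align_corners=True).view(-1)   # (C,)


def sample_factored(feat2d, depth_dist, u, v, d):
    """Equivalent sampling without building the (C, D, H, W) volume.

    Along the depth axis the volume is the 2D feature (constant in depth)
    times the depth distribution, so interpolating in depth first reduces the
    problem to one bilinear sample of a depth-weighted 2D feature map.
    """
    D = depth_dist.shape[0]
    z = (d + 1) / 2 * (D - 1)                          # normalized depth -> fractional slice index
    z0 = min(max(int(math.floor(z)), 0), D - 2)
    gamma = z - z0
    depth_w = (1 - gamma) * depth_dist[z0] + gamma * depth_dist[z0 + 1]   # (H, W)
    weighted = (feat2d * depth_w.unsqueeze(0)).unsqueeze(0)               # (1, C, H, W)
    grid = torch.tensor([[[[u, v]]]], dtype=feat2d.dtype)                 # (1, 1, 1, 2)
    return F.grid_sample(weighted, grid, align_corners=True).view(-1)     # (C,)


if __name__ == "__main__":
    C, D, H, W = 8, 4, 16, 16
    feat2d = torch.randn(C, H, W)
    depth_dist = torch.softmax(torch.randn(D, H, W), dim=0)   # estimated depth distribution
    u, v, d = 0.3, -0.2, 0.5                                  # an in-bounds sampling location
    print(torch.allclose(sample_naive(feat2d, depth_dist, u, v, d),
                         sample_factored(feat2d, depth_dist, u, v, d), atol=1e-5))
```

This factorization is the kind of equivalence that allows a memory-efficient implementation to avoid storing the expanded depth-by-height-by-width feature maps; the full operator described in the abstract additionally uses learned sampling offsets and attention weights across multiple views, heads, and decoder layers, which this sketch omits.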
Related papers
- ConDense: Consistent 2D/3D Pre-training for Dense and Sparse Features from Multi-View Images [47.682942867405224]
ConDense is a framework for 3D pre-training utilizing existing 2D networks and large-scale multi-view datasets.
We propose a novel 2D-3D joint training scheme to extract co-embedded 2D and 3D features in an end-to-end pipeline.
arXiv Detail & Related papers (2024-08-30T05:57:01Z)
- Improving 2D Feature Representations by 3D-Aware Fine-Tuning [17.01280751430423]
Current visual foundation models are trained purely on unstructured 2D data.
We show that fine-tuning on 3D-aware data improves the quality of emerging semantic features.
arXiv Detail & Related papers (2024-07-29T17:59:21Z)
- Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding [83.63231467746598]
We introduce Any2Point, a parameter-efficient method to empower any-modality large models (vision, language, audio) for 3D understanding.
We propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points to the original 1D or 2D positions within the source modality.
arXiv Detail & Related papers (2024-04-11T17:59:45Z)
- GOEmbed: Gradient Origin Embeddings for Representation Agnostic 3D Feature Learning [67.61509647032862]
We propose GOEmbed (Gradient Origin Embeddings), which encodes input 2D images into any 3D representation.
Unlike typical prior approaches, it relies neither on 2D features extracted from large pre-trained models nor on customized features designed for particular 3D representations.
arXiv Detail & Related papers (2023-12-14T08:39:39Z)
- BUOL: A Bottom-Up Framework with Occupancy-aware Lifting for Panoptic 3D Scene Reconstruction From A Single Image [33.126045619754365]
BUOL is a bottom-up framework with occupancy-aware lifting that addresses two key issues in panoptic 3D scene reconstruction from a single image.
Our method shows a substantial performance advantage over state-of-the-art methods on the synthetic dataset 3D-Front and the real-world dataset Matterport3D.
arXiv Detail & Related papers (2023-06-01T17:56:49Z)
- 3D-Aware Indoor Scene Synthesis with Depth Priors [62.82867334012399]
Existing methods fail to model indoor scenes due to the large diversity of room layouts and of the objects inside them.
We argue that indoor scenes do not have a shared intrinsic structure, and hence 2D images alone cannot adequately guide the model toward the underlying 3D geometry.
arXiv Detail & Related papers (2022-02-17T09:54:29Z)
- FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection [78.00922683083776]
It is non-trivial to adapt a general 2D detector to this 3D task.
In this technical report, we study this problem with a solution built on a fully convolutional single-stage detector.
Our solution achieves 1st place out of all the vision-only methods in the nuScenes 3D detection challenge of NeurIPS 2020.
arXiv Detail & Related papers (2021-04-22T09:35:35Z)
- 3D-to-2D Distillation for Indoor Scene Parsing [78.36781565047656]
We present a new approach that leverages 3D features extracted from a large-scale 3D data repository to enhance 2D features extracted from RGB images.
First, we distill 3D knowledge from a pretrained 3D network to supervise a 2D network to learn simulated 3D features from 2D features during training.
Second, we design a two-stage dimension normalization scheme to calibrate the 2D and 3D features for better integration.
Third, we design a semantic-aware adversarial training model to extend our framework for training with unpaired 3D data.
arXiv Detail & Related papers (2021-04-06T02:22:24Z)