FrustumFormer: Adaptive Instance-aware Resampling for Multi-view 3D
Detection
- URL: http://arxiv.org/abs/2301.04467v2
- Date: Fri, 24 Mar 2023 09:18:39 GMT
- Title: FrustumFormer: Adaptive Instance-aware Resampling for Multi-view 3D
Detection
- Authors: Yuqi Wang, Yuntao Chen, and Zhaoxiang Zhang
- Abstract summary: We propose a novel framework named FrustumFormer, which pays more attention to the features in instance regions via adaptive instance-aware resampling.
Experiments on the nuScenes dataset demonstrate the effectiveness of FrustumFormer, and we achieve a new state-of-the-art performance on the benchmark.
- Score: 47.6570523164125
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The transformation of features from 2D perspective space to 3D space is
essential to multi-view 3D object detection. Recent approaches mainly focus on
the design of the view transformation, either lifting perspective-view
features into 3D space pixel-wise with estimated depth or constructing BEV
features grid-wise via 3D projection, treating all pixels or grids equally. However,
choosing what to transform is also important but has rarely been discussed
before. The pixels of a moving car are more informative than the pixels of the
sky. To fully utilize the information contained in images, the view
transformation should be able to adapt to different image regions according to
their contents. In this paper, we propose a novel framework named
FrustumFormer, which pays more attention to the features in instance regions
via adaptive instance-aware resampling. Specifically, the model obtains
instance frustums on the bird's eye view by leveraging image view object
proposals. An adaptive occupancy mask within the instance frustum is learned to
refine the instance location. Moreover, the temporal frustum intersection could
further reduce the localization uncertainty of objects. Comprehensive
experiments on the nuScenes dataset demonstrate the effectiveness of
FrustumFormer, and we achieve a new state-of-the-art performance on the
benchmark. Code and models will be made available at
https://github.com/Robertwyq/Frustum.
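For illustration, below is a minimal sketch of the core idea of instance-aware resampling: lifting a 2D box proposal into a frustum and rasterizing its footprint on the BEV grid. It assumes a pinhole camera model; all function names, shapes, and grid parameters are illustrative assumptions, not the released implementation.

```python
import torch

def instance_frustum_bev_mask(box2d, K, cam2ego, depth_range=(1.0, 60.0),
                              num_depths=64, bev_range=51.2, bev_size=128):
    """Rasterize the BEV footprint of the frustum behind a 2D box proposal.

    box2d:   [x1, y1, x2, y2] pixel box from an image-view detector
    K:       (3, 3) camera intrinsics
    cam2ego: (4, 4) camera-to-ego transform
    """
    x1, y1, x2, y2 = [float(v) for v in box2d]
    # Sample pixels inside the box and depth hypotheses along each ray.
    u, v, d = torch.meshgrid(torch.linspace(x1, x2, 16),
                             torch.linspace(y1, y2, 16),
                             torch.linspace(*depth_range, num_depths),
                             indexing="ij")
    # Back-project to camera coordinates: p_cam = d * K^-1 [u, v, 1]^T
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1)
    cam = torch.einsum("ij,...j->...i", torch.inverse(K), pix) * d[..., None]
    # Transform to ego coordinates.
    cam_h = torch.cat([cam, torch.ones_like(cam[..., :1])], dim=-1)
    ego = torch.einsum("ij,...j->...i", cam2ego, cam_h)[..., :3]
    # Mark every BEV cell hit by a frustum point as occupied.
    cell = ((ego[..., :2] + bev_range) / (2 * bev_range) * bev_size).long()
    valid = ((cell >= 0) & (cell < bev_size)).all(dim=-1)
    mask = torch.zeros(bev_size, bev_size, dtype=torch.bool)
    mask[cell[valid][:, 1], cell[valid][:, 0]] = True
    return mask  # dense queries can then be placed inside this footprint
```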
Related papers
- Parametric Depth Based Feature Representation Learning for Object Detection and Segmentation in Bird's Eye View [44.78243406441798]
This paper focuses on leveraging geometry information, such as depth, to model the 2D-to-BEV feature transformation.
We first lift the 2D image features to the 3D space defined for the ego vehicle via a predicted parametric depth distribution for each pixel in each view.
We then aggregate the 3D feature volume into the BEV frame based on the 3D space occupancy derived from depth.
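The pixel-wise lifting step described above can be sketched as an outer product between image features and a predicted categorical depth distribution (lift-splat style); the shapes and names below are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def lift_features(feat, depth_logits):
    """feat: (B, C, H, W) image features; depth_logits: (B, D, H, W).

    Returns a (B, C, D, H, W) frustum volume in which each pixel's feature
    is weighted by its predicted categorical depth distribution.
    """
    depth_prob = F.softmax(depth_logits, dim=1)         # per-pixel depth pmf
    return feat.unsqueeze(2) * depth_prob.unsqueeze(1)  # outer product over C, D
```

Aggregating this volume into the BEV frame then amounts to summing the voxels that fall into each BEV cell, weighted by the occupancy derived from depth.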
arXiv Detail & Related papers (2023-07-09T06:07:22Z)
- Viewpoint Equivariance for Multi-View 3D Object Detection [35.4090127133834]
State-of-the-art methods focus on reasoning and decoding object bounding boxes from multi-view camera input.
We introduce VEDet, a novel 3D object detection framework that exploits 3D multi-view geometry.
arXiv Detail & Related papers (2023-03-25T19:56:41Z)
- OA-BEV: Bringing Object Awareness to Bird's-Eye-View Representation for Multi-Camera 3D Object Detection [78.38062015443195]
OA-BEV is a network that can be plugged into BEV-based 3D object detection frameworks.
Our method achieves consistent improvements over the BEV-based baselines in terms of both average precision and nuScenes detection score.
arXiv Detail & Related papers (2023-01-13T06:02:31Z)
- Scatter Points in Space: 3D Detection from Multi-view Monocular Images [8.71944437852952]
3D object detection from monocular image(s) is a challenging and long-standing problem in computer vision.
Recent methods tend to aggregate multi-view features by densely sampling a regular 3D grid in space.
We propose a learnable keypoint sampling method that scatters pseudo surface points in 3D space in order to preserve data sparsity.
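A minimal sketch of such a learnable keypoint sampler, assuming per-anchor features and a linear offset head (both illustrative, not the paper's actual design):

```python
import torch.nn as nn

class KeypointSampler(nn.Module):
    """Scatter coarse 3D anchors toward object surfaces via learned offsets."""
    def __init__(self, feat_dim):
        super().__init__()
        self.offset_head = nn.Linear(feat_dim, 3)  # per-anchor 3D offset

    def forward(self, anchors, anchor_feats):
        # anchors: (N, 3) coarse positions; anchor_feats: (N, C) their features
        offsets = self.offset_head(anchor_feats)    # (N, 3)
        return anchors + offsets  # sparse pseudo surface points to sample at
```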
arXiv Detail & Related papers (2022-08-31T09:38:05Z)
- Vision Transformer for NeRF-Based View Synthesis from a Single Input Image [49.956005709863355]
We propose to leverage both the global and local features to form an expressive 3D representation.
To synthesize a novel view, we train a multilayer perceptron (MLP) network conditioned on the learned 3D representation to perform volume rendering.
Our method can render novel views from only a single input image and generalize across multiple object categories using a single model.
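The volume rendering step here is the standard quadrature; a minimal sketch follows, where the per-sample colors and densities are assumed to come from the conditioned MLP.

```python
import torch

def render_ray(rgb, sigma, deltas):
    """Composite per-sample predictions along one ray.

    rgb:    (S, 3) colors from the conditioned MLP (assumed)
    sigma:  (S,)   volume densities
    deltas: (S,)   distances between consecutive samples
    """
    alpha = 1.0 - torch.exp(-sigma * deltas)               # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]
    weights = alpha * trans                    # contribution of each sample
    return (weights[:, None] * rgb).sum(dim=0)  # final pixel color
```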
arXiv Detail & Related papers (2022-07-12T17:52:04Z)
- Pixel2Mesh++: 3D Mesh Generation and Refinement from Multi-View Images [82.32776379815712]
We study the problem of shape generation in 3D mesh representation from a small number of color images with or without camera poses.
We further improve the shape quality by leveraging cross-view information with a graph convolutional network.
Our model is robust to the quality of the initial mesh and the error of camera pose, and can be combined with a differentiable function for test-time optimization.
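A rough sketch of cross-view refinement with a graph convolution, under assumed shapes and module names (not the paper's actual architecture):

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w_self = nn.Linear(dim, dim)
        self.w_nbr = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (V, C) per-vertex features; adj: (V, V) row-normalized adjacency
        return torch.relu(self.w_self(x) + self.w_nbr(adj @ x))

def refine_vertices(verts, view_feats, adj, gconv, head):
    # view_feats: (num_views, V, C) features sampled at each vertex's
    # projection in each view; pool across views, then predict offsets.
    x = view_feats.mean(dim=0)          # cross-view pooling
    return verts + head(gconv(x, adj))  # head: nn.Linear(C, 3) offset head
```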
arXiv Detail & Related papers (2022-04-21T03:42:31Z)
- CoCoNets: Continuous Contrastive 3D Scene Representations [21.906643302668716]
This paper explores self-supervised learning of amodal 3D feature representations from RGB and RGB-D posed images and videos.
We show the resulting 3D visual feature representations effectively scale across objects and scenes, imagine information occluded or missing from the input viewpoints, track objects over time, align semantically related objects in 3D, and improve 3D object detection.
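A point-level contrastive objective in this spirit can be sketched with a standard InfoNCE loss over matched 3D point features; the setup below is an assumption, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def point_infonce(feats_a, feats_b, temperature=0.07):
    """feats_a, feats_b: (N, C) features of the same N 3D points seen under
    two views; matched rows are positives, all other rows are negatives."""
    a = F.normalize(feats_a, dim=1)
    b = F.normalize(feats_b, dim=1)
    logits = a @ b.t() / temperature               # (N, N) cosine similarities
    targets = torch.arange(a.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)        # diagonal = positive pairs
```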
arXiv Detail & Related papers (2021-04-08T15:50:47Z)
- Self-Supervised Multi-View Learning via Auto-Encoding 3D Transformations [61.870882736758624]
We propose a novel self-supervised paradigm to learn Multi-View Transformation Equivariant Representations (MV-TER).
Specifically, we perform a 3D transformation on a 3D object and obtain multiple views before and after the transformation via projection.
Then, we self-train a representation that captures the intrinsic 3D object structure by decoding the 3D transformation parameters from the fused feature representations of the multiple views before and after the transformation.
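A minimal sketch of this self-supervision signal, regressing the transformation parameters from fused features before and after the transform (module names and the 6-DoF parameterization are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformDecoder(nn.Module):
    """Regress 3D transformation parameters from fused multi-view features."""
    def __init__(self, feat_dim, param_dim=6):  # assumed: 3 rotation + 3 translation
        super().__init__()
        self.head = nn.Linear(2 * feat_dim, param_dim)

    def forward(self, fused_before, fused_after, params_gt):
        # fused_*: (B, C) features fused across views before/after the transform
        pred = self.head(torch.cat([fused_before, fused_after], dim=1))
        return F.mse_loss(pred, params_gt)  # self-supervised training loss
```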
arXiv Detail & Related papers (2021-03-01T06:24:17Z)
- Weakly Supervised Learning of Multi-Object 3D Scene Decompositions Using Deep Shape Priors [69.02332607843569]
PriSMONet is a novel approach for learning Multi-Object 3D scene decomposition and representations from single images.
A recurrent encoder regresses a latent representation of 3D shape, pose and texture of each object from an input RGB image.
We evaluate the accuracy of our model in inferring 3D scene layout, demonstrate its generative capabilities, assess its generalization to real images, and point out benefits of the learned representation.
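A rough sketch of the recurrent per-object encoding described above, with assumed latent sizes and module names:

```python
import torch
import torch.nn as nn

class ObjectEncoder(nn.Module):
    """Emit one latent (shape, pose, texture) per step from image features."""
    def __init__(self, feat_dim, z_shape=64, z_pose=7, z_tex=16):
        super().__init__()
        self.rnn = nn.GRUCell(feat_dim, feat_dim)
        self.heads = nn.Linear(feat_dim, z_shape + z_pose + z_tex)

    def forward(self, img_feat, num_objects):
        # img_feat: (B, feat_dim) features of the input RGB image
        h = torch.zeros_like(img_feat)       # hidden state
        latents = []
        for _ in range(num_objects):         # one object latent per step
            h = self.rnn(img_feat, h)
            latents.append(self.heads(h))
        return torch.stack(latents, dim=1)   # (B, num_objects, Z)
```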
arXiv Detail & Related papers (2020-10-08T14:49:23Z)