FSD V2: Improving Fully Sparse 3D Object Detection with Virtual Voxels
- URL: http://arxiv.org/abs/2308.03755v1
- Date: Mon, 7 Aug 2023 17:59:48 GMT
- Title: FSD V2: Improving Fully Sparse 3D Object Detection with Virtual Voxels
- Authors: Lue Fan, Feng Wang, Naiyan Wang, Zhaoxiang Zhang
- Abstract summary: We present FSDv2, an evolution that aims to simplify the previous FSDv1 while eliminating the inductive bias introduced by its handcrafted instance-level representation.
We develop a suite of components to complement the virtual voxel concept, including a virtual voxel encoder, a virtual voxel mixer, and a virtual voxel assignment strategy.
- Score: 57.05834683261658
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The LiDAR-based fully sparse architecture has garnered increasing attention.
FSDv1 stands out as a representative work, achieving impressive efficacy and
efficiency, albeit with intricate structures and handcrafted designs. In this
paper, we present FSDv2, an evolution that aims to simplify the previous FSDv1
while eliminating the inductive bias introduced by its handcrafted
instance-level representation, thus promoting better general applicability. To
this end, we introduce the concept of virtual voxels, which takes over
the clustering-based instance segmentation in FSDv1. Virtual voxels not only
address the notorious Center Feature Missing problem in fully
sparse detectors but also endow the framework with a more elegant and
streamlined approach. Consequently, we develop a suite of components to
complement the virtual voxel concept, including a virtual voxel encoder, a
virtual voxel mixer, and a virtual voxel assignment strategy. Through empirical
validation, we demonstrate that the virtual voxel mechanism is functionally
similar to the handcrafted clustering in FSDv1 while being more general. We
conduct experiments on three large-scale datasets: Waymo Open Dataset,
Argoverse 2 dataset, and nuScenes dataset. Our results showcase
state-of-the-art performance on all three datasets, highlighting the
superiority of FSDv2 in long-range scenarios and its general applicability to
achieve competitive performance across diverse scenarios. Moreover, we provide
comprehensive experimental analysis to elucidate the workings of FSDv2. To
foster reproducibility and further research, we have open-sourced FSDv2 at
https://github.com/tusen-ai/SST.
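For intuition, the sketch below illustrates the virtual voxel idea at a conceptual level: assuming, as in FSDv1, a voting step that shifts points toward predicted object centers, the voted centers are voxelized and point features are pooled into the resulting virtual voxels, so object centers are covered by voxels even though raw LiDAR returns rarely hit them. This is a minimal illustration under assumed names and shapes (`virtual_voxelize`, `center_offsets`, average pooling), not the authors' implementation; see the linked repository for the real code.

```python
# Hedged sketch of virtual voxelization (illustrative, not the FSDv2 code).
import torch

def virtual_voxelize(points, center_offsets, feats, voxel_size=0.5):
    """points: (N, 3) LiDAR points; center_offsets: (N, 3) predicted
    point-to-center offsets; feats: (N, C) point features."""
    voted_centers = points + center_offsets                   # shift points toward centers
    coords = torch.floor(voted_centers / voxel_size).long()   # integer voxel indices
    uniq, inv = torch.unique(coords, dim=0, return_inverse=True)
    # Average-pool the features of all points voting into the same voxel.
    pooled = torch.zeros(uniq.shape[0], feats.shape[1])
    pooled.index_add_(0, inv, feats)
    counts = torch.bincount(inv, minlength=uniq.shape[0]).clamp(min=1)
    pooled = pooled / counts.unsqueeze(1).float()
    return uniq, pooled  # virtual voxel coordinates and their encoded features

# Toy usage: 1000 points with 16-dim features.
pts = torch.randn(1000, 3) * 10
offs = torch.randn(1000, 3) * 0.2
f = torch.randn(1000, 16)
vox_coords, vox_feats = virtual_voxelize(pts, offs, f)
```

Because the virtual voxels cover predicted centers rather than observed surface points, downstream components (the encoder, mixer, and assignment strategy named in the abstract) can operate on center regions without an explicit clustering step.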
Related papers
- Retain, Blend, and Exchange: A Quality-aware Spatial-Stereo Fusion Approach for Event Stream Recognition [57.74076383449153]
We propose a novel dual-stream framework for event stream-based pattern recognition via differentiated fusion, termed EFV++.
It models two common event representations simultaneously, i.e., event images and event voxels.
We achieve new state-of-the-art performance on the Bullying10k dataset, i.e., 90.51%, which exceeds the second-best method by +2.21%.
arXiv Detail & Related papers (2024-06-27T02:32:46Z)
- Virtually Enriched NYU Depth V2 Dataset for Monocular Depth Estimation: Do We Need Artificial Augmentation? [61.234412062595155]
We present ANYU, a new virtually augmented version of the NYU depth v2 dataset, designed for monocular depth estimation.
In contrast to the well-known approach where full 3D scenes of a virtual world are utilized to generate artificial datasets, ANYU was created by incorporating RGB-D representations of virtual reality objects.
We show that ANYU improves the monocular depth estimation performance and generalization of deep neural networks with considerably different architectures.
arXiv Detail & Related papers (2024-04-15T05:44:03Z)
- XVO: Generalized Visual Odometry via Cross-Modal Self-Training [11.70220331540621]
XVO is a semi-supervised learning method for training generalized monocular Visual Odometry (VO) models.
In contrast to standard monocular VO approaches which often study a known calibration within a single dataset, XVO efficiently learns to recover relative pose with real-world scale.
We optimize the motion estimation model via self-training from large amounts of unconstrained and heterogeneous dash camera videos available on YouTube.
arXiv Detail & Related papers (2023-09-28T18:09:40Z)
- GEM: Boost Simple Network for Glass Surface Segmentation via Vision Foundation Models [7.423981028880871]
Glass surface detection is a challenging task due to the inherent ambiguity arising from the transparent and reflective characteristics of glass.
We propose to address these issues by fully harnessing the capabilities of two existing vision foundation models (VFMs): Stable Diffusion and the Segment Anything Model (SAM).
Our GEM establishes a new state-of-the-art performance with the help of these two VFMs, surpassing the best-reported method GlassSemNet with an IoU improvement of 2.1%.
arXiv Detail & Related papers (2023-07-22T08:37:23Z)
- Virtual Homogeneity Learning: Defending against Data Heterogeneity in Federated Learning [34.97057620481504]
We propose a new approach named virtual homogeneity learning (VHL) to "rectify" the data heterogeneity.
VHL conducts federated learning with a virtual homogeneous dataset crafted to satisfy two conditions: containing no private information and being separable.
Empirically, we demonstrate that VHL endows federated learning with drastically improved convergence speed and generalization performance.
arXiv Detail & Related papers (2022-06-06T10:02:21Z)
- V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer [58.71845618090022]
We build a holistic attention model, namely V2X-ViT, to fuse information across on-road agents.
V2X-ViT consists of alternating layers of heterogeneous multi-agent self-attention and multi-scale window self-attention.
To validate our approach, we create a large-scale V2X perception dataset.
arXiv Detail & Related papers (2022-03-20T20:18:25Z)
- Towards Scale Consistent Monocular Visual Odometry by Learning from the Virtual World [83.36195426897768]
We propose VRVO, a novel framework for retrieving the absolute scale from virtual data.
We first train a scale-aware disparity network using both monocular real images and stereo virtual data.
The resulting scale-consistent disparities are then integrated with a direct VO system.
arXiv Detail & Related papers (2022-03-11T01:51:54Z)
- AFDetV2: Rethinking the Necessity of the Second Stage for Object Detection from Point Clouds [15.72821609622122]
We develop a single-stage anchor-free network for 3D detection from point clouds.
We use a self-calibrated convolution block in the backbone, a keypoint auxiliary supervision, and an IoU prediction branch in the multi-task head.
We won 1st place in the Real-Time 3D Challenge 2021.
arXiv Detail & Related papers (2021-12-16T21:22:17Z)
- Full-Duplex Strategy for Video Object Segmentation [141.43983376262815]
The Full-duplex Strategy Network (FSNet) is a novel framework for video object segmentation (VOS).
Our FSNet performs cross-modal feature passing (i.e., transmission and receiving) simultaneously before the fusion and decoding stage; a minimal sketch follows this entry.
We show that our FSNet outperforms other state-of-the-art methods on both the VOS and video salient object detection tasks.
arXiv Detail & Related papers (2021-08-06T14:50:50Z)
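As an aside on the FSNet entry above, the following is a hedged sketch of what a "full-duplex" cross-modal exchange could look like: appearance and motion features are transmitted and received simultaneously, with both updated features computed from the original inputs rather than sequentially. The module name and the sigmoid gating scheme are illustrative assumptions, not the paper's code.

```python
# Illustrative full-duplex exchange between two modality streams.
import torch
import torch.nn as nn

class FullDuplexExchange(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Each direction gets its own 1x1 "transmit" projection.
        self.rgb_to_flow = nn.Conv2d(channels, channels, kernel_size=1)
        self.flow_to_rgb = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, rgb_feat, flow_feat):
        # Transmission and receiving happen simultaneously: both outputs
        # are computed from the *original* inputs, not one after the other.
        rgb_out = rgb_feat + torch.sigmoid(self.flow_to_rgb(flow_feat)) * rgb_feat
        flow_out = flow_feat + torch.sigmoid(self.rgb_to_flow(rgb_feat)) * flow_feat
        return rgb_out, flow_out

# Toy usage on 64-channel feature maps.
m = FullDuplexExchange(64)
a, b = m(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```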
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.