FSD V2: Improving Fully Sparse 3D Object Detection with Virtual Voxels
- URL: http://arxiv.org/abs/2308.03755v1
- Date: Mon, 7 Aug 2023 17:59:48 GMT
- Title: FSD V2: Improving Fully Sparse 3D Object Detection with Virtual Voxels
- Authors: Lue Fan, Feng Wang, Naiyan Wang, Zhaoxiang Zhang
- Abstract summary: We present FSDv2, an evolution that aims to simplify the previous FSDv1 while eliminating the inductive bias introduced by its handcrafted instance-level representation.
We develop a suite of components to complement the virtual voxel concept, including a virtual voxel encoder, a virtual voxel mixer, and a virtual voxel assignment strategy.
- Score: 57.05834683261658
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: LiDAR-based fully sparse architecture has garnered increasing attention.
FSDv1 stands out as a representative work, achieving impressive efficacy and
efficiency, albeit with intricate structures and handcrafted designs. In this
paper, we present FSDv2, an evolution that aims to simplify the previous FSDv1
while eliminating the inductive bias introduced by its handcrafted
instance-level representation, thus promoting better general applicability. To
this end, we introduce the concept of \textbf{virtual voxels}, which takes over
the clustering-based instance segmentation in FSDv1. Virtual voxels not only
address the notorious issue of the Center Feature Missing problem in fully
sparse detectors but also endow the framework with a more elegant and
streamlined approach. Consequently, we develop a suite of components to
complement the virtual voxel concept, including a virtual voxel encoder, a
virtual voxel mixer, and a virtual voxel assignment strategy. Through empirical
validation, we demonstrate that the virtual voxel mechanism is functionally
similar to the handcrafted clustering in FSDv1 while being more general. We
conduct experiments on three large-scale datasets: Waymo Open Dataset,
Argoverse 2 dataset, and nuScenes dataset. Our results showcase
state-of-the-art performance on all three datasets, highlighting the
superiority of FSDv2 in long-range scenarios and its general applicability to
achieve competitive performance across diverse scenarios. Moreover, we provide
comprehensive experimental analysis to elucidate the workings of FSDv2. To
foster reproducibility and further research, we have open-sourced FSDv2 at
https://github.com/tusen-ai/SST.
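The abstract does not detail the virtual voxel pipeline, but the underlying idea of voxelizing a LiDAR point cloud and pooling a feature per occupied voxel can be sketched as follows. This is a minimal illustration under assumptions: the 0.5 m grid size, the scatter-mean encoder, and the `voxelize` helper are all hypothetical and are not FSDv2's actual implementation (see the linked repository for that).

```python
# Minimal sketch of point-cloud voxelization with mean pooling per voxel.
# Hypothetical helper for illustration; not FSDv2's virtual voxel encoder.
import numpy as np

def voxelize(points: np.ndarray, voxel_size: float = 0.5):
    """Map (N, 3) points to integer voxel indices and mean-pool per voxel."""
    coords = np.floor(points / voxel_size).astype(np.int64)  # (N, 3) voxel ids
    uniq, inv = np.unique(coords, axis=0, return_inverse=True)
    # Scatter-mean: average the coordinates of points falling into each voxel.
    sums = np.zeros((len(uniq), 3))
    np.add.at(sums, inv, points)
    counts = np.bincount(inv, minlength=len(uniq))[:, None]
    return uniq, sums / counts

pts = np.array([[0.1, 0.2, 0.0],
                [0.3, 0.1, 0.1],   # shares voxel (0, 0, 0) with the first point
                [1.2, 0.0, 0.0]])  # falls into voxel (2, 0, 0)
voxels, centers = voxelize(pts)
# Two occupied voxels; centers[0] is the mean of the first two points.
```

A fully sparse detector operates only on such occupied voxels, which is why object centers (often empty of LiDAR returns) need special handling, i.e., the Center Feature Missing problem that FSDv2's virtual voxels are designed to address.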
Related papers
- LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous Driving [52.83707400688378]
LargeAD is a versatile and scalable framework designed for large-scale 3D pretraining across diverse real-world driving datasets.
Our framework leverages VFMs to extract semantically rich superpixels from 2D images, which are aligned with LiDAR point clouds to generate high-quality contrastive samples.
Our approach delivers significant performance improvements over state-of-the-art methods in both linear probing and fine-tuning tasks for both LiDAR-based segmentation and object detection.
arXiv Detail & Related papers (2025-01-07T18:59:59Z) - All-in-One: Transferring Vision Foundation Models into Stereo Matching [13.781452399651887]
AIO-Stereo can flexibly select and transfer knowledge from multiple heterogeneous VFMs to a single stereo matching model.
We show that AIO-Stereo achieves state-of-the-art performance on multiple datasets and ranks 1st on the Middlebury dataset.
arXiv Detail & Related papers (2024-12-13T06:59:17Z) - XVO: Generalized Visual Odometry via Cross-Modal Self-Training [11.70220331540621]
XVO is a semi-supervised learning method for training generalized monocular Visual Odometry (VO) models.
In contrast to standard monocular VO approaches, which typically assume a known calibration within a single dataset, XVO efficiently learns to recover relative pose with real-world scale.
We optimize the motion estimation model via self-training from large amounts of unconstrained and heterogeneous dash camera videos available on YouTube.
arXiv Detail & Related papers (2023-09-28T18:09:40Z) - SimVPv2: Towards Simple yet Powerful Spatiotemporal Predictive Learning [61.419914155985886]
We propose SimVPv2, a streamlined model that eliminates the need for Unet architectures for spatial and temporal modeling.
SimVPv2 not only simplifies the model architecture but also improves both performance and computational efficiency.
On the standard Moving MNIST benchmark, SimVPv2 outperforms SimVP with fewer FLOPs, about half the training time, and 60% faster inference.
arXiv Detail & Related papers (2022-11-22T08:01:33Z) - V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer [58.71845618090022]
We build a holistic attention model, namely V2X-ViT, to fuse information across on-road agents.
V2X-ViT consists of alternating layers of heterogeneous multi-agent self-attention and multi-scale window self-attention.
To validate our approach, we create a large-scale V2X perception dataset.
arXiv Detail & Related papers (2022-03-20T20:18:25Z) - Towards Scale Consistent Monocular Visual Odometry by Learning from the Virtual World [83.36195426897768]
We propose VRVO, a novel framework for retrieving the absolute scale from virtual data.
We first train a scale-aware disparity network using both monocular real images and stereo virtual data.
The resulting scale-consistent disparities are then integrated with a direct VO system.
arXiv Detail & Related papers (2022-03-11T01:51:54Z) - AFDetV2: Rethinking the Necessity of the Second Stage for Object Detection from Point Clouds [15.72821609622122]
We develop a single-stage anchor-free network for 3D detection from point clouds.
We use a self-calibrated convolution block in the backbone, a keypoint auxiliary supervision, and an IoU prediction branch in the multi-task head.
We won 1st place in the Real-Time 3D Challenge 2021.
arXiv Detail & Related papers (2021-12-16T21:22:17Z) - Full-Duplex Strategy for Video Object Segmentation [141.43983376262815]
Full-duplex Strategy Network (FSNet) is a novel framework for video object segmentation (VOS).
Our FSNet performs cross-modal feature passing (i.e., transmission and receiving) simultaneously before the fusion decoding stage.
We show that our FSNet outperforms other state-of-the-art methods on both the VOS and video salient object detection tasks.
arXiv Detail & Related papers (2021-08-06T14:50:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.