Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D
Object Detection
- URL: http://arxiv.org/abs/2303.11926v2
- Date: Wed, 7 Jun 2023 08:08:16 GMT
- Title: Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D
Object Detection
- Authors: Shihao Wang, Yingfei Liu, Tiancai Wang, Ying Li, Xiangyu Zhang
- Abstract summary: We propose a long-sequence modeling framework, named StreamPETR, for multi-view 3D object detection.
StreamPETR achieves significant performance improvements with only negligible cost compared to the single-frame baseline.
The lightweight version reaches 45.0% mAP at 31.7 FPS, outperforming the state-of-the-art method (SOLOFusion) by 2.3% mAP while running 1.8x faster.
- Score: 20.161887223481994
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a long-sequence modeling framework, named
StreamPETR, for multi-view 3D object detection. Built upon the sparse query
design in the PETR series, we systematically develop an object-centric temporal
mechanism. The model operates in an online manner, and long-term historical
information is propagated through object queries frame by frame. In addition,
we introduce motion-aware layer normalization to model the movement of objects.
StreamPETR achieves significant performance improvements with only negligible
computation cost compared to the single-frame baseline. On the standard
nuScenes benchmark, it is the first online multi-view method to achieve
performance comparable to lidar-based methods (67.6% NDS & 65.3% AMOTA). The
lightweight version reaches 45.0% mAP at 31.7 FPS, outperforming the
state-of-the-art method (SOLOFusion) by 2.3% mAP while running 1.8x faster.
Code is available at https://github.com/exiawsh/StreamPETR.git.
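The abstract describes two mechanisms only at a high level: propagating object queries frame by frame and motion-aware layer normalization. The PyTorch sketch below illustrates one plausible reading of these ideas; the class names, tensor shapes, top-k FIFO memory design, and the scale-shift parameterization are assumptions of this sketch rather than the released StreamPETR implementation (see the repository linked above for the actual code).

```python
from typing import List, Optional

import torch
import torch.nn as nn


class MotionAwareLayerNorm(nn.Module):
    """Sketch of motion-aware layer normalization: the affine scale/shift of a
    LayerNorm is predicted per object from motion attributes (e.g. velocity,
    ego-pose change, time delta) instead of being a fixed learned parameter."""

    def __init__(self, embed_dim: int, motion_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim, elementwise_affine=False)
        self.to_scale = nn.Linear(motion_dim, embed_dim)
        self.to_shift = nn.Linear(motion_dim, embed_dim)

    def forward(self, queries: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        # queries: (B, N, C) object query features; motion: (B, N, motion_dim)
        return self.norm(queries) * (1.0 + self.to_scale(motion)) + self.to_shift(motion)


class ObjectQueryMemory:
    """Sketch of frame-by-frame query propagation: after each frame the top-k
    most confident object queries are pushed into a small FIFO memory and
    reused as extra temporal queries in the next frame."""

    def __init__(self, max_frames: int = 4, top_k: int = 128):
        self.max_frames = max_frames
        self.top_k = top_k
        self._buffer: List[torch.Tensor] = []

    def update(self, queries: torch.Tensor, scores: torch.Tensor) -> None:
        # queries: (B, N, C); scores: (B, N) per-query confidence.
        k = min(self.top_k, scores.shape[1])
        idx = scores.topk(k, dim=1).indices                       # (B, k)
        kept = torch.gather(
            queries, 1, idx.unsqueeze(-1).expand(-1, -1, queries.shape[-1])
        )                                                         # (B, k, C)
        self._buffer.append(kept.detach())
        self._buffer = self._buffer[-self.max_frames:]

    def propagated(self) -> Optional[torch.Tensor]:
        # Historical queries to concatenate with the current frame's queries.
        if not self._buffer:
            return None
        return torch.cat(self._buffer, dim=1)                     # (B, k*frames, C)
```

In this sketch, the output of propagated() would be concatenated with the current frame's learnable queries before the transformer decoder, which is what makes the temporal modeling object-centric rather than dense-feature based.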
Related papers
- TAPVid-3D: A Benchmark for Tracking Any Point in 3D [63.060421798990845]
We introduce a new benchmark, TAPVid-3D, for evaluating the task of Tracking Any Point in 3D.
This benchmark will serve as a guidepost to improve our ability to understand precise 3D motion and surface deformation from monocular video.
arXiv Detail & Related papers (2024-07-08T13:28:47Z)
- PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection [66.94819989912823]
We propose a point-trajectory transformer with long short-term memory for efficient temporal 3D object detection.
We use point clouds of current-frame objects and their historical trajectories as input to minimize the memory bank storage requirement.
We conduct extensive experiments on the large-scale dataset to demonstrate that our approach performs well against state-of-the-art methods.
arXiv Detail & Related papers (2023-12-13T18:59:13Z)
- LEF: Late-to-Early Temporal Fusion for LiDAR 3D Object Detection [40.267769862404684]
We propose a late-to-early recurrent feature fusion scheme for 3D object detection using temporal LiDAR point clouds.
Our main motivation is fusing object-aware latent embeddings into the early stages of a 3D object detector.
arXiv Detail & Related papers (2023-09-28T21:58:25Z)
- DORT: Modeling Dynamic Objects in Recurrent for Multi-Camera 3D Object Detection and Tracking [67.34803048690428]
We propose to model Dynamic Objects in RecurrenT (DORT) to tackle this problem.
DORT extracts object-wise local volumes for motion estimation, which also alleviates the heavy computational burden.
It is flexible and practical, and can be plugged into most camera-based 3D object detectors.
arXiv Detail & Related papers (2023-03-29T12:33:55Z)
- Rethinking Voxelization and Classification for 3D Object Detection [68.8204255655161]
The main challenge in 3D object detection from LiDAR point clouds is achieving real-time performance without affecting the reliability of the network.
We present a solution to improve network inference speed and precision at the same time by implementing a fast dynamic voxelizer.
In addition, we propose a lightweight detection sub-head model for classifying predicted objects and filtering out falsely detected objects.
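The summary mentions a fast dynamic voxelizer but gives no implementation details. As a point of reference, the snippet below is a generic dynamic-voxelization sketch (every point is kept and scattered into its voxel, with no fixed per-voxel point budget), not the authors' voxelizer; voxel_size, pc_range, and the mean-pooling choice are assumptions of this sketch.

```python
import torch


def dynamic_voxelize(points: torch.Tensor,
                     voxel_size: float = 0.2,
                     pc_range: float = 50.0):
    """Generic dynamic voxelization: every point is assigned to a voxel and
    per-voxel features are mean-pooled, avoiding the fixed per-voxel point
    buffers (and the padding/dropping they imply) of hard voxelization.
    points: (N, 3+F) tensor of xyz coordinates plus optional point features."""
    xyz = points[:, :3]
    n_bins = int(2 * pc_range / voxel_size)
    # Integer voxel coordinates inside a square region of +/- pc_range metres.
    coords = ((xyz + pc_range) / voxel_size).long().clamp(0, n_bins - 1)
    # Flatten the 3D voxel coordinate into a single integer key per point.
    keys = (coords[:, 0] * n_bins + coords[:, 1]) * n_bins + coords[:, 2]
    uniq, inverse = torch.unique(keys, return_inverse=True)
    # Scatter-mean the point features into their voxels.
    feats = torch.zeros(uniq.shape[0], points.shape[1], device=points.device)
    counts = torch.zeros(uniq.shape[0], 1, device=points.device)
    feats.index_add_(0, inverse, points)
    counts.index_add_(0, inverse, torch.ones(points.shape[0], 1, device=points.device))
    return feats / counts, uniq
```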
arXiv Detail & Related papers (2023-01-10T16:22:04Z)
- Video based Object 6D Pose Estimation using Transformers [6.951360830202521]
VideoPose is an end-to-end, attention-based architecture that attends to previous frames to estimate 6D object poses in videos.
Our architecture captures and reasons over long-range dependencies efficiently, iteratively refining its estimates over video sequences.
Our approach is on par with state-of-the-art Transformer-based methods and performs significantly better than CNN-based approaches.
arXiv Detail & Related papers (2022-10-24T18:45:53Z)
- CRT-6D: Fast 6D Object Pose Estimation with Cascaded Refinement Transformers [51.142988196855484]
This paper introduces a novel method we call Cascaded Refinement Transformers, or CRT-6D.
We replace the commonly used dense intermediate representation with a sparse set of features sampled from the feature pyramid, which we call Os (Object Keypoint Features), where each element corresponds to an object keypoint.
We achieve inference 2x faster than the closest real-time state-of-the-art methods while supporting up to 21 objects on a single model.
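The summary only names the sparse representation (Os, Object Keypoint Features) without describing how it is built. The sketch below shows a generic way to sample sparse per-keypoint features from a feature pyramid with bilinear interpolation; the function name, shapes, and per-level concatenation are illustrative assumptions, not CRT-6D's actual construction.

```python
import torch
import torch.nn.functional as F


def sample_keypoint_features(pyramid, keypoints_2d, image_size):
    """Sample a sparse set of per-keypoint features from a feature pyramid,
    instead of carrying a dense intermediate representation.
    pyramid: list of (B, C, H_l, W_l) feature maps.
    keypoints_2d: (B, K, 2) float pixel coordinates of projected keypoints.
    image_size: (height, width) of the input image."""
    h, w = image_size
    # Normalise pixel coordinates to [-1, 1] for grid_sample.
    grid = keypoints_2d.clone()
    grid[..., 0] = grid[..., 0] / (w - 1) * 2 - 1
    grid[..., 1] = grid[..., 1] / (h - 1) * 2 - 1
    grid = grid.unsqueeze(2)                     # (B, K, 1, 2)
    feats = []
    for level in pyramid:
        # (B, C, K, 1) -> (B, K, C): bilinearly sampled feature per keypoint.
        sampled = F.grid_sample(level, grid, align_corners=True)
        feats.append(sampled.squeeze(-1).permute(0, 2, 1))
    # Concatenate per-level features along the channel dimension.
    return torch.cat(feats, dim=-1)              # (B, K, C_total)
```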
arXiv Detail & Related papers (2022-10-21T04:06:52Z)
- YOLO-ReT: Towards High Accuracy Real-time Object Detection on Edge GPUs [14.85882314822983]
In order to map deep neural network (DNN) based object detection models to edge devices, one typically needs to compress such models significantly.
In this paper, we propose a novel edge GPU friendly module for multi-scale feature interaction.
We also propose a novel transfer learning backbone adoption inspired by the changing translational information flow across various tasks.
arXiv Detail & Related papers (2021-10-26T14:02:59Z)
- BundleTrack: 6D Pose Tracking for Novel Objects without Instance or Category-Level 3D Models [1.14219428942199]
This work proposes BundleTrack, a general framework for 6D pose tracking of objects.
An efficient implementation provides real-time performance of 10 Hz for the entire framework.
arXiv Detail & Related papers (2021-08-01T18:14:46Z)
- When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable.
To achieve better accuracy, we propose two lightweight modules.
DQInit dynamically initializes the decoder queries from the inputs, enabling the model to achieve accuracy as good as models with multiple decoder layers.
QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
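The one-line description of QAMem ("separate memory values to each query rather than a shared one") can be made concrete with a small sketch. The parameterization below (one value projection per query inside a single-head attention step) is an assumption chosen to illustrate the idea, not the module proposed in the paper.

```python
import torch
import torch.nn as nn


class QueryAwareMemory(nn.Module):
    """Sketch of per-query memory values: each query applies its own value
    projection to the (low-resolution) feature-map memory, rather than all
    queries attending over one shared set of values."""

    def __init__(self, num_queries: int, embed_dim: int):
        super().__init__()
        # One value projection per query (hypothetical parameterisation).
        self.value_proj = nn.Parameter(
            torch.randn(num_queries, embed_dim, embed_dim) * embed_dim ** -0.5
        )
        self.key_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, queries: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # queries: (B, Q, C); memory: (B, HW, C) flattened feature map.
        keys = self.key_proj(memory)                                # (B, HW, C)
        attn = torch.softmax(
            queries @ keys.transpose(1, 2) / queries.shape[-1] ** 0.5, dim=-1
        )                                                           # (B, Q, HW)
        # Per-query values, (B, Q, HW, C); written for clarity, not efficiency.
        values = torch.einsum("bmc,qcd->bqmd", memory, self.value_proj)
        out = torch.einsum("bqm,bqmd->bqd", attn, values)           # (B, Q, C)
        return queries + out
```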
arXiv Detail & Related papers (2021-05-27T13:51:42Z)
- 3D Object Detection and Tracking Based on Streaming Data [9.085584050311178]
We set up a dual-way network for 3D object detection based on key frames, and then propagate predictions to non-key frames through a motion-based algorithm guided by temporal information.
Our framework not only shows significant improvements over the frame-by-frame paradigm, but also produces competitive results on the KITTI Object Tracking Benchmark.
arXiv Detail & Related papers (2020-09-14T03:15:41Z)