LiDAR-based Online 3D Video Object Detection with Graph-based Message
Passing and Spatiotemporal Transformer Attention
- URL: http://arxiv.org/abs/2004.01389v1
- Date: Fri, 3 Apr 2020 06:06:52 GMT
- Authors: Junbo Yin, Jianbing Shen, Chenye Guan, Dingfu Zhou, Ruigang Yang
- Abstract summary: 3D object detectors usually focus on single-frame detection, ignoring the information in consecutive point cloud frames.
In this paper, we propose an end-to-end online 3D video object detector that operates on point sequences.
- Score: 100.52873557168637
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing LiDAR-based 3D object detectors usually focus on single-frame
detection, ignoring the spatiotemporal information in consecutive point
cloud frames. In this paper, we propose an end-to-end online 3D video object
detector that operates on point cloud sequences. The proposed model comprises a
spatial feature encoding component and a spatiotemporal feature aggregation
component. In the former component, a novel Pillar Message Passing Network
(PMPNet) is proposed to encode each discrete point cloud frame. It adaptively
collects information for a pillar node from its neighbors by iterative message
passing, which effectively enlarges the receptive field of the pillar feature.
In the latter component, we propose an Attentive Spatiotemporal Transformer GRU
(AST-GRU) to aggregate the spatiotemporal information, which enhances the
conventional ConvGRU with an attentive memory gating mechanism. AST-GRU
contains a Spatial Transformer Attention (STA) module and a Temporal
Transformer Attention (TTA) module, which can emphasize the foreground objects
and align the dynamic objects, respectively. Experimental results demonstrate
that the proposed 3D video object detector achieves state-of-the-art
performance on the large-scale nuScenes benchmark.
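The iterative message passing in PMPNet can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a precomputed pillar adjacency matrix and uses simple element-wise max aggregation in place of the paper's learned GRU-style node updates, but it shows how repeated neighbor aggregation enlarges a pillar's effective receptive field.

```python
import numpy as np

def message_passing_step(node_feats, adjacency):
    """One iteration of message passing: each pillar node gathers
    features from its graph neighbors (max aggregation) and merges
    them into its own state."""
    updated = np.empty_like(node_feats)
    for i in range(node_feats.shape[0]):
        neighbors = np.nonzero(adjacency[i])[0]
        if neighbors.size == 0:
            updated[i] = node_feats[i]
            continue
        # Max-aggregate neighbor messages, then mix with the node's own state.
        message = node_feats[neighbors].max(axis=0)
        updated[i] = np.maximum(node_feats[i], message)
    return updated

def pmpnet_like_encode(node_feats, adjacency, num_iters=3):
    """Iterate message passing: after k iterations each node has
    absorbed information from its k-hop neighborhood."""
    feats = node_feats
    for _ in range(num_iters):
        feats = message_passing_step(feats, adjacency)
    return feats
```

For example, on a chain graph 0-1-2, node 0 only sees node 1 after one iteration, but after two iterations it has also absorbed node 2's feature, which is the receptive-field enlargement the abstract refers to.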
Related papers
- PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection [66.94819989912823]
We propose a point-trajectory transformer with long short-term memory for efficient temporal 3D object detection.
We use point clouds of current-frame objects and their historical trajectories as input to minimize the memory bank storage requirement.
We conduct extensive experiments on the large-scale dataset to demonstrate that our approach performs well against state-of-the-art methods.
arXiv Detail & Related papers (2023-12-13T18:59:13Z)
- MGTANet: Encoding Sequential LiDAR Points Using Long Short-Term Motion-Guided Temporal Attention for 3D Object Detection [8.305942415868042]
Most LiDAR sensors generate a sequence of point clouds in real time.
Recent studies have revealed that substantial performance improvement can be achieved by exploiting the context present in a sequence of point sets.
We propose a novel 3D object detection architecture, which can encode point cloud sequences acquired by multiple successive scans.
arXiv Detail & Related papers (2022-12-01T11:24:47Z)
- D-Align: Dual Query Co-attention Network for 3D Object Detection Based on Multi-frame Point Cloud Sequence [8.21339007493213]
Conventional 3D object detectors detect objects using a set of points acquired over a fixed duration.
Recent studies have shown that the performance of object detection can be further enhanced by utilizing point cloud sequences.
We propose D-Align, which can effectively produce strong bird's-eye-view (BEV) features by aligning and aggregating the features obtained from a sequence of point sets.
arXiv Detail & Related papers (2022-09-30T20:41:25Z)
- TransPillars: Coarse-to-Fine Aggregation for Multi-Frame 3D Object Detection [47.941714033657675]
3D object detection using point clouds has attracted increasing attention due to its wide applications in autonomous driving and robotics.
We design TransPillars, a novel transformer-based feature aggregation technique that exploits temporal features of consecutive point cloud frames.
Our proposed TransPillars achieves state-of-the-art performance compared to existing multi-frame detection approaches.
arXiv Detail & Related papers (2022-08-04T15:41:43Z)
- Graph Neural Network and Spatiotemporal Transformer Attention for 3D Video Object Detection from Point Clouds [94.21415132135951]
We propose to detect 3D objects by exploiting temporal information in multiple frames.
We implement our algorithm based on prevalent anchor-based and anchor-free detectors.
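The spatiotemporal transformer attention that this line of work builds on can be sketched as plain scaled dot-product self-attention over a flattened bird's-eye-view (BEV) feature map. This is an illustrative sketch, not any paper's exact module: the projection matrices `Wq`, `Wk`, `Wv` and the single-head, single-frame formulation are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_transformer_attention(bev, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over BEV pixels.

    bev: (H, W, C) feature map; Wq/Wk/Wv: (C, D) projection matrices.
    Every pixel attends to every other pixel, so salient foreground
    locations can reweight the whole map.
    """
    H, W, C = bev.shape
    x = bev.reshape(H * W, C)                       # flatten spatial grid
    q, k, v = x @ Wq, x @ Wk, x @ Wv                # project to Q, K, V
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (HW, HW) attention weights
    return (attn @ v).reshape(H, W, -1)
```

A temporal variant would attend from the current frame's queries to a previous frame's keys and values instead, which is how attention can align dynamic objects across frames.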
arXiv Detail & Related papers (2022-07-26T05:16:28Z)
- RBGNet: Ray-based Grouping for 3D Object Detection [104.98776095895641]
We propose the RBGNet framework, a voting-based 3D detector for accurate 3D object detection from point clouds.
We propose a ray-based feature grouping module, which aggregates the point-wise features on object surfaces using a group of determined rays.
Our model achieves state-of-the-art 3D detection performance on ScanNet V2 and SUN RGB-D with remarkable performance gains.
arXiv Detail & Related papers (2022-04-05T14:42:57Z)
- Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z)
- Temporal-Channel Transformer for 3D Lidar-Based Video Object Detection in Autonomous Driving [121.44554957537613]
We propose a new transformer, called Temporal-Channel Transformer, to model spatial-temporal and channel-wise relationships for video object detection from LiDAR data.
Specifically, the temporal-channel encoder of the transformer is designed to encode the information of different channels and frames.
We achieve the state-of-the-art performance in grid voxel-based 3D object detection on the nuScenes benchmark.
arXiv Detail & Related papers (2020-11-27T09:35:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.