Temporal-Channel Transformer for 3D Lidar-Based Video Object Detection
in Autonomous Driving
- URL: http://arxiv.org/abs/2011.13628v1
- Date: Fri, 27 Nov 2020 09:35:39 GMT
- Title: Temporal-Channel Transformer for 3D Lidar-Based Video Object Detection
in Autonomous Driving
- Authors: Zhenxun Yuan, Xiao Song, Lei Bai, Wengang Zhou, Zhe Wang, Wanli Ouyang
- Abstract summary: We propose a new transformer, called Temporal-Channel Transformer, to model the spatial-temporal domain and channel domain relationships for video object detection from Lidar data.
Specifically, the temporal-channel encoder of the transformer is designed to encode the information of different channels and frames.
We achieve state-of-the-art performance in grid voxel-based 3D object detection on the nuScenes benchmark.
- Score: 121.44554957537613
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The strong demand for autonomous driving in the industry has led to strong
interest in 3D object detection and resulted in many excellent 3D object
detection algorithms. However, the vast majority of algorithms model only
single-frame data, ignoring the temporal information in the data sequence.
In this work, we propose a new transformer, called Temporal-Channel
Transformer, to model the spatial-temporal domain and channel domain
relationships for video object detection from Lidar data. As a special design
of this transformer, the information encoded in the encoder is different from
that in the decoder, i.e. the encoder encodes temporal-channel information of
multiple frames while the decoder decodes the spatial-channel information for
the current frame in a voxel-wise manner. Specifically, the temporal-channel
encoder of the transformer is designed to encode the information of different
channels and frames by utilizing the correlation among features from different
channels and frames. On the other hand, the spatial decoder of the transformer
will decode the information for each location of the current frame. Before
conducting object detection with the detection head, a gate mechanism is
deployed to re-calibrate the features of the current frame; it filters out
object-irrelevant information by repetitively refining the representation of the
target frame along with the up-sampling process. Experimental results show that
we achieve state-of-the-art performance in grid voxel-based 3D object
detection on the nuScenes benchmark.
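The abstract describes three ingredients: an encoder that attends over temporal-channel tokens from multiple frames, a decoder that attends voxel-wise (spatially) for the current frame, and a gate that re-calibrates the current-frame features during up-sampling. The following is a minimal sketch of that structure in PyTorch, under stated assumptions: the backbone features, tensor shapes, pooling of the memory over frames, and all module and parameter names (e.g. TemporalChannelEncoder, GatedSpatialDecoder) are illustrative and are not the authors' implementation.

```python
# Minimal, illustrative sketch of the temporal-channel encoder / gated spatial
# decoder idea from the abstract. Shapes and names are assumptions for illustration.
import torch
import torch.nn as nn


class TemporalChannelEncoder(nn.Module):
    """Self-attention over (frame, channel) tokens built from multi-frame features."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W); one token per (frame, channel) pair,
        # with the flattened spatial map (H*W) as the token embedding.
        b, t, c, h, w = frames.shape
        tokens = frames.reshape(b, t * c, h * w)          # (B, T*C, H*W)
        attended, _ = self.attn(tokens, tokens, tokens)   # correlate channels and frames
        return self.norm(tokens + attended).reshape(b, t, c, h, w)


class GatedSpatialDecoder(nn.Module):
    """Cross-attend each current-frame voxel token to the encoded memory,
    then re-calibrate the current-frame features with a sigmoid gate."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, current: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # current: (B, C, H, W) target-frame features
        # memory:  (B, T, C, H, W) temporal-channel encoded features
        b, c, h, w = current.shape
        q = current.flatten(2).transpose(1, 2)            # (B, H*W, C) voxel-wise queries
        kv = memory.flatten(3).mean(1).transpose(1, 2)    # (B, H*W, C) pooled over frames
        decoded, _ = self.cross_attn(q, kv, kv)
        gated = self.gate(decoded) * q + decoded          # gate filters irrelevant features
        return gated.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    B, T, C, H, W = 2, 3, 64, 32, 32
    frames = torch.randn(B, T, C, H, W)
    memory = TemporalChannelEncoder(dim=H * W)(frames)
    out = GatedSpatialDecoder(dim=C)(frames[:, -1], memory)
    print(out.shape)  # (B, C, H, W) re-calibrated current-frame features
```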
Related papers
- Transformer-based stereo-aware 3D object detection from binocular images [82.85433941479216]
We explore the model design of Transformers in binocular 3D object detection.
To achieve this goal, we present TS3D, a Stereo-aware 3D object detector.
Our proposed TS3D achieves a 41.29% Moderate Car detection average precision on the KITTI test set and takes 88 ms to detect objects from each binocular image pair.
arXiv Detail & Related papers (2023-04-24T08:29:45Z) - Pedestrian Spatio-Temporal Information Fusion For Video Anomaly
Detection [1.5736899098702974]
An anomaly detection method is proposed that fuses pedestrian spatio-temporal information.
Anomalies are detected from the difference between the predicted output frame and the ground truth.
The experimental results on the CUHK Avenue and ShanghaiTech datasets show that the proposed method is superior to the current mainstream video anomaly detection methods.
arXiv Detail & Related papers (2022-11-18T06:41:02Z) - Focused Decoding Enables 3D Anatomical Detection by Transformers [64.36530874341666]
We propose a novel Detection Transformer for 3D anatomical structure detection, dubbed Focused Decoder.
Focused Decoder leverages information from an anatomical region atlas to simultaneously deploy query anchors and restrict the cross-attention's field of view.
We evaluate our proposed approach on two publicly available CT datasets and demonstrate that Focused Decoder not only provides strong detection results and thus alleviates the need for a vast amount of annotated data but also exhibits exceptional and highly intuitive explainability of results via attention weights.
arXiv Detail & Related papers (2022-07-21T22:17:21Z) - TransVOD: End-to-end Video Object Detection with Spatial-Temporal
Transformers [96.981282736404]
We present TransVOD, the first end-to-end video object detection system based on spatial-temporal Transformer architectures.
Our proposed TransVOD++ sets a new state-of-the-art record in terms of accuracy on ImageNet VID with 90.0% mAP.
Our proposed TransVOD Lite also achieves the best speed and accuracy trade-off with 83.7% mAP while running at around 30 FPS.
arXiv Detail & Related papers (2022-01-13T16:17:34Z) - End-to-End Video Object Detection with Spatial-Temporal Transformers [33.40462554784311]
We present TransVOD, an end-to-end video object detection model based on a spatial-temporal Transformer architecture.
Our method does not need complicated post-processing methods such as Seq-NMS or Tubelet rescoring.
These designs boost the strong baseline deformable DETR by a significant margin (3%-4% mAP) on the ImageNet VID dataset.
arXiv Detail & Related papers (2021-05-23T11:44:22Z) - Learning Spatio-Temporal Transformer for Visual Tracking [108.11680070733598]
We present a new tracking architecture with an encoder-decoder transformer as the key component.
The whole method is end-to-end and does not need any post-processing steps such as cosine windowing or bounding-box smoothing.
The proposed tracker achieves state-of-the-art performance on five challenging short-term and long-term benchmarks while running at real-time speed, 6x faster than Siam R-CNN.
arXiv Detail & Related papers (2021-03-31T15:19:19Z) - LiDAR-based Online 3D Video Object Detection with Graph-based Message
Passing and Spatiotemporal Transformer Attention [100.52873557168637]
3D object detectors usually focus on single-frame detection, ignoring the information in consecutive point cloud frames.
In this paper, we propose an end-to-end online 3D video object detector that operates on point sequences.
arXiv Detail & Related papers (2020-04-03T06:06:52Z)