TransVOD: End-to-end Video Object Detection with Spatial-Temporal
Transformers
- URL: http://arxiv.org/abs/2201.05047v2
- Date: Fri, 14 Jan 2022 07:19:08 GMT
- Authors: Qianyu Zhou, Xiangtai Li, Lu He, Yibo Yang, Guangliang Cheng, Yunhai
Tong, Lizhuang Ma, Dacheng Tao
- Abstract summary: We present TransVOD, the first end-to-end video object detection system based on spatial-temporal Transformer architectures.
Our proposed TransVOD++ sets a new state-of-the-art record in terms of accuracy on ImageNet VID with 90.0% mAP.
Our proposed TransVOD Lite also achieves the best speed and accuracy trade-off with 83.7% mAP while running at around 30 FPS.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detection Transformer (DETR) and Deformable DETR have been proposed to
eliminate the need for many hand-designed components in object detection while
demonstrating performance comparable to previous complex hand-crafted detectors.
However, their performance on Video Object Detection (VOD) has not been well
explored. In this paper, we present TransVOD, the first end-to-end video object
detection system based on spatial-temporal Transformer architectures. The first
goal of this paper is to streamline the pipeline of VOD, effectively removing
the need for many hand-crafted components for feature aggregation, e.g.,
optical flow model, relation networks. Moreover, benefiting from the object query
design in DETR, our method does not need complicated post-processing methods
such as Seq-NMS. In particular, we present a temporal Transformer to aggregate
both the spatial object queries and the feature memories of each frame. Our
temporal transformer consists of two components: Temporal Query Encoder (TQE)
to fuse object queries, and Temporal Deformable Transformer Decoder (TDTD) to
obtain current frame detection results. These designs boost the strong baseline
deformable DETR by a significant margin (3%-4% mAP) on the ImageNet VID
dataset. Then, we present two improved versions of TransVOD including
TransVOD++ and TransVOD Lite. The former fuses object-level information into
the object queries via dynamic convolution, while the latter models an entire
video clip in a single pass to speed up inference. We give a detailed analysis
of all three models in the experiment part. In particular, our proposed
TransVOD++ sets a new state-of-the-art record in terms of accuracy on ImageNet
VID with 90.0% mAP. Our proposed TransVOD Lite also achieves the best speed and
accuracy trade-off with 83.7% mAP while running at around 30 FPS on a single
V100 GPU device. Code and models will be available for further research.
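The central idea of the Temporal Query Encoder described above, fusing the per-frame object queries across time with attention, can be sketched as a toy example. This is plain NumPy scaled dot-product self-attention over queries from all frames, not the authors' implementation; the function name `temporal_query_fusion` and the tensor shapes are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_query_fusion(frame_queries):
    """Toy stand-in for a Temporal Query Encoder: mix the object
    queries of all frames with scaled dot-product self-attention.

    frame_queries: array of shape (T, N, D) = frames x queries x dims.
    Returns an array of the same shape, where each query is an
    attention-weighted combination of queries from every frame.
    """
    T, N, D = frame_queries.shape
    q = frame_queries.reshape(T * N, D)       # flatten the temporal axis
    scores = q @ q.T / np.sqrt(D)             # (T*N, T*N) attention logits
    fused = softmax(scores, axis=-1) @ q      # attention-weighted mixing
    return fused.reshape(T, N, D)

rng = np.random.default_rng(0)
queries = rng.standard_normal((4, 5, 8))      # 4 frames, 5 queries, dim 8
out = temporal_query_fusion(queries)
print(out.shape)                              # (4, 5, 8)
```

In the actual model the fused queries would then condition a decoder (the TDTD) that attends to the current frame's feature memory to produce the final detections; the sketch only illustrates the cross-frame query aggregation step.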
Related papers
- SODFormer: Streaming Object Detection with Transformer Using Events and
Frames [31.293847706713052]
The DAVIS camera, which streams two complementary sensing modalities, asynchronous events and frames, has gradually been used to address major object detection challenges.
We propose SODFormer, a novel streaming object detector that first integrates events and frames to continuously detect objects in an asynchronous manner.
arXiv Detail & Related papers (2023-08-08T04:53:52Z) - Spatio-Temporal Learnable Proposals for End-to-End Video Object
Detection [12.650574326251023]
We present SparseVOD, a novel video object detection pipeline that employs Sparse R-CNN to exploit temporal information.
Our method significantly improves the single-frame Sparse R-CNN baseline by 8%-9% mAP.
arXiv Detail & Related papers (2022-10-05T16:17:55Z) - MODETR: Moving Object Detection with Transformers [2.4366811507669124]
Moving Object Detection (MOD) is a crucial task for the Autonomous Driving pipeline.
In this paper, we tackle this problem through multi-head attention mechanisms, both across the spatial and motion streams.
We propose MODETR; a Moving Object DEtection TRansformer network, comprised of multi-stream transformers for both spatial and motion modalities.
arXiv Detail & Related papers (2021-06-21T21:56:46Z) - End-to-End Video Object Detection with Spatial-Temporal Transformers [33.40462554784311]
We present TransVOD, an end-to-end video object detection model based on a spatial-temporal Transformer architecture.
Our method does not need complicated post-processing methods such as Seq-NMS or Tubelet rescoring.
These designs boost the strong baseline deformable DETR by a significant margin (3%-4% mAP) on the ImageNet VID dataset.
arXiv Detail & Related papers (2021-05-23T11:44:22Z) - TransMOT: Spatial-Temporal Graph Transformer for Multiple Object
Tracking [74.82415271960315]
We propose a solution named TransMOT to efficiently model the spatial and temporal interactions among objects in a video.
TransMOT is not only more computationally efficient than the traditional Transformer, but it also achieves better tracking accuracy.
The proposed method is evaluated on multiple benchmark datasets including MOT15, MOT16, MOT17, and MOT20.
arXiv Detail & Related papers (2021-04-01T01:49:05Z) - Spatiotemporal Transformer for Video-based Person Re-identification [102.58619642363958]
We show that, despite the strong learning ability, the vanilla Transformer suffers from an increased risk of over-fitting.
We propose a novel pipeline where the model is pre-trained on a set of synthesized video data and then transferred to the downstream domains.
The derived algorithm achieves significant accuracy gain on three popular video-based person re-identification benchmarks.
arXiv Detail & Related papers (2021-03-30T16:19:27Z) - Temporal-Channel Transformer for 3D Lidar-Based Video Object Detection
in Autonomous Driving [121.44554957537613]
We propose a new transformer, called Temporal-Channel Transformer, to model the spatial-temporal domain and channel domain relationships for video object detection from Lidar data.
Specifically, the temporal-channel encoder of the transformer is designed to encode the information of different channels and frames.
We achieve the state-of-the-art performance in grid voxel-based 3D object detection on the nuScenes benchmark.
arXiv Detail & Related papers (2020-11-27T09:35:39Z) - Fast Video Object Segmentation With Temporal Aggregation Network and
Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object Segmentation (VOS).
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance on the DAVIS benchmark in both speed and accuracy without complicated bells and whistles, running at 0.14 seconds per frame with a J&F measure of 75.9%.
arXiv Detail & Related papers (2020-07-11T05:44:16Z) - LiDAR-based Online 3D Video Object Detection with Graph-based Message
Passing and Spatiotemporal Transformer Attention [100.52873557168637]
3D object detectors usually focus on single-frame detection, ignoring the information in consecutive point cloud frames.
In this paper, we propose an end-to-end online 3D video object detector that operates on point sequences.
arXiv Detail & Related papers (2020-04-03T06:06:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.