End-to-End Video Object Detection with Spatial-Temporal Transformers
- URL: http://arxiv.org/abs/2105.10920v1
- Date: Sun, 23 May 2021 11:44:22 GMT
- Title: End-to-End Video Object Detection with Spatial-Temporal Transformers
- Authors: Lu He, Qianyu Zhou, Xiangtai Li, Li Niu, Guangliang Cheng, Xiao Li,
Wenxuan Liu, Yunhai Tong, Lizhuang Ma, Liqing Zhang
- Abstract summary: We present TransVOD, an end-to-end video object detection model based on a spatial-temporal Transformer architecture.
Our method does not need complicated post-processing methods such as Seq-NMS or Tubelet rescoring.
These designs boost the strong baseline deformable DETR by a significant margin (3%-4% mAP) on the ImageNet VID dataset.
- Score: 33.40462554784311
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, DETR and Deformable DETR have been proposed to eliminate the need
for many hand-designed components in object detection while demonstrating
performance comparable to previous complex hand-crafted detectors. However, their
performance on Video Object Detection (VOD) has not been well explored. In this
paper, we present TransVOD, an end-to-end video object detection model based on
a spatial-temporal Transformer architecture. The goal of this paper is to
streamline the pipeline of VOD, effectively removing the need for many
hand-crafted components for feature aggregation, e.g., optical flow, recurrent
neural networks, and relation networks. Moreover, benefiting from the object query
design in DETR, our method does not need complicated post-processing methods
such as Seq-NMS or tubelet rescoring, which keeps the pipeline simple and
clean. In particular, we present a temporal Transformer to aggregate both the
spatial object queries and the feature memories of each frame. Our temporal
Transformer consists of three components: Temporal Deformable Transformer
Encoder (TDTE) to encode multi-frame spatial details, Temporal Query
Encoder (TQE) to fuse object queries, and Temporal Deformable Transformer
Decoder to obtain current-frame detection results. These designs boost the
strong baseline Deformable DETR by a significant margin (3%-4% mAP) on the
ImageNet VID dataset. TransVOD yields comparable performance on the
ImageNet VID benchmark. We hope our TransVOD can provide a new perspective
for video object detection. Code will be made publicly available at
https://github.com/SJTU-LuHe/TransVOD.
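The three components named in the abstract can be read as a simple encode-fuse-decode pipeline over the per-frame outputs (feature memories and object queries) of a spatial, Deformable-DETR-style Transformer. The PyTorch sketch below illustrates that dataflow only; the module sizes, the `TemporalTransformerSketch` name, and the use of standard multi-head attention in place of the deformable attention used in the paper are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
# Minimal sketch of the encode-fuse-decode dataflow described in the abstract.
# All dimensions and module choices are assumptions; standard attention is used
# as a stand-in for the deformable attention of the actual TransVOD model.
import torch
import torch.nn as nn


class TemporalTransformerSketch(nn.Module):
    """Toy stand-in for the temporal Transformer over per-frame outputs."""

    def __init__(self, d_model=256, num_queries=100, num_classes=31):
        super().__init__()
        self.num_queries = num_queries
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        # Stand-in for TDTE: encodes the concatenated feature memories of all frames.
        self.temporal_feat_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Stand-in for TQE: fuses the spatial object queries of all frames.
        self.temporal_query_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Stand-in for the Temporal Deformable Transformer Decoder.
        self.temporal_decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.class_head = nn.Linear(d_model, num_classes)  # e.g. 30 VID classes + no-object
        self.box_head = nn.Linear(d_model, 4)              # (cx, cy, w, h), normalized

    def forward(self, frame_memories, frame_queries):
        # frame_memories: (B, T*HW, d) flattened per-frame feature memories
        # frame_queries:  (B, T*Q, d)  spatial object queries of T frames
        memory = self.temporal_feat_encoder(frame_memories)  # encode multi-frame spatial details
        fused = self.temporal_query_encoder(frame_queries)   # fuse object queries across frames
        cur = fused[:, -self.num_queries:, :]                # keep queries of the current (last) frame
        hs = self.temporal_decoder(cur, memory)              # decode current-frame detections
        return self.class_head(hs), self.box_head(hs).sigmoid()


# Toy usage: 2 clips, 4 frames, 100 queries per frame, 19x19 feature maps, 256-d features.
B, T, Q, HW, D = 2, 4, 100, 19 * 19, 256
model = TemporalTransformerSketch(d_model=D, num_queries=Q)
logits, boxes = model(torch.randn(B, T * HW, D), torch.randn(B, T * Q, D))
print(logits.shape, boxes.shape)  # torch.Size([2, 100, 31]) torch.Size([2, 100, 4])
```

In this reading, the per-frame memories and queries would come from a shared single-frame detector applied to each frame, and only the current frame's fused queries are decoded into boxes and class scores.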
Related papers
- SODFormer: Streaming Object Detection with Transformer Using Events and Frames [31.293847706713052]
The DAVIS camera, streaming two complementary sensing modalities of asynchronous events and frames, has gradually been used to address major object detection challenges.
We propose SODFormer, a novel streaming object detector with Transformer, which first integrates events and frames to continuously detect objects in an asynchronous manner.
arXiv Detail & Related papers (2023-08-08T04:53:52Z)
- FAQ: Feature Aggregated Queries for Transformer-based Video Object Detectors [37.38250825377456]
We take a different perspective on video object detection. In detail, we improve the quality of the queries for Transformer-based models by aggregation.
On the challenging ImageNet VID benchmark, when integrated with our proposed modules, the current state-of-the-art Transformer-based object detectors can be improved by more than 2.4% on mAP and 4.2% on AP50.
arXiv Detail & Related papers (2023-03-15T02:14:56Z)
- Graph Neural Network and Spatiotemporal Transformer Attention for 3D Video Object Detection from Point Clouds [94.21415132135951]
We propose to detect 3D objects by exploiting temporal information in multiple frames.
We implement our algorithm based on prevalent anchor-based and anchor-free detectors.
arXiv Detail & Related papers (2022-07-26T05:16:28Z)
- TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers [96.981282736404]
We present TransVOD, the first end-to-end video object detection system based on spatial-temporal Transformer architectures.
Our proposed TransVOD++ sets a new state-of-the-art record in terms of accuracy on ImageNet VID with 90.0% mAP.
Our proposed TransVOD Lite also achieves the best speed and accuracy trade-off with 83.7% mAP while running at around 30 FPS.
arXiv Detail & Related papers (2022-01-13T16:17:34Z)
- Spatiotemporal Transformer for Video-based Person Re-identification [102.58619642363958]
We show that, despite the strong learning ability, the vanilla Transformer suffers from an increased risk of over-fitting.
We propose a novel pipeline where the model is pre-trained on a set of synthesized video data and then transferred to the downstream domains.
The derived algorithm achieves significant accuracy gain on three popular video-based person re-identification benchmarks.
arXiv Detail & Related papers (2021-03-30T16:19:27Z)
- Temporal-Channel Transformer for 3D Lidar-Based Video Object Detection in Autonomous Driving [121.44554957537613]
We propose a new transformer, called Temporal-Channel Transformer, to model the spatial-temporal domain and channel domain relationships for video object detection from Lidar data.
Specifically, the temporal-channel encoder of the transformer is designed to encode the information of different channels and frames.
We achieve state-of-the-art performance in grid voxel-based 3D object detection on the nuScenes benchmark.
arXiv Detail & Related papers (2020-11-27T09:35:39Z)
- End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a Transformer encoder-decoder architecture.
arXiv Detail & Related papers (2020-05-26T17:06:38Z)
- LiDAR-based Online 3D Video Object Detection with Graph-based Message Passing and Spatiotemporal Transformer Attention [100.52873557168637]
3D object detectors usually focus on single-frame detection, while ignoring the information in consecutive point cloud frames.
In this paper, we propose an end-to-end online 3D video object detector that operates on point sequences.
arXiv Detail & Related papers (2020-04-03T06:06:52Z)