Learning Spatio-Temporal Transformer for Visual Tracking
- URL: http://arxiv.org/abs/2103.17154v1
- Date: Wed, 31 Mar 2021 15:19:19 GMT
- Title: Learning Spatio-Temporal Transformer for Visual Tracking
- Authors: Bin Yan, Houwen Peng, Jianlong Fu, Dong Wang, Huchuan Lu
- Abstract summary: We present a new tracking architecture with an encoder-decoder transformer as the key component.
The whole method is end-to-end and does not need postprocessing steps such as cosine windowing or bounding box smoothing.
The proposed tracker achieves state-of-the-art performance on five challenging short-term and long-term benchmarks, while running at real-time speed, 6x faster than Siam R-CNN.
- Score: 108.11680070733598
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a new tracking architecture with an encoder-decoder
transformer as the key component. The encoder models the global spatio-temporal
feature dependencies between target objects and search regions, while the
decoder learns a query embedding to predict the spatial positions of the target
objects. Our method casts object tracking as a direct bounding box prediction
problem, without using any proposals or predefined anchors. With the
encoder-decoder transformer, the prediction of objects just uses a simple
fully-convolutional network, which estimates the corners of objects directly.
The whole method is end-to-end and does not need postprocessing steps such as
cosine windowing or bounding box smoothing, which largely simplifies existing
tracking pipelines. The proposed tracker achieves state-of-the-art performance
on five challenging short-term and long-term benchmarks, while running at
real-time speed, being 6x faster than Siam R-CNN. Code and models are
open-sourced at https://github.com/researchmm/Stark.
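The corner-prediction design described in the abstract is compact enough to sketch. The PyTorch snippet below is an illustrative reconstruction from the abstract alone, not the authors' released STARK code: the class name, layer sizes, and the query-to-search similarity weighting are assumptions; only the overall encoder-decoder-plus-corner-heatmap structure comes from the paper.

```python
import torch
import torch.nn as nn

class CornerTracker(nn.Module):
    """Minimal encoder-decoder transformer tracker (illustrative sketch).

    Template and search-region tokens are concatenated and passed through a
    transformer encoder; a single learned query attends to the encoder output;
    a small fully-convolutional head predicts heatmaps for the top-left and
    bottom-right corners of the target box, decoded with a soft-argmax.
    """

    def __init__(self, dim=256, heads=8, enc_layers=6, dec_layers=6, feat_size=20):
        super().__init__()
        self.feat_size = feat_size
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), enc_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, heads, batch_first=True), dec_layers)
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # one target query
        self.corner_head = nn.Sequential(                  # 2 corner heatmaps
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(), nn.Conv2d(dim, 2, 1))

    def forward(self, template_feat, search_feat):
        # template_feat: (B, Nt, dim); search_feat: (B, Ns, dim), Ns = feat_size**2
        B, Ns, D = search_feat.shape
        memory = self.encoder(torch.cat([template_feat, search_feat], dim=1))
        q = self.decoder(self.query.expand(B, -1, -1), memory)     # (B, 1, D)
        search_mem = memory[:, -Ns:, :]                            # (B, Ns, D)
        # Re-weight search tokens by their similarity to the target query.
        attn = torch.softmax(search_mem @ q.transpose(1, 2) / D ** 0.5, dim=1)
        fmap = (search_mem * attn).transpose(1, 2).reshape(
            B, D, self.feat_size, self.feat_size)
        heat = self.corner_head(fmap)                              # (B, 2, H, W)
        # Soft-argmax: expected (x, y) under a softmax over each heatmap.
        H = self.feat_size
        prob = torch.softmax(heat.flatten(2), dim=-1).view(B, 2, H, H)
        grid = torch.linspace(0, 1, H)
        cy = (prob.sum(dim=3) * grid).sum(dim=2)                   # (B, 2)
        cx = (prob.sum(dim=2) * grid).sum(dim=2)                   # (B, 2)
        return torch.stack([cx[:, 0], cy[:, 0], cx[:, 1], cy[:, 1]], dim=1)

# Toy usage with random tokens (in practice these come from a CNN backbone).
boxes = CornerTracker()(torch.randn(2, 64, 256), torch.randn(2, 400, 256))
print(boxes.shape)  # torch.Size([2, 4]) -- normalized (x1, y1, x2, y2)
```

Because the head regresses corner heatmaps directly from the fused features, there are no proposals or anchors to post-process, which is what lets the pipeline drop cosine windowing and box smoothing.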
Related papers
- STCMOT: Spatio-Temporal Cohesion Learning for UAV-Based Multiple Object Tracking [13.269416985959404]
Multiple object tracking (MOT) in Unmanned Aerial Vehicle (UAV) videos is important for diverse applications in computer vision.
We propose a novel Spatio-Temporal Cohesion Multiple Object Tracking framework (STCMOT).
It uses historical embedding features to model ReID and detection feature representations in sequential order.
Our framework sets a new state of the art in the MOTA and IDF1 metrics.
arXiv Detail & Related papers (2024-09-17T14:34:18Z)
- Exploring Dynamic Transformer for Efficient Object Tracking [58.120191254379854]
We propose DyTrack, a dynamic transformer framework for efficient tracking.
DyTrack automatically learns to configure proper reasoning routes for various inputs, making better use of the available computational budget.
Experiments on multiple benchmarks demonstrate that DyTrack achieves promising speed-precision trade-offs with only a single model.
arXiv Detail & Related papers (2024-03-26T12:31:58Z)
- Dense Optical Tracking: Connecting the Dots [82.79642869586587]
DOT is a novel, simple and efficient method for solving the problem of point tracking in a video.
We show that DOT is significantly more accurate than current optical flow techniques, outperforms sophisticated "universal trackers" like OmniMotion, and is on par with, or better than, the best point tracking algorithms like CoTracker.
arXiv Detail & Related papers (2023-12-01T18:59:59Z)
- TrajectoryFormer: 3D Object Tracking Transformer with Predictive Trajectory Hypotheses [51.60422927416087]
3D multi-object tracking (MOT) is vital for many applications including autonomous driving vehicles and service robots.
We present TrajectoryFormer, a novel point-cloud-based 3D MOT framework.
arXiv Detail & Related papers (2023-06-09T13:31:50Z)
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- TrTr: Visual Tracking with Transformer [29.415900191169587]
We propose a novel tracker network based on the Transformer encoder-decoder architecture and its powerful attention mechanism.
We design classification and regression heads that use the Transformer output to localize the target based on shape-agnostic anchors.
Our method performs favorably against state-of-the-art algorithms.
arXiv Detail & Related papers (2021-05-09T02:32:28Z)
- Faster object tracking pipeline for real time tracking [0.0]
Multi-object tracking (MOT) is a challenging practical problem for vision based applications.
This paper showcases a generic pipeline which can be used to speed up detection based object tracking methods.
arXiv Detail & Related papers (2020-11-08T06:33:48Z)
- End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture; a minimal sketch of such a matching loss appears after this list.
arXiv Detail & Related papers (2020-05-26T17:06:38Z)
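To make "set-based global loss" concrete, here is a minimal sketch of the bipartite-matching idea referenced in the DETR entry above. It is an illustrative reconstruction, not DETR's actual loss: the function name and cost weighting are assumptions, and the real loss also includes a generalized-IoU term and a no-object class for unmatched queries.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def set_matching_loss(pred_boxes, pred_logits, gt_boxes, gt_labels):
    """Hungarian-match N predicted boxes to M ground-truth boxes, then
    compute an L1 box loss and a classification loss over matched pairs.

    pred_boxes: (N, 4), pred_logits: (N, C), gt_boxes: (M, 4), gt_labels: (M,)
    """
    # Pairwise matching cost: L1 box distance minus the predicted
    # probability of the true class (lower cost = better match).
    prob = pred_logits.softmax(-1)                                      # (N, C)
    cost = torch.cdist(pred_boxes, gt_boxes, p=1) - prob[:, gt_labels]  # (N, M)
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    box_loss = (pred_boxes[rows] - gt_boxes[cols]).abs().sum(-1).mean()
    cls_loss = F.cross_entropy(pred_logits[rows], gt_labels[cols])
    return box_loss + cls_loss

# Toy usage: 5 predictions (queries), 2 ground-truth objects, 3 classes.
loss = set_matching_loss(torch.rand(5, 4), torch.randn(5, 3),
                         torch.rand(2, 4), torch.tensor([0, 2]))
```

Because every prediction is matched to at most one ground-truth object, the matching itself enforces unique predictions, which is what removes the need for non-maximum suppression in DETR-style detectors.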
This list is automatically generated from the titles and abstracts of the papers on this site.