PuTR: A Pure Transformer for Decoupled and Online Multi-Object Tracking
- URL: http://arxiv.org/abs/2405.14119v1
- Date: Thu, 23 May 2024 02:44:46 GMT
- Title: PuTR: A Pure Transformer for Decoupled and Online Multi-Object Tracking
- Authors: Chongwei Liu, Haojie Li, Zhihui Wang, Rui Xu,
- Abstract summary: We show that a pure Transformer can unify short- and long-term associations in a decoupled and online manner.
Experiments show that a classic Transformer architecture naturally suits the association problem and achieves a strong baseline.
This work pioneers a promising Transformer-based approach for the MOT task, and provides code to facilitate further research.
- Score: 36.5272157173876
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in Multi-Object Tracking (MOT) have achieved remarkable success in short-term association within the decoupled tracking-by-detection online paradigm. However, long-term tracking still remains a challenging task. Although graph-based approaches can address this issue by modeling trajectories as a graph in the decoupled manner, their non-online nature poses obstacles for real-time applications. In this paper, we demonstrate that the trajectory graph is a directed acyclic graph, which can be represented by an object sequence arranged by frame and a binary adjacency matrix. It is a coincidence that the binary matrix matches the attention mask in the Transformer, and the object sequence serves exactly as a natural input sequence. Intuitively, we propose that a pure Transformer can naturally unify short- and long-term associations in a decoupled and online manner. Our experiments show that a classic Transformer architecture naturally suits the association problem and achieves a strong baseline compared to existing foundational methods across four datasets: DanceTrack, SportsMOT, MOT17, and MOT20, as well as superior generalizability in domain shift. Moreover, the decoupled property also enables efficient training and inference. This work pioneers a promising Transformer-based approach for the MOT task, and provides code to facilitate further research. https://github.com/chongweiliu/PuTR
Related papers
- 3DMOTFormer: Graph Transformer for Online 3D Multi-Object Tracking [15.330384668966806]
State-of-the-art 3D multi-object tracking (MOT) approaches typically rely on non-learned model-based algorithms such as Kalman Filter.
We propose 3DMOTFormer, a learned geometry-based 3D MOT framework building upon the transformer architecture.
Our approach achieves 71.2% and 68.2% AMOTA on the nuScenes validation and test split, respectively.
arXiv Detail & Related papers (2023-08-12T19:19:58Z) - Graph Decision Transformer [83.76329715043205]
Graph Decision Transformer (GDT) is a novel offline reinforcement learning approach.
GDT models the input sequence into a causal graph to capture potential dependencies between fundamentally different concepts.
Our experiments show that GDT matches or surpasses the performance of state-of-the-art offline RL methods on image-based Atari and OpenAI Gym.
arXiv Detail & Related papers (2023-03-07T09:10:34Z) - Joint Spatial-Temporal and Appearance Modeling with Transformer for
Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z) - Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD)
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z) - TransMOT: Spatial-Temporal Graph Transformer for Multiple Object
Tracking [74.82415271960315]
We propose a solution named TransMOT to efficiently model the spatial and temporal interactions among objects in a video.
TransMOT is not only more computationally efficient than the traditional Transformer, but it also achieves better tracking accuracy.
The proposed method is evaluated on multiple benchmark datasets including MOT15, MOT16, MOT17, and MOT20.
arXiv Detail & Related papers (2021-04-01T01:49:05Z) - Transformer Tracking [76.96796612225295]
Correlation acts as a critical role in the tracking field, especially in popular Siamese-based trackers.
This work presents a novel attention-based feature fusion network, which effectively combines the template and search region features solely using attention.
Experiments show that our TransT achieves very promising results on six challenging datasets.
arXiv Detail & Related papers (2021-03-29T09:06:55Z) - Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual
Tracking [47.205979159070445]
We bridge the individual video frames and explore the temporal contexts across them via a transformer architecture for robust object tracking.
Different from classic usage of the transformer in natural language processing tasks, we separate its encoder and decoder into two parallel branches.
Our method sets several new state-of-the-art records on prevalent tracking benchmarks.
arXiv Detail & Related papers (2021-03-22T09:20:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.