Joint Spatial-Temporal and Appearance Modeling with Transformer for
Multiple Object Tracking
- URL: http://arxiv.org/abs/2205.15495v1
- Date: Tue, 31 May 2022 01:19:18 GMT
- Title: Joint Spatial-Temporal and Appearance Modeling with Transformer for
Multiple Object Tracking
- Authors: Peng Dai and Yiqiang Feng and Renliang Weng and Changshui Zhang
- Abstract summary: We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
- Score: 59.79252390626194
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent trend in multiple object tracking (MOT) is heading towards
leveraging deep learning to boost the tracking performance. In this paper, we
propose a novel solution named TransSTAM, which leverages Transformer to
effectively model both the appearance features of each object and the
spatial-temporal relationships among objects. TransSTAM consists of two major
parts: (1) The encoder utilizes the powerful self-attention mechanism of
Transformer to learn discriminative features for each tracklet; (2) The decoder
adopts the standard cross-attention mechanism to model the affinities between
the tracklets and the detections by taking both spatial-temporal and appearance
features into account. TransSTAM has two major advantages: (1) It is solely
based on the encoder-decoder architecture and enjoys a compact network design,
hence being computationally efficient; (2) It can effectively learn
spatial-temporal and appearance features within one model, hence achieving
better tracking accuracy. The proposed method is evaluated on multiple public
benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear
performance improvement in both IDF1 and HOTA with respect to previous
state-of-the-art approaches on all the benchmarks. Our code is available at
https://github.com/icicle4/TranSTAM.
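To make the encoder-decoder affinity idea from the abstract concrete, here is a minimal PyTorch sketch, not the authors' released implementation: the encoder self-attends over a tracklet's per-frame appearance and box features, the decoder cross-attends the current detections against that encoded history, and a linear head scores tracklet-detection affinities. All module names, feature dimensions, and the per-tracklet broadcasting of detections are illustrative assumptions; the actual TransSTAM additionally encodes spatial-temporal relations with dedicated encodings, which this sketch omits.

```python
# Illustrative sketch of a Transformer encoder-decoder affinity model
# (assumed names/dimensions, not the TransSTAM reference implementation).
import torch
import torch.nn as nn


class AffinitySketch(nn.Module):
    def __init__(self, app_dim=512, box_dim=4, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        # Project concatenated appearance + box features into the model width.
        self.embed = nn.Linear(app_dim + box_dim, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_layers,
            num_decoder_layers=num_layers,
            batch_first=True,
        )
        # Score head: one affinity logit per (tracklet, detection) pair.
        self.score = nn.Linear(d_model, 1)

    def forward(self, tracklet_feats, det_feats):
        # tracklet_feats: (num_tracklets, history_len, app_dim + box_dim)
        # det_feats:      (num_tracklets, num_dets,    app_dim + box_dim)
        #                 (detections broadcast per tracklet for simplicity)
        src = self.embed(tracklet_feats)    # encoder input: tracklet history
        tgt = self.embed(det_feats)         # decoder queries: detections
        out = self.transformer(src, tgt)    # detections cross-attend to the tracklet
        return self.score(out).squeeze(-1)  # (num_tracklets, num_dets) logits


# Toy usage: 3 tracklets with 5 frames of history, 4 candidate detections each.
model = AffinitySketch()
tracklets = torch.randn(3, 5, 516)
detections = torch.randn(3, 4, 516)
affinity = model(tracklets, detections)     # (3, 4) affinity logits
print(affinity.shape)
```

In practice the resulting logits would feed an assignment step (for example, the Hungarian algorithm) that links each detection to at most one tracklet.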
Related papers
- Deciphering Movement: Unified Trajectory Generation Model for Multi-Agent [53.637837706712794]
We propose a Unified Trajectory Generation model, UniTraj, that processes arbitrary trajectories as masked inputs.
Specifically, we introduce a Ghost Spatial Masking (GSM) module embedded within a Transformer encoder for spatial feature extraction.
We benchmark three practical sports game datasets, Basketball-U, Football-U, and Soccer-U, for evaluation.
arXiv Detail & Related papers (2024-05-27T22:15:23Z)
- Transformer Network for Multi-Person Tracking and Re-Identification in Unconstrained Environment [0.6798775532273751]
Multi-object tracking (MOT) has profound applications in a variety of fields, including surveillance, sports analytics, self-driving, and cooperative robotics.
We put forward an integrated MOT method that marries object detection and identity linkage within a singular, end-to-end trainable framework.
Our system leverages a robust temporal memory module that retains extensive historical observations and effectively encodes them using an attention-based aggregator.
arXiv Detail & Related papers (2023-12-19T08:15:22Z)
- Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
- STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition [50.064502884594376]
We study the problem of human action recognition using motion capture (MoCap) sequences.
We propose a novel Spatial-Temporal Mesh Transformer (STMT) to directly model the mesh sequences.
The proposed method achieves state-of-the-art performance compared to skeleton-based and point-cloud-based models.
arXiv Detail & Related papers (2023-03-31T16:19:27Z)
- Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
arXiv Detail & Related papers (2023-03-14T02:58:27Z)
- SMILEtrack: SiMIlarity LEarning for Occlusion-Aware Multiple Object Tracking [20.286114226299237]
This paper introduces SMILEtrack, an innovative object tracker with a Siamese network-based Similarity Learning Module (SLM).
The SLM calculates the appearance similarity between two objects, overcoming the limitations of feature descriptors in Separate Detection and Embedding models.
We also develop a Similarity Matching Cascade (SMC) module with a novel GATE function for robust object matching across consecutive video frames.
arXiv Detail & Related papers (2022-11-16T10:49:48Z)
- An Improved End-to-End Multi-Target Tracking Method Based on Transformer Self-Attention [24.17627001939523]
This study proposes an improved end-to-end multi-target tracking algorithm.
It adapts to multi-view, multi-scale scenes based on the self-attention mechanism of the transformer's encoder-decoder structure.
arXiv Detail & Related papers (2022-11-11T04:58:46Z)
- TrTr: Visual Tracking with Transformer [29.415900191169587]
We propose a novel tracker network based on a powerful attention mechanism, the Transformer encoder-decoder architecture.
We design classification and regression heads on the output of the Transformer to localize the target based on shape-agnostic anchors.
Our method performs favorably against state-of-the-art algorithms.
arXiv Detail & Related papers (2021-05-09T02:32:28Z)
- TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking [74.82415271960315]
We propose a solution named TransMOT to efficiently model the spatial and temporal interactions among objects in a video.
TransMOT is not only more computationally efficient than the traditional Transformer, but it also achieves better tracking accuracy.
The proposed method is evaluated on multiple benchmark datasets including MOT15, MOT16, MOT17, and MOT20.
arXiv Detail & Related papers (2021-04-01T01:49:05Z)