An Improved End-to-End Multi-Target Tracking Method Based on Transformer
Self-Attention
- URL: http://arxiv.org/abs/2211.06001v1
- Date: Fri, 11 Nov 2022 04:58:46 GMT
- Title: An Improved End-to-End Multi-Target Tracking Method Based on Transformer
Self-Attention
- Authors: Yong Hong, Deren Li, Shupei Luo, Xin Chen, Yi Yang, Mi Wang
- Abstract summary: This study proposes an improved end-to-end multi-target tracking algorithm.
It adapts to multi-view, multi-scale scenes using the self-attention mechanism of the transformer's encoder-decoder structure.
- Score: 24.17627001939523
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study proposes an improved end-to-end multi-target tracking algorithm
that adapts to multi-view, multi-scale scenes based on the self-attention
mechanism of the transformer's encoder-decoder structure. A multi-dimensional
feature extraction backbone network is combined with a self-built semantic
raster map, which is stored in the encoder for correlation and generates target
position encodings and multi-dimensional feature vectors. The decoder
incorporates four methods: spatial clustering and semantic filtering of
multi-view targets, dynamic matching of multi-dimensional features, space-time
logic-based multi-target tracking, and space-time convergence network
(STCN)-based parameter passing. Through the fusion of multiple decoding
methods, multi-camera targets are tracked along three dimensions: temporal logic,
spatial logic, and feature matching. On the MOT17 dataset, this study's method
significantly outperforms the current state-of-the-art method MiniTrackV2 [49]
by 2.2%, reaching 0.836 on the Multiple Object Tracking Accuracy (MOTA) metric.
Furthermore, this study proposes a retrospective mechanism for the first time,
adopting a reverse-order processing method to correct historically mislabeled
targets and thereby improve the Identification F1-score (IDF1). On the
self-built dataset OVIT-MOT01, IDF1 improves from 0.948 to 0.967, and
Multi-camera Tracking Accuracy (MCTA) improves from 0.878 to 0.909,
significantly improving continuous tracking accuracy and scene adaptability.
This method introduces a new attentional tracking paradigm that achieves
state-of-the-art performance on multi-target tracking tasks (MOT17 and
OVIT-MOT01).
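
The encoder-decoder mechanism described in the abstract can be pictured with a minimal sketch: the encoder ingests flattened frame features (a stand-in for the paper's semantic raster map), and decoder queries yield per-target boxes and identity embeddings. All names, sizes, and the learned queries here are illustrative assumptions, not the authors' released implementation:

```python
# Minimal sketch of a transformer encoder-decoder tracker in the spirit of
# the abstract above; everything here is an illustrative assumption.
import torch
import torch.nn as nn

class AttentionTracker(nn.Module):
    def __init__(self, dim=256, heads=8, num_queries=100):
        super().__init__()
        self.transformer = nn.Transformer(
            d_model=dim, nhead=heads,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.queries = nn.Parameter(torch.randn(num_queries, dim))  # target slots
        self.box_head = nn.Linear(dim, 4)                           # (cx, cy, w, h)

    def forward(self, frame_feats):
        # frame_feats: (B, N, dim) flattened multi-dimensional backbone features.
        q = self.queries.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        emb = self.transformer(frame_feats, q)   # encoder-decoder attention
        return self.box_head(emb), emb           # boxes + embeddings for matching

boxes, emb = AttentionTracker()(torch.randn(1, 400, 256))
print(boxes.shape, emb.shape)  # (1, 100, 4) and (1, 100, 256)
```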
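The retrospective mechanism also lends itself to a toy sketch: after a forward tracking pass, frames are revisited in reverse order and each stored identity is re-checked against appearance prototypes accumulated from later, more reliable frames, so early mislabeled targets can be corrected before IDF1 is computed. The cosine threshold and prototype update rule are assumptions, not the paper's method:

```python
# Toy reverse-order relabeling pass; thresholds and updates are assumptions.
import numpy as np

def retrospective_relabel(frames, sim_thresh=0.7):
    """frames: list over time of {track_id: unit-norm feature vector}."""
    prototypes = {}                                  # track_id -> prototype
    for frame in reversed(frames):                   # reverse-order processing
        for tid, feat in list(frame.items()):
            best_id, best_sim = tid, -1.0
            for pid, proto in prototypes.items():    # match against later IDs
                sim = float(feat @ proto)
                if sim > best_sim:
                    best_id, best_sim = pid, sim
            if best_id != tid and best_sim > sim_thresh and best_id not in frame:
                frame[best_id] = frame.pop(tid)      # fix the historical label
                tid = best_id
            proto = prototypes.get(tid, np.zeros_like(feat)) + feat
            prototypes[tid] = proto / np.linalg.norm(proto)
    return frames
```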
Related papers
- Real-time Multi-Object Tracking Based on Bi-directional Matching [0.0]
This study offers a bi-directional matching algorithm for multi-object tracking.
A stranded area is used in the matching algorithm to temporarily store the objects that fail to be tracked.
In the MOT17 challenge, the proposed algorithm achieves 63.4% MOTA and 55.3% IDF1 at a tracking speed of 20.1 FPS.
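
A rough sketch of the stranded-area idea: tracks that fail to match a detection are parked in a "stranded" buffer for a few frames and re-enter matching later, instead of being deleted immediately. The paper's bi-directional matching is simplified here to a greedy center-distance matcher; the gate and age limit are assumptions:

```python
# Stranded-area buffer sketch; matcher and thresholds are assumptions.
def center(box):
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def dist(a, b):
    (ax, ay), (bx, by) = center(a), center(b)
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

def step(tracks, stranded, detections, gate=50.0, max_age=30):
    unmatched = list(detections)
    alive, lost = [], []
    for t in tracks + stranded:              # live tracks first, stranded second
        best = min(unmatched, key=lambda d: dist(t["box"], d), default=None)
        if best is not None and dist(t["box"], best) <= gate:
            t["box"], t["age"] = best, 0     # matched (possibly recovered)
            unmatched.remove(best)
            alive.append(t)
        else:
            t["age"] += 1
            if t["age"] <= max_age:          # keep in the stranded area a while
                lost.append(t)
    alive += [{"box": d, "age": 0} for d in unmatched]   # new tracks are born
    return alive, lost

tracks, stranded = [{"box": (0, 0, 10, 10), "age": 0}], []
tracks, stranded = step(tracks, stranded, [(2, 1, 12, 11)])
```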
arXiv Detail & Related papers (2023-03-15T08:38:08Z)
- Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
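
The per-frame memory-bank interaction can be sketched as cross-attention: only current-frame features are encoded, and they attend to features of past frames held in a fixed-length buffer. The FIFO update rule and all sizes are assumptions for illustration:

```python
# Memory-bank read via cross-attention; capacity and sizes are assumptions.
import torch
import torch.nn as nn

class MemoryBankTracker(nn.Module):
    def __init__(self, dim=128, heads=4, capacity=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.capacity = capacity
        self.memory = []                         # list of (B, N, dim) past features

    def forward(self, cur):                      # cur: (B, N, dim), current frame only
        if self.memory:
            mem = torch.cat(self.memory, dim=1)  # (B, T*N, dim) multi-frame history
            cur, _ = self.attn(cur, mem, mem)    # current frame reads the memory
        self.memory.append(cur.detach())         # FIFO memory update
        self.memory = self.memory[-self.capacity:]
        return cur

net = MemoryBankTracker()
for _ in range(3):
    out = net(torch.randn(2, 16, 128))
print(out.shape)  # torch.Size([2, 16, 128])
```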
arXiv Detail & Related papers (2023-03-14T02:58:27Z)
- 3DMODT: Attention-Guided Affinities for Joint Detection & Tracking in 3D Point Clouds [95.54285993019843]
We propose a method for joint detection and tracking of multiple objects in 3D point clouds.
Our model exploits temporal information employing multiple frames to detect objects and track them in a single network.
arXiv Detail & Related papers (2022-11-01T20:59:38Z)
- Transformer-based assignment decision network for multiple object tracking [0.0]
We introduce the Transformer-based Assignment Decision Network (TADN), which tackles data association without the need for explicit optimization during inference.
Our proposed approach outperforms the state-of-the-art in most evaluation metrics despite its simple nature as a tracker.
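
The core idea can be sketched briefly: a transformer decoder scores track-detection pairs directly, so association becomes a row-wise argmax instead of an explicit optimization step (e.g., the Hungarian algorithm) at inference time. Dimensions, heads, and the single decoder layer are illustrative assumptions:

```python
# Learned assignment sketch; duplicate assignments are ignored in this toy.
import torch
import torch.nn as nn

class AssignmentDecisionNet(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.dec = nn.TransformerDecoderLayer(dim, heads, batch_first=True)

    def forward(self, track_emb, det_emb):
        # track_emb: (B, T, dim) queries; det_emb: (B, D, dim) memory.
        q = self.dec(track_emb, det_emb)          # tracks attend to detections
        logits = q @ det_emb.transpose(1, 2)      # (B, T, D) pairwise scores
        return logits.argmax(dim=-1)              # direct assignment, no solver

net = AssignmentDecisionNet()
assign = net(torch.randn(1, 5, 64), torch.randn(1, 7, 64))
print(assign)  # detection index chosen for each of the 5 tracks
```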
arXiv Detail & Related papers (2022-08-06T19:47:32Z)
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- Learning Dynamic Compact Memory Embedding for Deformable Visual Object Tracking [82.34356879078955]
We propose a compact memory embedding to enhance the discrimination of the segmentation-based deformable visual tracking method.
Our method outperforms excellent segmentation-based trackers, i.e., D3S and SiamMask, on the DAVIS 2017 benchmark.
arXiv Detail & Related papers (2021-11-23T03:07:12Z)
- Multi-object Tracking with Tracked Object Bounding Box Association [18.539658212171062]
The CenterTrack tracking algorithm achieves state-of-the-art tracking performance using a simple detection model and single-frame spatial offsets.
We propose to incorporate a simple tracked-object bounding box and overlap prediction based on the current frame into the CenterTrack algorithm.
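
Overlap-based association of this kind can be sketched simply: each track carries a predicted box in the current frame, and detections are matched to tracks by IoU. The Hungarian solver and the threshold value are assumptions; the summary only specifies overlap-based association:

```python
# IoU association sketch; solver choice and threshold are assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(tracks, dets):
    m = np.zeros((len(tracks), len(dets)))
    for i, t in enumerate(tracks):
        for j, d in enumerate(dets):
            x1, y1 = max(t[0], d[0]), max(t[1], d[1])
            x2, y2 = min(t[2], d[2]), min(t[3], d[3])
            inter = max(0, x2 - x1) * max(0, y2 - y1)
            union = ((t[2]-t[0])*(t[3]-t[1]) + (d[2]-d[0])*(d[3]-d[1]) - inter)
            m[i, j] = inter / union if union > 0 else 0.0
    return m

def associate(predicted_track_boxes, detections, thresh=0.3):
    cost = 1.0 - iou_matrix(predicted_track_boxes, detections)
    rows, cols = linear_sum_assignment(cost)      # maximize total IoU
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= 1.0 - thresh]
```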
arXiv Detail & Related papers (2021-05-17T14:32:47Z)
- RelationTrack: Relation-aware Multiple Object Tracking with Decoupled Representation [3.356734463419838]
Existing online multiple object tracking (MOT) algorithms often consist of two subtasks, detection and re-identification (ReID).
In order to enhance the inference speed and reduce the complexity, current methods commonly integrate these two subtasks into a unified framework.
We devise a module named Global Context Disentangling (GCD) that decouples the learned representation into detection-specific and ReID-specific embeddings.
To resolve the restriction of associating objects with only local information, we develop a module, referred to as Guided Transformer Encoder (GTE), that combines the powerful reasoning ability of the Transformer encoder with deformable attention.
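
The decoupling idea attributed to GCD above can be sketched as follows: one shared backbone feature map is split into a detection-specific and a ReID-specific embedding by two lightweight heads gated by a global context vector, so the two subtasks stop competing for a single representation. All layer shapes and the gating scheme are illustrative assumptions:

```python
# Global-context-gated feature disentangling sketch; shapes are assumptions.
import torch
import torch.nn as nn

class ContextDisentangle(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)       # global context summary
        self.det_head = nn.Conv2d(dim, dim, 1)
        self.reid_head = nn.Conv2d(dim, dim, 1)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, feat):                      # feat: (B, C, H, W), shared
        g = self.gate(self.pool(feat).flatten(1)) # (B, C) context gate
        g = g[:, :, None, None]
        det = self.det_head(feat * g)             # detection-specific embedding
        reid = self.reid_head(feat * (1 - g))     # ReID-specific embedding
        return det, reid

det, reid = ContextDisentangle()(torch.randn(2, 256, 32, 32))
print(det.shape, reid.shape)
```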
arXiv Detail & Related papers (2021-05-10T13:00:40Z)
- TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking [74.82415271960315]
We propose a solution named TransMOT to efficiently model the spatial and temporal interactions among objects in a video.
TransMOT is not only more computationally efficient than the traditional Transformer, but it also achieves better tracking accuracy.
The proposed method is evaluated on multiple benchmark datasets including MOT15, MOT16, MOT17, and MOT20.
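
A loose sketch of why a graph transformer can be cheaper than dense attention: attention between object nodes is restricted to spatially nearby pairs via an adjacency mask. The distance threshold and the reduction of TransMOT's weighted spatial-temporal graph to a plain boolean mask are assumptions for illustration:

```python
# Graph-restricted attention sketch; mask construction is an assumption.
import torch
import torch.nn as nn

def spatial_mask(centers, radius=0.2):
    # centers: (N, 2) normalized object centers; True = attention blocked.
    d = torch.cdist(centers, centers)            # pairwise distances
    return d > radius                            # diagonal stays allowed

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
feats = torch.randn(1, 6, 64)                    # 6 object nodes in one frame
centers = torch.rand(6, 2)
out, _ = attn(feats, feats, feats, attn_mask=spatial_mask(centers))
print(out.shape)  # torch.Size([1, 6, 64])
```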
arXiv Detail & Related papers (2021-04-01T01:49:05Z)
- Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object (VOS)
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance on the DAVIS benchmark without complicated bells and whistles in both speed and accuracy, with a speed of 0.14 second per frame and J&F measure of 75.9% respectively.
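
A minimal sketch of a dynamic, time-evolving template: the matching template is an exponential moving average of per-frame target features, so it adapts to appearance change instead of staying fixed to the first frame. The momentum value and cosine matching are illustrative assumptions:

```python
# Time-evolving template matching sketch; momentum is an assumption.
import numpy as np

def update_template(template, feature, momentum=0.9):
    """Blend the running template with the newest matched feature."""
    template = momentum * template + (1.0 - momentum) * feature
    return template / (np.linalg.norm(template) + 1e-8)

def match(template, candidates):
    """Pick the candidate feature most similar to the current template."""
    return int(np.argmax(candidates @ template))   # cosine scores (unit features)

template = np.random.randn(128)
template /= np.linalg.norm(template)
for _ in range(5):                                 # simulate 5 frames
    cands = np.random.randn(10, 128)
    cands /= np.linalg.norm(cands, axis=1, keepdims=True)
    template = update_template(template, cands[match(template, cands)])
```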
arXiv Detail & Related papers (2020-07-11T05:44:16Z)