MixFormer: End-to-End Tracking with Iterative Mixed Attention
- URL: http://arxiv.org/abs/2302.02814v2
- Date: Thu, 9 Feb 2023 18:15:41 GMT
- Title: MixFormer: End-to-End Tracking with Iterative Mixed Attention
- Authors: Yutao Cui, Cheng Jiang, Gangshan Wu and Limin Wang
- Abstract summary: We present a compact tracking framework, termed MixFormer, built upon transformers.
We propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration.
Our MixFormer trackers set a new state of the art on seven tracking benchmarks.
- Score: 47.78513247048846
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual object tracking often employs a multi-stage pipeline of feature
extraction, target information integration, and bounding box estimation. To
simplify this pipeline and unify the process of feature extraction and target
information integration, in this paper, we present a compact tracking
framework, termed MixFormer, built upon transformers. Our core design is to
utilize the flexibility of attention operations, and propose a Mixed Attention
Module (MAM) for simultaneous feature extraction and target information
integration. This synchronous modeling scheme allows us to extract
target-specific discriminative features and to perform extensive communication
between the target and the search area. Based on MAM, we build our MixFormer
trackers simply by stacking
multiple MAMs and placing a localization head on top. Specifically, we
instantiate two types of MixFormer trackers, a hierarchical tracker MixCvT, and
a non-hierarchical tracker MixViT. For these two trackers, we investigate a
series of pre-training methods and uncover the different behaviors between
supervised pre-training and self-supervised pre-training in our MixFormer
trackers. We also extend the masked pre-training to our MixFormer trackers and
design the competitive TrackMAE pre-training technique. Finally, to handle
multiple target templates during online tracking, we devise an asymmetric
attention scheme in MAM to reduce computational cost, and propose an effective
score prediction module to select high-quality templates. Our MixFormer
trackers set a new state of the art on seven tracking benchmarks,
including LaSOT, TrackingNet, VOT2020, GOT-10k, OTB100, and UAV123. In
particular, our MixViT-L achieves an AUC score of 73.3% on LaSOT and 86.1% on
TrackingNet, an EAO of 0.584 on VOT2020, and an AO of 75.7% on GOT-10k. Code and
trained models are publicly available at https://github.com/MCG-NJU/MixFormer.
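The core idea of the Mixed Attention Module is a single attention pass over the concatenated template and search tokens, so that feature extraction and target-search communication happen in one operation. Below is a minimal PyTorch sketch of that idea; the module name, tensor shapes, and the plain post-norm residual are illustrative assumptions, not the authors' exact implementation (see the official repository for that).

```python
# Minimal sketch of mixed attention over concatenated template and search
# tokens. Shapes and module layout are assumptions for illustration only.
import torch
import torch.nn as nn

class MixedAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, template, search):
        # template: (B, N_t, C) target tokens; search: (B, N_s, C) search tokens
        tokens = torch.cat([template, search], dim=1)
        # One joint attention pass: every token attends to both regions,
        # extracting features and mixing target cues at the same time.
        mixed, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + mixed)
        n_t = template.size(1)
        return tokens[:, :n_t], tokens[:, n_t:]
```

Stacking several such blocks and feeding the resulting search tokens to a localization head gives a tracker of this shape; per the abstract, MixCvT arranges the stages hierarchically while MixViT keeps a plain, non-hierarchical token sequence.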
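The asymmetric attention scheme can be expressed as a mask that forbids template queries from attending to search keys, so template features do not depend on the current frame and can be cached across frames, while search tokens still read from every template. A hedged sketch compatible with the module above (True marks a disallowed query-key pair, matching the boolean `attn_mask` convention of `nn.MultiheadAttention`):

```python
import torch

def asymmetric_mask(n_t: int, n_s: int) -> torch.Tensor:
    """Boolean attention mask: template queries see only template keys."""
    L = n_t + n_s
    mask = torch.zeros(L, L, dtype=torch.bool)
    mask[:n_t, n_t:] = True  # block template-query -> search-key attention
    return mask

# Usage with the sketch above:
#   self.attn(tokens, tokens, tokens, attn_mask=asymmetric_mask(n_t, n_s))
```

Dropping the template-to-search cross term removes the attention cost that grows with the number of online templates, which is what makes keeping multiple templates affordable during online tracking.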
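Finally, the score prediction module that selects high-quality online templates can be thought of as a small head mapping target tokens to a confidence. The mean pooling, MLP layout, and threshold below are assumptions for illustration, not the paper's exact design:

```python
import torch
import torch.nn as nn

class ScoreHead(nn.Module):
    """Illustrative template-quality head: target tokens -> confidence in [0, 1]."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, target_tokens):
        pooled = target_tokens.mean(dim=1)                   # (B, N_t, C) -> (B, C)
        return torch.sigmoid(self.mlp(pooled)).squeeze(-1)  # (B,)
```

During online tracking, a candidate would replace the current online template only when its predicted score clears a threshold (the threshold value is a hyperparameter, not stated in the abstract).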
Related papers
- HSTrack: Bootstrap End-to-End Multi-Camera 3D Multi-object Tracking with Hybrid Supervision [34.7347336548199]
In camera-based 3D multi-object tracking (MOT), the prevailing methods follow the tracking-by-query-propagation paradigm.
We present HSTrack, a novel plug-and-play method designed to co-facilitate multi-task learning for detection and tracking.
arXiv Detail & Related papers (2024-11-11T08:18:49Z)
- MixFormerV2: Efficient Fully Transformer Tracking [49.07428299165031]
Transformer-based trackers have achieved strong accuracy on the standard benchmarks, but their efficiency remains an obstacle to practical deployment on both GPU and CPU platforms.
We propose a fully transformer tracking framework, coined MixFormerV2, without any dense convolutional operations or a complex score prediction module.
arXiv Detail & Related papers (2023-05-25T09:50:54Z)
- OmniTracker: Unifying Object Tracking by Tracking-with-Detection [119.51012668709502]
OmniTracker is presented to resolve all the tracking tasks with a fully shared network architecture, model weights, and inference pipeline.
Experiments on 7 tracking datasets, including LaSOT, TrackingNet, DAVIS16-17, MOT17, MOTS20, and YTVIS19, demonstrate that OmniTracker achieves on-par or even better results than both task-specific and unified tracking models.
arXiv Detail & Related papers (2023-03-21T17:59:57Z)
- SMILEtrack: SiMIlarity LEarning for Occlusion-Aware Multiple Object Tracking [20.286114226299237]
This paper introduces SMILEtrack, an innovative object tracker with a Siamese network-based Similarity Learning Module (SLM).
The SLM calculates the appearance similarity between two objects, overcoming the limitations of feature descriptors in Separate Detection and Embedding models.
We also develop a Similarity Matching Cascade (SMC) module with a novel GATE function for robust object matching across consecutive video frames.
arXiv Detail & Related papers (2022-11-16T10:49:48Z)
- Unified Transformer Tracker for Object Tracking [58.65901124158068]
We present the Unified Transformer Tracker (UTT) to address tracking problems in different scenarios with one paradigm.
A track transformer is developed in our UTT to track the target in both Single Object Tracking (SOT) and Multiple Object Tracking (MOT).
arXiv Detail & Related papers (2022-03-29T01:38:49Z)
- MixFormer: End-to-End Tracking with Iterative Mixed Attention [47.37548708021754]
We present a compact tracking framework, termed MixFormer, built upon transformers.
Our core design is to utilize the flexibility of attention operations, and propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration.
Our MixFormer sets a new state of the art on five tracking benchmarks, including LaSOT, TrackingNet, VOT2020, GOT-10k, and UAV123.
arXiv Detail & Related papers (2022-03-21T16:04:21Z)
- Tracking by Instance Detection: A Meta-Learning Approach [99.66119903655711]
We propose a principled three-step approach to build a high-performance tracker.
We build two trackers, named Retina-MAML and FCOS-MAML, based on the two modern detectors RetinaNet and FCOS.
Both trackers run in real-time at 40 FPS.
arXiv Detail & Related papers (2020-04-02T05:55:06Z)