MixFormer: End-to-End Tracking with Iterative Mixed Attention
- URL: http://arxiv.org/abs/2203.11082v1
- Date: Mon, 21 Mar 2022 16:04:21 GMT
- Title: MixFormer: End-to-End Tracking with Iterative Mixed Attention
- Authors: Yutao Cui, Cheng Jiang, Limin Wang and Gangshan Wu
- Abstract summary: We present a compact tracking framework, termed MixFormer, built upon transformers.
Our core design is to utilize the flexibility of attention operations, and propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration.
Our MixFormer sets a new state-of-the-art performance on five tracking benchmarks, including LaSOT, TrackingNet, VOT2020, GOT-10k, and UAV123.
- Score: 47.37548708021754
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tracking often uses a multi-stage pipeline of feature extraction, target
information integration, and bounding box estimation. To simplify this pipeline
and unify the process of feature extraction and target information integration,
we present a compact tracking framework, termed MixFormer, built upon
transformers. Our core design is to utilize the flexibility of attention
operations, and propose a Mixed Attention Module (MAM) for simultaneous feature
extraction and target information integration. This synchronous modeling scheme
allows us to extract target-specific discriminative features and to perform
extensive communication between the target and the search area. Based on MAM, we build our
MixFormer tracking framework simply by stacking multiple MAMs with progressive
patch embedding and placing a localization head on top. In addition, to handle
multiple target templates during online tracking, we devise an asymmetric
attention scheme in MAM to reduce computational cost, and propose an effective
score prediction module to select high-quality templates. Our MixFormer sets a
new state-of-the-art performance on five tracking benchmarks, including LaSOT,
TrackingNet, VOT2020, GOT-10k, and UAV123. In particular, our MixFormer-L
achieves an NP score of 79.9 on LaSOT, 88.9 on TrackingNet, and an EAO of 0.555 on
VOT2020. We also perform in-depth ablation studies to demonstrate the
effectiveness of simultaneous feature extraction and information integration.
Code and trained models are publicly available at
https://github.com/MCG-NJU/MixFormer.
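To make the abstract's core idea concrete, here is a minimal, dependency-free sketch of mixed attention: template and search tokens are concatenated so that a single attention operation performs both feature extraction (self-attention within each set) and target information integration (cross-attention between sets). An `asymmetric` flag illustrates the cost-saving scheme the abstract mentions, where template queries attend only to template keys. All names, shapes, and the identity Q/K/V projections are illustrative assumptions, not the authors' implementation.

```python
import math

def _softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def mixed_attention(template, search, asymmetric=False):
    """One mixed-attention pass over concatenated template + search tokens.

    Each token is a list of floats. Identity projections stand in for the
    learned Q/K/V weights. With asymmetric=True, template queries attend only
    to template keys (reducing cost, as in the abstract's asymmetric scheme),
    while search queries still attend to all tokens.
    """
    tokens = template + search
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)

    def attend(query, keys):
        # Scaled dot-product attention of one query over a set of keys/values.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
        w = _softmax(scores)
        return [sum(wj * v[i] for wj, v in zip(w, keys)) for i in range(d)]

    new_template = [attend(q, template if asymmetric else tokens) for q in template]
    new_search = [attend(q, tokens) for q in search]
    return new_template, new_search
```

Stacking such blocks with progressive patch embedding, as the abstract describes, yields the backbone on which a localization head is placed.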
Related papers
- Staged Depthwise Correlation and Feature Fusion for Siamese Object Tracking [0.6827423171182154]
We propose a novel staged depthwise correlation and feature fusion network, named DCFFNet, to further optimize the feature extraction for visual tracking.
We build our deep tracker upon a siamese network architecture, which is offline trained from scratch on multiple large-scale datasets.
For comprehensive evaluations of performance, we implement our tracker on the popular benchmarks, including OTB100, VOT2018 and LaSOT.
arXiv Detail & Related papers (2023-10-15T06:04:42Z) - MixFormerV2: Efficient Fully Transformer Tracking [49.07428299165031]
Transformer-based trackers have achieved strong accuracy on the standard benchmarks.
But their efficiency remains an obstacle to practical deployment on both GPU and CPU platforms.
We propose a fully transformer tracking framework, coined MixFormerV2, without any dense convolutional operations or a complex score prediction module.
arXiv Detail & Related papers (2023-05-25T09:50:54Z) - MixFormer: End-to-End Tracking with Iterative Mixed Attention [47.78513247048846]
We present a compact tracking framework, termed as MixFormer, built upon transformers.
We propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration.
Our MixFormer trackers set a new state-of-the-art performance on seven tracking benchmarks.
arXiv Detail & Related papers (2023-02-06T14:38:09Z) - OST: Efficient One-stream Network for 3D Single Object Tracking in Point Clouds [6.661881950861012]
We propose a novel one-stream network with the strength of instance-level encoding, which avoids the correlation operations used in previous Siamese networks.
The proposed method achieves strong performance not only for class-specific tracking but also for class-agnostic tracking, with less computation and higher efficiency.
arXiv Detail & Related papers (2022-10-16T12:31:59Z) - Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework [76.70603443624012]
We propose a novel one-stream tracking (OSTrack) framework that unifies feature learning and relation modeling.
In this way, discriminative target-oriented features can be dynamically extracted by mutual guidance.
OSTrack achieves state-of-the-art performance on multiple benchmarks, in particular, it shows impressive results on the one-shot tracking benchmark GOT-10k.
arXiv Detail & Related papers (2022-03-22T18:37:11Z) - Harnessing Hard Mixed Samples with Decoupled Regularizer [69.98746081734441]
Mixup is an efficient data augmentation approach that improves the generalization of neural networks by smoothing the decision boundary with mixed data.
In this paper, we propose an efficient mixup objective function with a decoupled regularizer named Decoupled Mixup (DM)
DM can adaptively utilize hard mixed samples to mine discriminative features without losing the original smoothness of mixup.
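For orientation, here is a sketch of standard mixup, the augmentation that Decoupled Mixup builds on (this is not the paper's DM objective): two samples and their labels are linearly interpolated with a Beta-distributed coefficient, which smooths the decision boundary as the summary describes. The `alpha` value and vector representation are illustrative assumptions.

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Mix two (sample, label) pairs; inputs are lists of floats.

    lambda ~ Beta(alpha, alpha) controls how strongly the pair is blended.
    """
    lam = random.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam
```

DM's contribution, per the summary, is a decoupled regularizer on top of this objective so that hard mixed samples still yield discriminative gradients.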
arXiv Detail & Related papers (2022-03-21T07:12:18Z) - Multiple target tracking with interaction using an MCMC MRF Particle Filter [0.0]
This paper presents and discusses an implementation of a multiple target tracking method.
The referenced approach uses a Markov Chain Monte Carlo (MCMC) sampling step to evaluate the filter and constructs an efficient proposal density to generate new samples.
It is shown that the implemented approach of modeling target interactions with an MRF successfully corrects many of the tracking errors made by independent, interaction-unaware particle filters.
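As background for this entry, here is a minimal bootstrap particle filter step for a single target in 1-D. The paper's method extends this idea to multiple interacting targets, replacing independent resampling with MCMC moves over a joint state and adding an MRF interaction prior; the Gaussian motion and observation models below are illustrative assumptions, not the paper's.

```python
import math
import random

def particle_filter_step(particles, observation, motion_std=1.0, obs_std=1.0):
    """One predict/weight/resample step of a 1-D bootstrap particle filter.

    particles: list of floats (hypothesised target positions).
    """
    # Predict: propagate each particle through a random-walk motion model.
    moved = [p + random.gauss(0.0, motion_std) for p in particles]
    # Weight: Gaussian likelihood of the observation given each particle.
    weights = [math.exp(-0.5 * ((observation - p) / obs_std) ** 2) for p in moved]
    total = sum(weights)
    if total == 0.0:  # all weights underflowed; fall back to uniform
        weights = [1.0] * len(moved)
        total = float(len(moved))
    weights = [w / total for w in weights]
    # Resample: draw particles in proportion to their weights.
    return random.choices(moved, weights=weights, k=len(moved))

def estimate(particles):
    # Posterior mean estimate of the target position.
    return sum(particles) / len(particles)
```

Running several such steps against a fixed observation concentrates the particle cloud around the observed position; independent per-target filters of this kind are exactly what the MRF interaction model corrects when targets come close together.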
arXiv Detail & Related papers (2021-11-25T17:32:50Z) - Transformer Tracking [76.96796612225295]
Correlation plays a critical role in the tracking field, especially in popular Siamese-based trackers.
This work presents a novel attention-based feature fusion network, which effectively combines the template and search region features solely using attention.
Experiments show that our TransT achieves very promising results on six challenging datasets.
arXiv Detail & Related papers (2021-03-29T09:06:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.