Compact Transformer Tracker with Correlative Masked Modeling
- URL: http://arxiv.org/abs/2301.10938v1
- Date: Thu, 26 Jan 2023 04:58:08 GMT
- Title: Compact Transformer Tracker with Correlative Masked Modeling
- Authors: Zikai Song, Run Luo, Junqing Yu, Yi-Ping Phoebe Chen, Wei Yang
- Abstract summary: The Transformer framework has shown superior performance in visual object tracking.
Recent advances focus on exploring attention mechanism variants for better information aggregation.
In this paper, we prove that the vanilla self-attention structure is sufficient for information aggregation.
- Score: 16.234426179567837
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The Transformer framework has shown superior performance in visual object
tracking owing to its strength in aggregating information across the template
and search image with the well-known attention mechanism. Most recent advances
focus on exploring attention-mechanism variants for better information
aggregation. We find these schemes are equivalent to, or even just a subset of,
the basic self-attention mechanism. In this paper, we prove that the vanilla
self-attention structure is sufficient for information aggregation and that
structural adaptation is unnecessary. The key is not the attention structure,
but how to extract discriminative features for tracking and enhance
communication between the target and the search image. Based on this finding,
we adopt the basic vision transformer (ViT) architecture as our main tracker
and concatenate the template and search image for feature embedding. To guide
the encoder to capture invariant features for tracking, we attach a lightweight
correlative masked decoder which reconstructs the original template and search
image from the corresponding masked tokens. The correlative masked decoder
serves as a plugin for the compact transformer tracker and is skipped at
inference. Our compact tracker uses the simplest possible structure, consisting
only of a ViT backbone and a box head, and runs at 40 fps. Extensive
experiments show the proposed compact transformer tracker outperforms existing
approaches, including advanced attention variants, and demonstrates the
sufficiency of self-attention for tracking. Our method achieves
state-of-the-art performance on five challenging benchmarks: VOT2020, UAV123,
LaSOT, TrackingNet, and GOT-10k. Our project is available at
https://github.com/HUSTDML/CTTrack.
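As a rough illustration of the pipeline the abstract describes, the sketch below (PyTorch) concatenates template and search patch tokens, encodes them with a plain ViT built from vanilla self-attention, and predicts a box from the search tokens; a lightweight reconstruction decoder stands in for the correlative masked decoder and is used only during training, so inference touches only the backbone and box head. All module names, sizes, and the masking/reconstruction details here are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
# Minimal sketch of a compact ViT tracker with a training-only masked decoder.
# Sizes, heads, and the decoder design are placeholder assumptions.
import torch
import torch.nn as nn


class CompactTransformerTracker(nn.Module):
    def __init__(self, embed_dim=768, depth=12, num_heads=12, patch=16,
                 template_size=128, search_size=256):
        super().__init__()
        self.n_t = (template_size // patch) ** 2   # number of template tokens
        self.n_s = (search_size // patch) ** 2     # number of search tokens
        self.embed = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.n_t + self.n_s, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        # Vanilla self-attention encoder: no tracking-specific attention variant.
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Box head: predicts a normalized (cx, cy, w, h) box from search tokens.
        self.box_head = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, 4), nn.Sigmoid())
        # Training-only reconstruction decoder (a rough stand-in for the
        # correlative masked decoder); it maps tokens back to pixel patches.
        self.decoder = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.GELU(),
                                     nn.Linear(embed_dim, 3 * patch * patch))

    def tokenize(self, img):
        # (B, 3, H, W) -> (B, N, C) patch tokens
        return self.embed(img).flatten(2).transpose(1, 2)

    def forward(self, template, search, mask_ratio=0.0):
        # Joint feature embedding: concatenate template and search tokens so
        # plain self-attention handles all template/search communication.
        tokens = torch.cat([self.tokenize(template), self.tokenize(search)], dim=1)
        tokens = tokens + self.pos
        if self.training and mask_ratio > 0:
            # Zero out a random fraction of tokens; the decoder must then
            # reconstruct the corresponding image patches from the rest.
            keep = torch.rand(tokens.shape[:2], device=tokens.device) > mask_ratio
            tokens = tokens * keep.unsqueeze(-1)
        feats = self.encoder(tokens)
        search_feats = feats[:, self.n_t:]             # tokens of the search image
        box = self.box_head(search_feats.mean(dim=1))  # (B, 4) normalized box
        recon = self.decoder(feats) if self.training else None  # skipped at inference
        return box, recon


# Inference uses only the ViT backbone and the box head:
tracker = CompactTransformerTracker().eval()
z = torch.randn(1, 3, 128, 128)   # template crop
x = torch.randn(1, 3, 256, 256)   # search crop
with torch.no_grad():
    box, _ = tracker(z, x)
print(box.shape)  # torch.Size([1, 4])
```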
Related papers
- Separable Self and Mixed Attention Transformers for Efficient Object Tracking [3.9160947065896803]
This paper proposes an efficient self- and mixed-attention transformer-based architecture for lightweight tracking.
With these contributions, the proposed lightweight tracker deploys a transformer-based backbone and head module concurrently for the first time.
Simulations show that our Separable Self and Mixed Attention-based Tracker, SMAT, surpasses the performance of related lightweight trackers on the GOT10k, TrackingNet, LaSOT, NfS30, UAV123, and AVisT datasets.
arXiv Detail & Related papers (2023-09-07T19:23:02Z)
- End-to-end Tracking with a Multi-query Transformer [96.13468602635082]
Multiple-object tracking (MOT) is a challenging task that requires simultaneous reasoning about the location, appearance, and identity of the objects in the scene over time.
Our aim in this paper is to move beyond tracking-by-detection approaches, to class-agnostic tracking that also performs well for unknown object classes.
arXiv Detail & Related papers (2022-10-26T10:19:37Z)
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages the Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks, including MOT16, MOT17, and MOT20, and achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- SparseTT: Visual Tracking with Sparse Transformers [43.1666514605021]
The self-attention mechanism, designed to model long-range dependencies, is the key to the success of Transformers.
In this paper, we relieve the distraction caused by less relevant background regions with a sparse attention mechanism that focuses on the most relevant information in the search regions.
We introduce a double-head predictor to boost the accuracy of foreground-background classification and the regression of target bounding boxes.
arXiv Detail & Related papers (2022-05-08T04:00:28Z)
- Efficient Visual Tracking with Exemplar Transformers [98.62550635320514]
We introduce the Exemplar Transformer, an efficient transformer for real-time visual object tracking.
E.T.Track, our visual tracker that incorporates Exemplar Transformer layers, runs at 47 fps on a CPU.
This is up to 8 times faster than other transformer-based models.
arXiv Detail & Related papers (2021-12-17T18:57:54Z)
- Learning Dynamic Compact Memory Embedding for Deformable Visual Object Tracking [82.34356879078955]
We propose a compact memory embedding to enhance the discrimination of the segmentation-based deformable visual tracking method.
Our method outperforms the excellent segmentation-based trackers D3S and SiamMask on the DAVIS 2017 benchmark.
arXiv Detail & Related papers (2021-11-23T03:07:12Z)
- TrTr: Visual Tracking with Transformer [29.415900191169587]
We propose a novel tracker network based on the powerful Transformer encoder-decoder attention architecture.
We design the classification and regression heads using the output of the Transformer to localize the target based on shape-agnostic anchors.
Our method performs favorably against state-of-the-art algorithms.
arXiv Detail & Related papers (2021-05-09T02:32:28Z)
- Transformer Tracking [76.96796612225295]
Correlation plays a critical role in the tracking field, especially in popular Siamese-based trackers.
This work presents a novel attention-based feature fusion network, which effectively combines the template and search-region features solely using attention.
Experiments show that our TransT achieves very promising results on six challenging datasets.
arXiv Detail & Related papers (2021-03-29T09:06:55Z)
- TrackFormer: Multi-Object Tracking with Transformers [92.25832593088421]
TrackFormer is an end-to-end multi-object tracking and segmentation model based on an encoder-decoder Transformer architecture.
New track queries are spawned by the DETR object detector and embed the position of their corresponding object over time.
TrackFormer achieves seamless data association between frames in a new tracking-by-attention paradigm.
arXiv Detail & Related papers (2021-01-07T18:59:29Z)