Siamese Transformer Pyramid Networks for Real-Time UAV Tracking
- URL: http://arxiv.org/abs/2110.08822v1
- Date: Sun, 17 Oct 2021 13:48:31 GMT
- Title: Siamese Transformer Pyramid Networks for Real-Time UAV Tracking
- Authors: Daitao Xing, Nikolaos Evangeliou, Athanasios Tsoukalas and Anthony
Tzes
- Abstract summary: We introduce the Siamese Transformer Pyramid Network (SiamTPN), which inherits the advantages of both CNN and Transformer architectures.
Experiments on both aerial and widely used tracking benchmarks show competitive results while operating at high speed.
Our fastest variant runs at over 30 Hz on a single CPU core while obtaining an AUC score of 58.1% on the LaSOT dataset.
- Score: 3.0969191504482243
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent object tracking methods depend upon deep networks or convoluted
architectures. Most of those trackers can hardly meet real-time processing
requirements on mobile platforms with limited computing resources. In this
work, we introduce the Siamese Transformer Pyramid Network (SiamTPN), which
inherits the advantages of both CNN and Transformer architectures.
Specifically, we exploit the inherent feature pyramid of a lightweight network
(ShuffleNetV2) and reinforce it with a Transformer to construct a robust
target-specific appearance model. A centralized architecture with lateral cross
attention is developed for building augmented high-level feature maps. To avoid
the heavy computation and memory cost of fusing pyramid representations with
the Transformer, we further introduce a pooling attention module, which
significantly reduces memory and time complexity while improving the
robustness. Comprehensive experiments on both aerial and widely used tracking
benchmarks show that SiamTPN achieves competitive results while operating at
high speed, demonstrating its effectiveness. Moreover, our fastest variant
operates at over 30 Hz on a single CPU core and obtains an AUC score of 58.1%
on the LaSOT dataset. Source codes are available at
https://github.com/RISCNYUAD/SiamTPNTracker
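The pooling attention module is the abstract's central efficiency claim, so a minimal PyTorch sketch of the idea may help: spatially pooling keys and values before attention shrinks the token count, cutting the cost of fusing pyramid features while queries keep full resolution. The class name, the use of average pooling, the pooling ratio, and the tensor shapes below are illustrative assumptions, not details taken from the paper or its repository.

```python
# Minimal sketch of a pooling-attention idea: keys/values are spatially
# downsampled before attention, so attending N query tokens to M key tokens
# costs O(N * M / r^2) instead of O(N * M) for pooling ratio r.
import torch
import torch.nn as nn


class PoolingAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, pool_ratio: int = 4):
        super().__init__()
        # Average pooling over the key/value feature map (assumed choice).
        self.pool = nn.AvgPool2d(kernel_size=pool_ratio, stride=pool_ratio)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, query: torch.Tensor, kv: torch.Tensor) -> torch.Tensor:
        # query: (B, C, Hq, Wq) feature map attending to kv: (B, C, Hk, Wk).
        b, c, hq, wq = query.shape
        q = query.flatten(2).transpose(1, 2)             # (B, Hq*Wq, C)
        kv_small = self.pool(kv)                         # (B, C, Hk/r, Wk/r)
        kv_tokens = kv_small.flatten(2).transpose(1, 2)  # (B, Hk*Wk/r^2, C)
        out, _ = self.attn(q, kv_tokens, kv_tokens)      # attention over fewer tokens
        return out.transpose(1, 2).reshape(b, c, hq, wq)


if __name__ == "__main__":
    # Hypothetical template/search shapes for a Siamese tracker.
    attn = PoolingAttention(dim=256, num_heads=8, pool_ratio=4)
    search = torch.randn(1, 256, 16, 16)   # search-region features
    template = torch.randn(1, 256, 8, 8)   # template features
    fused = attn(search, template)
    print(fused.shape)  # torch.Size([1, 256, 16, 16])
```

Because the query and key/value inputs are independent, the same module could also play the role of a lateral cross attention between pyramid levels, which is how the abstract describes the centralized architecture.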
Related papers
- Correlation-Embedded Transformer Tracking: A Single-Branch Framework [69.0798277313574]
We propose a novel single-branch tracking framework inspired by the transformer.
Unlike Siamese-style feature extraction, our tracker deeply embeds cross-image feature correlation in multiple layers of the feature network.
The output features can be directly used for predicting target locations without additional correlation steps.
arXiv Detail & Related papers (2024-01-23T13:20:57Z)
- Separable Self and Mixed Attention Transformers for Efficient Object Tracking [3.9160947065896803]
This paper proposes an efficient self and mixed attention transformer-based architecture for lightweight tracking.
The proposed lightweight tracker is the first to deploy a transformer-based backbone and head module concurrently.
Simulations show that our Separable Self and Mixed Attention-based Tracker, SMAT, surpasses the performance of related lightweight trackers on GOT10k, TrackingNet, LaSOT, NfS30, UAV123, and AVisT datasets.
arXiv Detail & Related papers (2023-09-07T19:23:02Z)
- Exploring Lightweight Hierarchical Vision Transformers for Efficient Visual Tracking [69.89887818921825]
HiT is a new family of efficient tracking models that can run at high speed on different devices.
HiT achieves 64.6% AUC on the LaSOT benchmark, surpassing all previous efficient trackers.
arXiv Detail & Related papers (2023-08-14T02:51:34Z)
- Efficient Joint Detection and Multiple Object Tracking with Spatially Aware Transformer [0.8808021343665321]
We propose a lightweight and highly efficient Joint Detection and Tracking pipeline for the task of Multi-Object Tracking.
It is driven by a transformer-based backbone instead of a CNN, which makes it highly scalable with the input resolution.
As a result of our modifications, we reduce the overall model size of TransTrack by 58.73% and the complexity by 78.72%.
arXiv Detail & Related papers (2022-11-09T07:19:33Z)
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- Efficient Visual Tracking via Hierarchical Cross-Attention Transformer [82.92565582642847]
We present an efficient tracking method via a hierarchical cross-attention transformer named HCAT.
Our model runs at about 195 fps on a GPU, 45 fps on a CPU, and 55 fps on the NVIDIA Jetson AGX Xavier edge AI platform.
arXiv Detail & Related papers (2022-03-25T09:45:27Z)
- LiteTransformerSearch: Training-free On-device Search for Efficient Autoregressive Language Models [34.673688610935876]
We show that the latency-perplexity Pareto frontier can be found without the need for any model training; a toy sketch of this frontier computation appears after this list.
We evaluate our method, dubbed Lightweight Transformer Search (LTS), on diverse devices.
We show that the perplexity of Transformer-XL can be achieved with up to 2x lower latency.
arXiv Detail & Related papers (2022-03-04T02:10:43Z)
- Efficient Visual Tracking with Exemplar Transformers [98.62550635320514]
We introduce the Exemplar Transformer, an efficient transformer for real-time visual object tracking.
E.T.Track, our visual tracker that incorporates Exemplar Transformer layers, runs at 47 fps on a CPU.
This is up to 8 times faster than other transformer-based models.
arXiv Detail & Related papers (2021-12-17T18:57:54Z)
- Trident Pyramid Networks: The importance of processing at the feature pyramid level for better object detection [50.008529403150206]
We present a new core architecture called the Trident Pyramid Network (TPN).
TPN allows for a deeper design and for a better balance between communication-based processing and self-processing.
We show consistent improvements when using our TPN core on the object detection benchmark, outperforming the popular BiFPN baseline by 1.5 AP.
arXiv Detail & Related papers (2021-10-08T09:59:59Z)
- Searching for Efficient Multi-Stage Vision Transformers [42.0565109812926]
Vision Transformer (ViT) demonstrates that Transformers developed for natural language processing can be applied to computer vision tasks.
ViT-ResNAS is an efficient multi-stage ViT architecture designed with neural architecture search (NAS).
arXiv Detail & Related papers (2021-09-01T22:37:56Z)
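As referenced in the LiteTransformerSearch entry above, the heart of a training-free search is keeping only the candidate architectures that are Pareto-optimal in (latency, perplexity), where both are lower-is-better. The Python sketch below shows that frontier computation on made-up candidates; the function and numbers are illustrative, not code from the LTS paper.

```python
# Toy Pareto-frontier selection over (latency, perplexity) pairs.
from typing import List, Tuple


def pareto_frontier(candidates: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return the (latency, perplexity) points not dominated by any other."""
    frontier = []
    best_ppl = float("inf")
    # Sweep by increasing latency; a point survives only if it strictly
    # improves on the best perplexity seen at lower latency.
    for lat, ppl in sorted(candidates):
        if ppl < best_ppl:
            frontier.append((lat, ppl))
            best_ppl = ppl
    return frontier


if __name__ == "__main__":
    # (latency in ms, proxy perplexity) for hypothetical Transformer configs.
    configs = [(12.0, 31.5), (8.0, 35.2), (20.0, 29.8), (8.5, 34.0), (15.0, 33.1)]
    print(pareto_frontier(configs))
    # [(8.0, 35.2), (8.5, 34.0), (12.0, 31.5), (20.0, 29.8)]
```

In a training-free setup, the perplexity column would come from a cheap proxy (for example, a few forward passes) rather than from fully trained models, which is what makes sweeping many candidates practical.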
This list is automatically generated from the titles and abstracts of the papers on this site.