Exploring Lightweight Hierarchical Vision Transformers for Efficient Visual Tracking
- URL: http://arxiv.org/abs/2308.06904v1
- Date: Mon, 14 Aug 2023 02:51:34 GMT
- Title: Exploring Lightweight Hierarchical Vision Transformers for Efficient Visual Tracking
- Authors: Ben Kang, Xin Chen, Dong Wang, Houwen Peng and Huchuan Lu
- Abstract summary: HiT is a new family of efficient tracking models that can run at high speed on different devices.
HiT achieves 64.6% AUC on the LaSOT benchmark, surpassing all previous efficient trackers.
- Score: 69.89887818921825
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based visual trackers have demonstrated significant progress
owing to their superior modeling capabilities. However, existing trackers are
hampered by low speed, limiting their applicability on devices with limited
computational power. To alleviate this problem, we propose HiT, a new family of
efficient tracking models that can run at high speed on different devices while
retaining high performance. The central idea of HiT is the Bridge Module, which
bridges the gap between modern lightweight transformers and the tracking
framework. The Bridge Module incorporates the high-level information of deep
features into the shallow large-resolution features. In this way, it produces
better features for the tracking head. We also propose a novel dual-image
position encoding technique that simultaneously encodes the position
information of both the search region and template images. The HiT model
achieves promising speed with competitive performance. For instance, it runs at
61 frames per second (fps) on the Nvidia Jetson AGX edge device. Furthermore,
HiT attains 64.6% AUC on the LaSOT benchmark, surpassing all previous efficient
trackers.
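The abstract describes the Bridge Module only at the level of feature flow: deep, low-resolution features carry high-level semantics that are injected into shallow, large-resolution features before the tracking head. Below is a minimal PyTorch sketch of that general fusion pattern; the class name, channel widths, and concatenate-then-convolve strategy are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the Bridge Module idea: fuse high-level (deep,
# low-resolution) features into shallow, large-resolution features.
# Layer names, channel sizes, and the fusion strategy are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BridgeModule(nn.Module):
    def __init__(self, shallow_ch: int, deep_ch: int, out_ch: int):
        super().__init__()
        # Project deep features to the shallow channel width.
        self.proj = nn.Conv2d(deep_ch, shallow_ch, kernel_size=1)
        # Fuse the concatenated features into the head's input width.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * shallow_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, shallow: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
        # Upsample the low-resolution deep features to the shallow map's
        # size, then inject their semantics via concatenation + convolution.
        deep = self.proj(deep)
        deep = F.interpolate(deep, size=shallow.shape[-2:],
                             mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([shallow, deep], dim=1))

# Example: a 64x64 shallow map enriched with a 16x16 deep map.
shallow = torch.randn(1, 128, 64, 64)
deep = torch.randn(1, 512, 16, 16)
out = BridgeModule(128, 512, 256)(shallow, deep)
print(out.shape)  # torch.Size([1, 256, 64, 64])
```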
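Similarly, the dual-image position encoding is characterized only as encoding the position information of the search region and template images simultaneously. One plausible realization, assuming learnable per-image embeddings added before the two token sequences are concatenated, is sketched below; token counts and embedding sizes are made up for illustration.

```python
# Hypothetical sketch of a dual-image position encoding: learnable
# embeddings for template and search-region tokens, applied jointly so
# both images' positions are encoded in one sequence. All sizes are
# illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn

class DualImagePositionEncoding(nn.Module):
    def __init__(self, dim: int, template_tokens: int, search_tokens: int):
        super().__init__()
        self.template_pos = nn.Parameter(torch.zeros(1, template_tokens, dim))
        self.search_pos = nn.Parameter(torch.zeros(1, search_tokens, dim))
        nn.init.trunc_normal_(self.template_pos, std=0.02)
        nn.init.trunc_normal_(self.search_pos, std=0.02)

    def forward(self, template: torch.Tensor, search: torch.Tensor) -> torch.Tensor:
        # Add per-image position embeddings, then concatenate into the
        # single token sequence consumed by the transformer backbone.
        return torch.cat([template + self.template_pos,
                          search + self.search_pos], dim=1)

# Example: 64 template tokens and 256 search-region tokens, 256-dim each.
pe = DualImagePositionEncoding(dim=256, template_tokens=64, search_tokens=256)
tokens = pe(torch.randn(2, 64, 256), torch.randn(2, 256, 256))
print(tokens.shape)  # torch.Size([2, 320, 256])
```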
Related papers
- Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers [56.37495946212932]
Vision transformers (ViTs) have demonstrated superior accuracy for computer vision tasks compared to convolutional neural networks (CNNs).
This work proposes Quasar-ViT, a hardware-oriented quantization-aware architecture search framework for ViTs.
arXiv Detail & Related papers (2024-07-25T16:35:46Z)
- ZoomTrack: Target-aware Non-uniform Resizing for Efficient Visual Tracking [40.13014036490452]
Transformers have enabled speed-oriented trackers to approach state-of-the-art (SOTA) performance at high speed.
We demonstrate that it is possible to narrow or even close this gap while maintaining high tracking speed with a smaller input size.
arXiv Detail & Related papers (2023-10-16T05:06:13Z)
- LiteTrack: Layer Pruning with Asynchronous Feature Extraction for Lightweight and Efficient Visual Tracking [4.179339279095506]
LiteTrack is an efficient transformer-based tracking model optimized for high-speed operations across various devices.
It achieves a more favorable trade-off between accuracy and efficiency than the other lightweight trackers.
LiteTrack-B9 reaches a competitive 72.2% AO on GOT-10k and 82.4% AUC on TrackingNet, and operates at 171 fps on an NVIDIA 2080Ti GPU.
arXiv Detail & Related papers (2023-09-17T12:01:03Z)
- Data-Model-Circuit Tri-Design for Ultra-Light Video Intelligence on Edge Devices [90.30316433184414]
We propose a data-model-hardware tri-design framework for high-throughput, low-cost, and high-accuracy multi-object tracking (MOT) on HD video streams.
Compared to the state-of-the-art MOT baseline, our tri-design approach achieves a 12.5x latency reduction, a 20.9x effective frame-rate improvement, 5.83x lower power, and 9.78x better energy efficiency, with little loss in accuracy.
arXiv Detail & Related papers (2022-10-16T16:21:40Z)
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- Efficient Visual Tracking via Hierarchical Cross-Attention Transformer [82.92565582642847]
We present an efficient tracking method via a hierarchical cross-attention transformer named HCAT.
Our model runs at about 195 fps on a GPU, 45 fps on a CPU, and 55 fps on the NVIDIA Jetson AGX Xavier edge AI platform.
arXiv Detail & Related papers (2022-03-25T09:45:27Z)
- Efficient Visual Tracking with Exemplar Transformers [98.62550635320514]
We introduce the Exemplar Transformer, an efficient transformer for real-time visual object tracking.
E.T.Track, our visual tracker that incorporates Exemplar Transformer layers, runs at 47 fps on a CPU.
This is up to 8 times faster than other transformer-based models.
arXiv Detail & Related papers (2021-12-17T18:57:54Z)
- Siamese Transformer Pyramid Networks for Real-Time UAV Tracking [3.0969191504482243]
We introduce the Siamese Transformer Pyramid Network (SiamTPN), which inherits the advantages from both CNN and Transformer architectures.
Experiments on both aerial and standard tracking benchmarks show competitive results while operating at high speed.
Our fastest variant runs at over 30 Hz on a single CPU core and obtains an AUC score of 58.1% on the LaSOT dataset.
arXiv Detail & Related papers (2021-10-17T13:48:31Z)