Backbone is All Your Need: A Simplified Architecture for Visual Object
Tracking
- URL: http://arxiv.org/abs/2203.05328v1
- Date: Thu, 10 Mar 2022 12:20:58 GMT
- Title: Backbone is All Your Need: A Simplified Architecture for Visual Object
Tracking
- Authors: Boyu Chen, Peixia Li, Lei Bai, Lei Qiao, Qiuhong Shen, Bo Li, Weihao
Gan, Wei Wu, Wanli Ouyang
- Abstract summary: Existing tracking approaches rely on customized sub-modules and need prior knowledge for architecture selection.
This paper presents a simplified tracking architecture (SimTrack) by leveraging a transformer backbone for joint feature extraction and interaction.
Our SimTrack improves the baseline by 2.5%/2.6% AUC on LaSOT/TNL2K and achieves results competitive with other specialized tracking algorithms without bells and whistles.
- Score: 69.08903927311283
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Exploiting a general-purpose neural architecture to replace hand-wired
designs or inductive biases has recently drawn extensive interest. However,
existing tracking approaches rely on customized sub-modules and need prior
knowledge for architecture selection, hindering the development of tracking within a
more general system. This paper presents a Simplified Tracking architecture
(SimTrack) by leveraging a transformer backbone for joint feature extraction
and interaction. Unlike existing Siamese trackers, we serialize the input
images and concatenate them directly before the one-branch backbone. Feature
interaction in the backbone helps to remove well-designed interaction modules
and produce a more efficient and effective framework. To reduce the information
loss from down-sampling in vision transformers, we further propose a foveal
window strategy, providing more diverse input patches with acceptable
computational costs. Our SimTrack improves the baseline by 2.5%/2.6% AUC
on LaSOT/TNL2K and achieves results competitive with other specialized
tracking algorithms without bells and whistles.
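For intuition, here is a minimal, hedged sketch of the one-branch idea described above: the template and search images are serialized into patch tokens, concatenated into a single sequence, and processed by one shared transformer backbone, so feature extraction and template-search interaction happen in the same attention layers. All names, sizes, the plain nn.TransformerEncoder standing in for the pretrained ViT backbone, and the omission of positional embeddings and the foveal window strategy are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of a one-branch tracker: template and search images are serialized
# into patch tokens, concatenated, and fed through a single shared backbone.
import torch
import torch.nn as nn


class OneBranchTracker(nn.Module):
    def __init__(self, patch=16, dim=256, depth=4, heads=8):
        super().__init__()
        # Shared patch embedding: a strided conv serializes an image into tokens.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)

    def tokens(self, img):
        # (B, 3, H, W) -> (B, N, dim) sequence of patch tokens.
        return self.patch_embed(img).flatten(2).transpose(1, 2)

    def forward(self, template, search):
        z = self.tokens(template)          # exemplar tokens
        x = self.tokens(search)            # search-region tokens
        joint = torch.cat([z, x], dim=1)   # one sequence, one branch
        out = self.backbone(joint)         # attention mixes z and x directly
        # Keep only the search-region tokens for a downstream box head.
        return out[:, z.size(1):]


# Usage: a 128x128 template crop and a 256x256 search crop.
tracker = OneBranchTracker()
feat = tracker(torch.randn(1, 3, 128, 128), torch.randn(1, 3, 256, 256))
print(feat.shape)  # torch.Size([1, 256, 256]): 16x16 search tokens of width 256
```

By contrast, a Siamese tracker would run two backbone passes and fuse the resulting feature maps with a dedicated correlation or attention module; in the sketch above that interaction is absorbed into the backbone itself, which is the simplification the abstract argues for.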
Related papers
- Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
- VisionTraj: A Noise-Robust Trajectory Recovery Framework based on Large-scale Camera Network [18.99662554949384]
Trajectory recovery based on snapshots from the city-wide multi-camera network facilitates urban mobility sensing and driveway optimization.
This paper proposes VisionTraj, the first learning-based model that reconstructs vehicle trajectories from snapshots recorded by road network cameras.
arXiv Detail & Related papers (2023-12-11T14:52:43Z)
- UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation [113.35352122662752]
We present an efficient multi-modal backbone for outdoor 3D perception named UniTR.
UniTR processes a variety of modalities with unified modeling and shared parameters.
UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks.
arXiv Detail & Related papers (2023-08-15T12:13:44Z)
- All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment [23.486297020327257]
The current vision-language (VL) tracking framework consists of three parts, i.e., a visual feature extractor, a language feature extractor, and a fusion model.
We propose an All-in-One framework, which learns joint feature extraction and interaction by adopting a unified transformer backbone.
arXiv Detail & Related papers (2023-07-07T03:51:21Z)
- RIFormer: Keep Your Vision Backbone Effective While Removing Token Mixer [95.71132572688143]
This paper studies how to keep a vision backbone effective while removing token mixers in its basic building blocks.
Token mixers, such as self-attention in vision transformers (ViTs), are intended to exchange information between different spatial tokens but incur considerable computational cost and latency.
arXiv Detail & Related papers (2023-04-12T07:34:13Z)
- Learning Target-aware Representation for Visual Tracking via Informative Interactions [49.552877881662475]
We introduce a novel backbone architecture to improve the target-perception ability of feature representations for tracking.
The proposed GIM module and InBN mechanism are general and applicable to different backbone types, including CNN and Transformer.
arXiv Detail & Related papers (2022-01-07T16:22:27Z)
- TrTr: Visual Tracking with Transformer [29.415900191169587]
We propose a novel tracker network based on a powerful attention mechanism called the Transformer encoder-decoder architecture.
We design the classification and regression heads using the output of the Transformer to localize the target based on shape-agnostic anchors.
Our method performs favorably against state-of-the-art algorithms.
arXiv Detail & Related papers (2021-05-09T02:32:28Z)
- Multi-Stage Progressive Image Restoration [167.6852235432918]
We propose a novel synergistic design that can optimally balance these competing goals.
Our main proposal is a multi-stage architecture that progressively learns restoration functions for the degraded inputs.
The resulting tightly interlinked multi-stage architecture, named MPRNet, delivers strong performance gains on ten datasets.
arXiv Detail & Related papers (2021-02-04T18:57:07Z)