Learning Target-aware Representation for Visual Tracking via Informative
Interactions
- URL: http://arxiv.org/abs/2201.02526v1
- Date: Fri, 7 Jan 2022 16:22:27 GMT
- Authors: Mingzhe Guo, Zhipeng Zhang, Heng Fan, Liping Jing, Yilin Lyu, Bing Li,
Weiming Hu
- Abstract summary: We introduce a novel backbone architecture to improve target-perception ability of feature representation for tracking.
The proposed GIM module and InBN mechanism are general and applicable to different backbone types including CNN and Transformer.
- Score: 49.552877881662475
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a novel backbone architecture that improves the
target-perception ability of feature representations for tracking.
Specifically, we observe that de facto frameworks perform feature matching for
target localization simply using the backbone outputs, so there is no direct
feedback from the matching module to the backbone network, especially its
shallow layers. More concretely, only the matching module can directly access
the target information (in the reference frame), while the representation
learning of the candidate frame is blind to the reference target. As a
consequence, target-irrelevant interference accumulated in the shallow stages
may degrade the feature quality of deeper layers. In this paper, we approach the problem from a
different angle by conducting multiple branch-wise interactions inside the
Siamese-like backbone networks (InBN). At the core of InBN is a general
interaction modeler (GIM) that injects the prior knowledge of reference image
to different stages of the backbone network, leading to better
target-perception and robust distractor-resistance of candidate feature
representation with negligible computation cost. The proposed GIM module and
InBN mechanism are general and applicable to different backbone types including
CNN and Transformer for improvements, as evidenced by our extensive experiments
on multiple benchmarks. In particular, the CNN version (based on SiamCAR)
improves the baseline by 3.2/6.9 absolute SUC points on LaSOT/TNL2K,
respectively. The Transformer version obtains SUC scores of 65.7/52.0 on
LaSOT/TNL2K, on par with recent state-of-the-art trackers. Code and models
will be released.
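The abstract describes GIM only at a high level: it injects prior knowledge of the reference image into different stages of the backbone. As a minimal sketch of one plausible realization, assuming (hypothetically, since the abstract gives no implementation details) that the interaction is an attention-style mixing of reference features into candidate features with a residual connection:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(candidate, reference):
    """One hypothetical GIM-style interaction: every candidate feature
    vector attends over the reference (target) feature vectors and mixes
    the attended reference information back in via a residual add."""
    out = []
    for q in candidate:
        # scaled dot-product scores between this candidate vector and each reference vector
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))
                  for k in reference]
        w = softmax(scores)
        # weighted mixture of reference features
        mixed = [sum(wj * kj[i] for wj, kj in zip(w, reference))
                 for i in range(len(q))]
        # residual: keep the candidate feature, add target-aware context
        out.append([qi + mi for qi, mi in zip(q, mixed)])
    return out

# Toy 2-D features: candidate frame (2 positions), reference target (3 positions).
cand = [[1.0, 0.0], [0.0, 1.0]]
ref = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
updated = cross_attention(cand, ref)
```

Here `cross_attention`, the toy feature vectors, and the residual form are all illustrative assumptions, not the paper's actual design; the point is only that a per-stage interaction like this lets candidate features become target-aware before they ever reach the matching module.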
Related papers
- Unveiling Backbone Effects in CLIP: Exploring Representational Synergies
and Variances [49.631908848868505]
Contrastive Language-Image Pretraining (CLIP) stands out as a prominent method for image representation learning.
We investigate the differences in CLIP performance among various neural architectures.
We propose a simple, yet effective approach to combine predictions from multiple backbones, leading to a notable performance boost of up to 6.34%.
arXiv Detail & Related papers (2023-12-22T03:01:41Z)
- Salient Object Detection in Optical Remote Sensing Images Driven by Transformer [69.22039680783124]
We propose a novel Global Extraction Local Exploration Network (GeleNet) for salient object detection in optical remote sensing images (ORSI-SOD).
Specifically, GeleNet first adopts a transformer backbone to generate four-level feature embeddings with global long-range dependencies.
Extensive experiments on three public datasets demonstrate that the proposed GeleNet outperforms relevant state-of-the-art methods.
arXiv Detail & Related papers (2022-10-16T12:31:59Z)
- OST: Efficient One-stream Network for 3D Single Object Tracking in Point Clouds [6.661881950861012]
We propose a novel one-stream network with the strength of instance-level encoding, which avoids the correlation operations used in previous Siamese networks.
The proposed method achieves strong performance not only for class-specific tracking but also for class-agnostic tracking, with less computation and higher efficiency.
arXiv Detail & Related papers (2022-03-29T15:00:14Z)
- NL-FCOS: Improving FCOS through Non-Local Modules for Object Detection [0.0]
We show that non-local modules combined with an FCOS head (NL-FCOS) are practical and efficient.
We establish state-of-the-art performance in clothing detection and handwritten amount recognition problems.
arXiv Detail & Related papers (2022-03-10T12:20:58Z)
- Backbone is All Your Need: A Simplified Architecture for Visual Object Tracking [69.08903927311283]
Existing tracking approaches rely on customized sub-modules and need prior knowledge for architecture selection.
This paper presents a simplified tracking architecture (SimTrack) by leveraging a transformer backbone for joint feature extraction and interaction.
Our SimTrack improves the baseline with 2.5%/2.6% AUC gains on LaSOT/TNL2K and gets results competitive with other specialized tracking algorithms without bells and whistles.
arXiv Detail & Related papers (2021-10-22T15:36:33Z)
- Recurrence along Depth: Deep Convolutional Neural Networks with Recurrent Layer Aggregation [5.71305698739856]
This paper introduces a concept of layer aggregation to describe how information from previous layers can be reused to better extract features at the current layer.
We propose a very lightweight module, called recurrent layer aggregation (RLA), that makes use of the sequential structure of layers in a deep CNN.
Our RLA module is compatible with many mainstream deep CNNs, including ResNets, Xception and MobileNetV2.
arXiv Detail & Related papers (2020-10-29T15:32:00Z)
- Learning Deep Interleaved Networks with Asymmetric Co-Attention for Image Restoration [65.11022516031463]
We present a deep interleaved network (DIN) that learns how information at different states should be combined for high-quality (HQ) image reconstruction.
In this paper, we propose asymmetric co-attention (AsyCA) which is attached at each interleaved node to model the feature dependencies.
Our presented DIN can be trained end-to-end and applied to various image restoration tasks.
arXiv Detail & Related papers (2020-03-24T09:56:13Z)
- Dynamic Hierarchical Mimicking Towards Consistent Optimization Objectives [73.15276998621582]
We propose a generic feature learning mechanism to advance CNN training with enhanced generalization ability.
Partially inspired by DSN, we fork delicately designed side branches from the intermediate layers of a given neural network.
Experiments on both category and instance recognition tasks demonstrate the substantial improvements of our proposed method.
arXiv Detail & Related papers (2020-03-24T09:56:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.