HiFT: Hierarchical Feature Transformer for Aerial Tracking
- URL: http://arxiv.org/abs/2108.00202v1
- Date: Sat, 31 Jul 2021 10:04:45 GMT
- Title: HiFT: Hierarchical Feature Transformer for Aerial Tracking
- Authors: Ziang Cao, Changhong Fu, Junjie Ye, Bowen Li, and Yiming Li
- Abstract summary: We propose an efficient and effective hierarchical feature transformer (HiFT) for aerial tracking.
HiFT uses multi-level convolutional layers to achieve the interactive fusion of spatial (shallow layers) and semantic (deep layers) cues.
Comprehensive evaluations on four aerial benchmarks have proven the effectiveness of HiFT.
- Score: 16.78336740951222
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most existing Siamese-based tracking methods execute the classification and
regression of the target object based on the similarity maps. However, they
either employ a single map from the last convolutional layer, which degrades the
localization accuracy in complex scenarios or separately use multiple maps for
decision making, introducing intractable computations for aerial mobile
platforms. Thus, in this work, we propose an efficient and effective
hierarchical feature transformer (HiFT) for aerial tracking. Hierarchical
similarity maps generated by multi-level convolutional layers are fed into the
feature transformer to achieve the interactive fusion of spatial (shallow
layers) and semantic cues (deep layers). Consequently, not only is global
contextual information captured, facilitating the target search, but the
end-to-end architecture with the transformer also efficiently learns the
interdependencies among multi-level features, discovering a tracking-tailored
feature space with strong discriminability. Comprehensive evaluations on four
aerial benchmarks have proven the effectiveness of HiFT. Real-world tests on
an aerial platform have strongly validated its practicability at real-time
speed. Our code is available at
https://github.com/vision4robotics/HiFT.
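The fusion pipeline the abstract describes can be sketched in a few lines of PyTorch. The depthwise cross-correlation used to form the similarity maps, the channel widths, and the pooling that aligns per-level token counts are all assumptions for illustration; the authors' actual architecture is in the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def xcorr_depthwise(search, kernel):
    """Depthwise cross-correlation: one similarity map per channel.

    search: (B, C, Hs, Ws) search-region features
    kernel: (B, C, Hk, Wk) template features used as filters
    """
    b, c, hk, wk = kernel.shape
    search = search.reshape(1, b * c, *search.shape[2:])
    kernel = kernel.reshape(b * c, 1, hk, wk)
    out = F.conv2d(search, kernel, groups=b * c)
    return out.reshape(b, c, *out.shape[2:])

class HierarchicalFusion(nn.Module):
    """Fuses similarity maps from several backbone levels with a transformer."""

    def __init__(self, channels=(512, 1024, 2048), dim=256, heads=8, layers=2):
        super().__init__()
        # 1x1 convs project each level's similarity map to a common width.
        self.proj = nn.ModuleList(nn.Conv2d(c, dim, 1) for c in channels)
        enc = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, layers)

    def forward(self, search_feats, template_feats):
        tokens = []
        for proj, s, t in zip(self.proj, search_feats, template_feats):
            sim = proj(xcorr_depthwise(s, t))            # (B, dim, H, W)
            sim = F.adaptive_avg_pool2d(sim, (13, 13))   # align spatial sizes
            tokens.append(sim.flatten(2).transpose(1, 2))  # (B, H*W, dim)
        # Concatenate tokens from all levels; self-attention models their
        # interdependencies, mixing shallow spatial and deep semantic cues.
        fused = self.encoder(torch.cat(tokens, dim=1))
        return fused  # downstream classification/regression heads consume this
```

Pooling every level to a common 13x13 grid is only one way to align the token counts; the positional encodings and the real head design should be taken from the official code.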
Related papers
- ETO: Efficient Transformer-based Local Feature Matching by Organizing Multiple Homography Hypotheses [35.31588965060201]
We propose an efficient transformer-based network architecture for local feature matching.
On the YFCC100M dataset, our matching accuracy is competitive with LoFTR, a state-of-the-art transformer-based architecture.
arXiv Detail & Related papers (2024-10-30T06:39:27Z) - EgoVM: Achieving Precise Ego-Localization using Lightweight Vectorized Maps [9.450650025266379]
We present EgoVM, an end-to-end localization network that achieves comparable localization accuracy to prior state-of-the-art methods.
We employ a set of learnable semantic embeddings to encode the semantic types of map elements and supervise them with semantic segmentation.
We adopt a robust histogram-based pose solver to estimate the optimal pose by searching exhaustively over candidate poses.
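The exhaustive search over candidate poses can be illustrated with a toy routine that scores every (dx, dy, yaw) bin, which is the spirit of a histogram-based solver. The grid conventions, the dot-product scoring, and all tensor shapes below are assumptions, not EgoVM's actual formulation.

```python
import torch
import torch.nn.functional as F

def candidate_pose_scores(bev_feat, map_pts, map_emb, dxs, dys, yaws):
    """Exhaustively score candidate poses: one score per (dx, dy, yaw) bin.

    bev_feat: (C, H, W) online BEV features; grid assumed 1 cell = 1 m
    map_pts:  (N, 2) map-element points in the ego frame (x, y), metres
    map_emb:  (N, C) learned embeddings of those map elements
    dxs, dys, yaws: 1-D tensors of candidate offsets
    """
    C, H, W = bev_feat.shape
    scores = torch.zeros(len(dxs), len(dys), len(yaws))
    for i, dx in enumerate(dxs):
        for j, dy in enumerate(dys):
            for k, yaw in enumerate(yaws):
                c, s = torch.cos(yaw), torch.sin(yaw)
                R = torch.stack([torch.stack([c, -s]), torch.stack([s, c])])
                pts = map_pts @ R.T + torch.stack([dx, dy])
                # Normalise to [-1, 1] for grid_sample (origin at grid centre).
                grid = (pts / torch.tensor([W / 2, H / 2])).view(1, 1, -1, 2)
                feat = F.grid_sample(bev_feat[None], grid,
                                     align_corners=False)  # (1, C, 1, N)
                feat = feat[0, :, 0].T                      # (N, C)
                # Pose score: agreement between sampled features and
                # the map elements' embeddings.
                scores[i, j, k] = (feat * map_emb).sum()
    return scores  # argmax (or a soft-argmax) gives the pose estimate
```

The triple loop is written for clarity; a practical version would batch all candidate transforms through `grid_sample` at once.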
arXiv Detail & Related papers (2023-07-18T06:07:25Z) - Dynamic Perceiver for Efficient Visual Recognition [87.08210214417309]
We propose Dynamic Perceiver (Dyn-Perceiver) to decouple the feature extraction procedure and the early classification task.
A feature branch serves to extract image features, while a classification branch processes a latent code assigned for classification tasks.
Early exits are placed exclusively within the classification branch, thus eliminating the need for linear separability in low-level features.
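A minimal sketch of this decoupling follows, assuming hypothetical layer sizes, a mean-pooled latent readout, and a simple confidence threshold as the exit rule (checked for a single sample); the real Dyn-Perceiver blocks differ.

```python
import torch
import torch.nn as nn

class TwoBranchEarlyExit(nn.Module):
    """Toy two-branch design: a conv feature branch plus a latent
    classification branch with early exits (hypothetical sizes)."""

    def __init__(self, num_classes=1000, dim=256, n_latents=16, stages=3):
        super().__init__()
        chans = [3] + [dim] * stages
        self.feat_stages = nn.ModuleList(
            nn.Sequential(nn.Conv2d(chans[i], dim, 3, stride=2, padding=1),
                          nn.ReLU()) for i in range(stages))
        self.latent = nn.Parameter(torch.randn(1, n_latents, dim))
        # Cross-attention pulls image features into the latent code.
        self.cross = nn.ModuleList(
            nn.MultiheadAttention(dim, 4, batch_first=True)
            for _ in range(stages))
        # One classifier (early exit) per classification-branch stage.
        self.exits = nn.ModuleList(nn.Linear(dim, num_classes)
                                   for _ in range(stages))

    def forward(self, x, threshold=0.9):
        z = self.latent.expand(x.size(0), -1, -1)
        for stage, attn, exit_head in zip(self.feat_stages, self.cross,
                                          self.exits):
            x = stage(x)
            tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, dim)
            z = z + attn(z, tokens, tokens)[0]      # latent attends to feats
            logits = exit_head(z.mean(dim=1))
            # Exit early once the prediction is confident enough (batch
            # size 1 assumed here). The check lives only in the
            # classification branch, so low-level features never need
            # to be linearly separable.
            if logits.softmax(-1).max() > threshold:
                return logits
        return logits
```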
arXiv Detail & Related papers (2023-06-20T03:00:22Z) - Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
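A rough sketch of the idea: each object becomes a token combining an appearance embedding with a spatial-temporal (box plus frame index) encoding, and self-attention mixes the tokens. All sizes and the encoding scheme are illustrative assumptions, not TransSTAM's design.

```python
import torch
import torch.nn as nn

class AppearanceSpatialTemporalEncoder(nn.Module):
    """Toy encoder over object tokens: appearance embedding plus a
    spatial-temporal (box + frame index) encoding, mixed by self-attention."""

    def __init__(self, app_dim=512, dim=256, heads=8, layers=3):
        super().__init__()
        self.app_proj = nn.Linear(app_dim, dim)
        # Encode (cx, cy, w, h, t): box geometry plus the frame index.
        self.st_proj = nn.Linear(5, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, appearance, boxes, frame_idx):
        """appearance: (B, N, app_dim); boxes: (B, N, 4); frame_idx: (B, N)"""
        st = torch.cat([boxes, frame_idx[..., None].float()], dim=-1)
        tokens = self.app_proj(appearance) + self.st_proj(st)
        return self.encoder(tokens)  # (B, N, dim) relation-aware embeddings

# Association scores between tracks and new detections could then be
# cosine similarities between their relation-aware embeddings.
```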
arXiv Detail & Related papers (2022-05-31T01:19:18Z) - Unifying Global-Local Representations in Salient Object Detection with Transformer [55.23033277636774]
We introduce a new attention-based encoder, vision transformer, into salient object detection.
With a global view even in very shallow layers, the transformer encoder preserves more local representations.
Our method significantly outperforms other FCN-based and transformer-based methods on five benchmarks.
arXiv Detail & Related papers (2021-08-05T17:51:32Z) - TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking [74.82415271960315]
We propose a solution named TransMOT to efficiently model the spatial and temporal interactions among objects in a video.
TransMOT is not only more computationally efficient than the traditional Transformer, but it also achieves better tracking accuracy.
The proposed method is evaluated on multiple benchmark datasets including MOT15, MOT16, MOT17, and MOT20.
arXiv Detail & Related papers (2021-04-01T01:49:05Z) - Multiple Convolutional Features in Siamese Networks for Object Tracking [13.850110645060116]
Multiple Features-Siamese Tracker (MFST) is a novel tracking algorithm exploiting several hierarchical feature maps for robust tracking.
MFST achieves high tracking accuracy, outperforming the standard Siamese tracker on object tracking benchmarks.
arXiv Detail & Related papers (2021-03-01T08:02:27Z) - Learning to Generate Content-Aware Dynamic Detectors [62.74209921174237]
We introduce a new perspective on designing efficient detectors: automatically generating sample-adaptive model architectures.
We introduce a coarse-to-fine strategy tailored for object detection to guide the learning of dynamic routing.
Experiments on the MS-COCO dataset demonstrate that CADDet achieves 1.8 points higher mAP with 10% fewer FLOPs compared with vanilla routing.
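Sample-adaptive dynamic routing can be illustrated with a toy block in which a lightweight gate decides, per input, how to weight a cheap and an expensive path; the gate and branch designs here are assumptions, not CADDet's.

```python
import torch
import torch.nn as nn

class DynamicRoutingBlock(nn.Module):
    """Toy sample-adaptive block: a lightweight gate picks, per input,
    how much of a cheap vs. an expensive path to run (soft routing)."""

    def __init__(self, channels=64):
        super().__init__()
        self.cheap = nn.Conv2d(channels, channels, 1)
        self.expensive = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1))
        # The gate sees a pooled summary of the input and emits two weights.
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(channels, 2))

    def forward(self, x):
        w = self.gate(x).softmax(dim=-1)  # (B, 2) routing weights
        # In a hard-routing variant, a branch with near-zero weight is
        # skipped entirely, which is where the FLOPs savings come from.
        return (w[:, 0, None, None, None] * self.cheap(x)
                + w[:, 1, None, None, None] * self.expensive(x))
```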
arXiv Detail & Related papers (2020-12-08T08:05:20Z) - Cross-layer Feature Pyramid Network for Salient Object Detection [102.20031050972429]
We propose a novel Cross-layer Feature Pyramid Network to improve the progressive fusion in salient object detection.
The features distributed to each layer carry both semantics and salient details from all other layers simultaneously, with reduced loss of important information.
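A rough sketch of this cross-layer fusion, assuming a simple resize-and-sum aggregation in which every output level receives features from all input levels; the actual CFPN wiring is more elaborate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLayerFusion(nn.Module):
    """Toy cross-layer fusion: every output level receives (resized)
    features from *all* input levels, so each carries both semantics
    and detail."""

    def __init__(self, channels=(256, 512, 1024), dim=256):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(c, dim, 1) for c in channels)
        self.smooth = nn.ModuleList(nn.Conv2d(dim, dim, 3, padding=1)
                                    for _ in channels)

    def forward(self, feats):
        # Project all levels to a common channel width first.
        feats = [p(f) for p, f in zip(self.proj, feats)]
        outs = []
        for i, target in enumerate(feats):
            h, w = target.shape[2:]
            # Resize every level to this level's resolution and sum.
            pooled = sum(F.interpolate(f, size=(h, w), mode='bilinear',
                                       align_corners=False) for f in feats)
            outs.append(self.smooth[i](pooled))
        return outs
```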
arXiv Detail & Related papers (2020-02-25T14:06:27Z) - PLLay: Efficient Topological Layer based on Persistence Landscapes [24.222495922671442]
PLLay is a novel topological layer for general deep learning models based on persistence landscapes.
We show differentiability with respect to layer inputs, for a general persistent homology with arbitrary filtration.
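The core construction is codeable in a few lines: evaluate the first K persistence landscapes of a diagram on a fixed grid and take a learned weighted sum, which stays differentiable in the (birth, death) coordinates. This is a simplified sketch in PLLay's spirit, not its exact layer; the grid, K, and the scalar readout are assumptions.

```python
import torch
import torch.nn as nn

class LandscapeLayer(nn.Module):
    """Toy topological layer: evaluates the first K persistence
    landscapes on a fixed grid and takes a learned weighted sum.
    Everything is differentiable in the diagram's (birth, death) pairs."""

    def __init__(self, k=3, num_t=32):
        super().__init__()
        self.k = k
        self.register_buffer('t', torch.linspace(0.0, 1.0, num_t))
        self.weight = nn.Parameter(torch.randn(k, num_t))

    def forward(self, diagram):
        """diagram: (N, 2) tensor of (birth, death) pairs."""
        b, d = diagram[:, 0, None], diagram[:, 1, None]        # (N, 1)
        # Tent functions: min(t - b, d - t), clamped at zero.  # (N, T)
        tents = torch.clamp(torch.minimum(self.t - b, d - self.t), min=0)
        # The k-th largest tent value at each t is the k-th landscape.
        k = min(self.k, tents.shape[0])
        lams = tents.topk(k, dim=0).values                     # (k, T)
        return (self.weight[:k] * lams).sum()                  # scalar feature

# Usage: scalar = LandscapeLayer()(torch.tensor([[0.1, 0.6], [0.2, 0.9]]))
```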
arXiv Detail & Related papers (2020-02-07T13:34:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.