Leveraging Tacit Information Embedded in CNN Layers for Visual Tracking
- URL: http://arxiv.org/abs/2010.01204v1
- Date: Fri, 2 Oct 2020 21:16:26 GMT
- Title: Leveraging Tacit Information Embedded in CNN Layers for Visual Tracking
- Authors: Kourosh Meshgi, Maryam Sadat Mirzaei, Shigeyuki Oba
- Abstract summary: We propose an adaptive combination of several CNN layers in a single DCF tracker to address variations in the target's appearance.
Experiments demonstrate that using the additional implicit data of CNN layers significantly improves the performance of the tracker.
- Score: 1.7188280334580193
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Different layers in CNNs provide not only different levels of abstraction for
describing the objects in the input but also encode various implicit
information about them. The activation patterns of different features contain
valuable information about the stream of incoming images: spatial relations,
temporal patterns, and co-occurrence of spatial and spatiotemporal (ST)
features. Studies in the visual tracking literature have, so far, utilized only one
of the CNN layers, a pre-fixed combination of them, or an ensemble of trackers
built upon individual layers. In this study, we employ an adaptive combination
of several CNN layers in a single DCF tracker to address variations of the
target appearances and propose the use of style statistics on both spatial and
temporal properties of the target, directly extracted from CNN layers for
visual tracking. Experiments demonstrate that using the additional implicit
data of CNNs significantly improves the performance of the tracker. Results
demonstrate the effectiveness of using style similarity and activation
consistency regularization in improving its localization and scale accuracy.
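The adaptive layer combination described in the abstract might be sketched as follows. This is a minimal illustration, not the authors' method: the peak-to-sidelobe-ratio (PSR) reliability weighting, the function names, and the assumption that each CNN layer yields a 2-D correlation response map are all choices made here for the sake of the example.

```python
import numpy as np

def psr(response):
    """Peak-to-sidelobe ratio: a common reliability score for DCF response maps."""
    peak = response.max()
    py, px = np.unravel_index(response.argmax(), response.shape)
    mask = np.ones_like(response, dtype=bool)
    mask[max(0, py - 2):py + 3, max(0, px - 2):px + 3] = False  # exclude the peak region
    sidelobe = response[mask]
    return (peak - sidelobe.mean()) / (sidelobe.std() + 1e-8)

def combine_responses(responses):
    """Adaptively weight per-layer response maps by their PSR reliability."""
    weights = np.clip(np.array([psr(r) for r in responses]), 0.0, None)
    weights /= weights.sum() + 1e-8  # normalize so the weights sum to ~1
    return sum(w * r for w, r in zip(weights, responses))
```

Under this sketch, a layer whose response is sharply peaked (high PSR) dominates the fused map, while a flat or ambiguous layer is down-weighted, which is one plausible way to adapt the layer mixture to appearance changes.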
Related papers
- MSSTNet: A Multi-Scale Spatio-Temporal CNN-Transformer Network for Dynamic Facial Expression Recognition [4.512502015606517]
We propose a Multi-Scale Spatio-Temporal CNN-Transformer network (MSSTNet).
Our approach takes spatial features of different scales extracted by a CNN and feeds them into a Multi-scale Embedding Layer (MELayer).
The MELayer extracts multi-scale spatial information and encodes these features before sending them into a Transformer (T-Former)
arXiv Detail & Related papers (2024-04-12T12:30:48Z) - CSP: Self-Supervised Contrastive Spatial Pre-Training for Geospatial-Visual Representations [90.50864830038202]
We present Contrastive Spatial Pre-Training (CSP), a self-supervised learning framework for geo-tagged images.
We use a dual-encoder to separately encode the images and their corresponding geo-locations, and use contrastive objectives to learn effective location representations from images.
CSP significantly boosts the model performance with 10-34% relative improvement with various labeled training data sampling ratios.
arXiv Detail & Related papers (2023-05-01T23:11:18Z) - A novel feature-scrambling approach reveals the capacity of convolutional neural networks to learn spatial relations [0.0]
Convolutional neural networks (CNNs) are one of the most successful computer vision systems to solve object recognition.
Yet it remains poorly understood how CNNs actually make their decisions, what the nature of their internal representations is, and how their recognition strategies differ from humans.
arXiv Detail & Related papers (2022-12-12T16:40:29Z) - RGB-D SLAM Using Attention Guided Frame Association [11.484398586420067]
We propose the use of task specific network attention for RGB-D indoor SLAM.
We integrate layer-wise object attention information (layer gradients) with CNN layer representations to improve frame association performance.
Experiments show promising initial results with improved performance.
arXiv Detail & Related papers (2022-01-28T11:23:29Z) - Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z) - Dense Interaction Learning for Video-based Person Re-identification [75.03200492219003]
We propose a hybrid framework, Dense Interaction Learning (DenseIL), to tackle video-based person re-ID difficulties.
DenseIL contains a CNN encoder and a Dense Interaction (DI) decoder.
Our method consistently and significantly outperforms all state-of-the-art methods on multiple standard video-based re-ID datasets.
arXiv Detail & Related papers (2021-03-16T12:22:08Z) - The Mind's Eye: Visualizing Class-Agnostic Features of CNNs [92.39082696657874]
We propose an approach to visually interpret CNN features given a set of images by creating corresponding images that depict the most informative features of a specific layer.
Our method uses a dual-objective activation and distance loss, without requiring a generator network or modifications to the original model.
arXiv Detail & Related papers (2021-01-29T07:46:39Z) - Video-based Facial Expression Recognition using Graph Convolutional Networks [57.980827038988735]
We introduce a Graph Convolutional Network (GCN) layer into a common CNN-RNN based model for video-based facial expression recognition.
We evaluate our method on three widely-used datasets, CK+, Oulu-CASIA and MMI, and also one challenging wild dataset AFEW8.0.
arXiv Detail & Related papers (2020-10-26T07:31:51Z) - Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics [74.6968179473212]
This paper proposes a novel pretext task to address the self-supervised learning problem.
We compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion.
A neural network is built and trained to yield the statistical summaries given the video frames as inputs.
arXiv Detail & Related papers (2020-08-31T08:31:56Z) - Adaptive Exploitation of Pre-trained Deep Convolutional Neural Networks for Robust Visual Tracking [14.627458410954628]
This paper provides a comprehensive analysis of four commonly used CNN models to determine the best feature maps of each model.
With the aid of analysis results as attribute dictionaries, adaptive exploitation of deep features is proposed to improve the accuracy and robustness of visual trackers.
arXiv Detail & Related papers (2020-08-29T17:09:43Z) - Decoding CNN based Object Classifier Using Visualization [6.666597301197889]
We visualize what type of features are extracted in different convolution layers of CNN.
Visualizing heat map of activation helps us to understand how CNN classifies and localizes different objects in image.
arXiv Detail & Related papers (2020-07-15T05:01:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.