Learning Spatio-Appearance Memory Network for High-Performance Visual
Tracking
- URL: http://arxiv.org/abs/2009.09669v5
- Date: Tue, 6 Apr 2021 05:37:10 GMT
- Title: Learning Spatio-Appearance Memory Network for High-Performance Visual
Tracking
- Authors: Fei Xie, Wankou Yang, Bo Liu, Kaihua Zhang, Wanli Xue, Wangmeng Zuo
- Abstract summary: Existing object tracking usually learns a bounding-box based template to match visual targets across frames, which cannot accurately learn a pixel-wise representation.
This paper presents a novel segmentation-based tracking architecture, which is equipped with a spatio-appearance memory network to learn accurate spatio-temporal correspondence.
- Score: 79.80401607146987
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing visual object tracking usually learns a bounding-box based template
to match the targets across frames, which cannot accurately learn a pixel-wise
representation, thereby being limited in handling severe appearance variations.
To address these issues, much effort has been made on segmentation-based
tracking, which learns a pixel-wise object-aware template and can achieve
higher accuracy than bounding-box template based tracking. However, existing
segmentation-based trackers are ineffective in learning the spatio-temporal
correspondence across frames due to no use of the rich temporal information. To
overcome this issue, this paper presents a novel segmentation-based tracking
architecture, which is equipped with a spatio-appearance memory network to
learn accurate spatio-temporal correspondence. Within it, an appearance memory
network explores spatio-temporal non-local similarity to learn the dense
correspondence between the segmentation mask and the current frame. Meanwhile,
a spatial memory network is modeled as a discriminative correlation filter to
learn the mapping between the feature map and the spatial map. The appearance
memory network helps to filter out noisy samples in the spatial memory network,
while the latter provides the former with a more accurate target geometric
center. This mutual promotion greatly boosts the tracking performance. Without
bells and whistles, our simple yet effective tracking architecture sets a new
state of the art on the VOT2016, VOT2018, VOT2019, GOT-10K, TrackingNet, and
VOT2020 benchmarks. In addition, our tracker outperforms the leading
segmentation-based trackers SiamMask and D3S on two video object segmentation
benchmarks DAVIS16 and DAVIS17 by a large margin. The source codes can be found
at https://github.com/phiphiphi31/DMB.
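To make the two memory modules concrete, the following is a minimal PyTorch sketch, not the authors' implementation: all shapes, names, and the MOSSE-style ridge-regression filter solution are illustrative assumptions. It shows (i) an appearance-memory read as space-time non-local attention between current-frame keys and memorized key/value maps, and (ii) a spatial memory as a discriminative correlation filter learned in the Fourier domain, whose response peak estimates the target's geometric center.

```python
# Minimal sketch of the two memory modules described above. This is NOT the
# authors' code: shapes, names, and the MOSSE-style filter solution are
# illustrative assumptions.
import torch

def appearance_memory_read(q_key, m_key, m_val):
    """Non-local read: every query pixel softly retrieves memory values.

    q_key: (C, H, W)    key features of the current frame
    m_key: (T, C, H, W) key features of T memorized frames
    m_val: (T, D, H, W) value features (encode the past segmentation masks)
    returns (D, H, W)   mask-bearing features aligned to the current frame
    """
    C, H, W = q_key.shape
    T, D = m_val.shape[:2]
    q = q_key.reshape(C, H * W)                              # (C, N)
    k = m_key.permute(1, 0, 2, 3).reshape(C, T * H * W)      # (C, T*N)
    v = m_val.permute(1, 0, 2, 3).reshape(D, T * H * W)      # (D, T*N)
    attn = torch.softmax(q.t() @ k / C ** 0.5, dim=1)        # (N, T*N)
    return (attn @ v.t()).t().reshape(D, H, W)

def dcf_train(feat, label, lam=1e-2):
    """Spatial memory as a correlation filter: per-channel ridge regression
    in the Fourier domain maps features to a Gaussian-shaped spatial map.
    feat: (C, H, W), label: (H, W) Gaussian centered on the target."""
    X = torch.fft.fft2(feat)
    Y = torch.fft.fft2(label)
    return torch.conj(X) * Y / ((torch.conj(X) * X).sum(0) + lam)

def dcf_apply(filt, feat):
    """Response map whose peak estimates the target's geometric center."""
    return torch.fft.ifft2(filt * torch.fft.fft2(feat)).sum(0).real

if __name__ == "__main__":
    H = W = 32
    q_key, m_key = torch.randn(64, H, W), torch.randn(3, 64, H, W)
    m_val = torch.randn(3, 16, H, W)
    read = appearance_memory_read(q_key, m_key, m_val)       # (16, 32, 32)

    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    label = torch.exp(-((yy - 16.0) ** 2 + (xx - 16.0) ** 2) / 18.0)
    filt = dcf_train(q_key, label)
    peak = dcf_apply(filt, q_key).argmax().item()
    print(read.shape, divmod(peak, W))                       # peak near (16, 16)
```

In this sketch the attention read supplies pixel-wise mask evidence while the filter response supplies the target center, mirroring the mutual promotion the abstract describes.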
Related papers
- A Discriminative Single-Shot Segmentation Network for Visual Object
Tracking [13.375369415113534]
We propose a discriminative single-shot segmentation tracker -- D3S2.
A single-shot network applies two target models with complementary geometric properties: one invariant to deformations and one assuming a rigid target (a sketch of the former follows this entry).
D3S2 outperforms the leading segmentation tracker SiamMask on video object segmentation benchmarks.
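A hedged sketch of the deformation-invariant model, in the spirit of D3S's pixel-wise similarity component; the bank sizes, normalization, and foreground/background softmax are assumptions, not the released code:

```python
# Illustrative pixel-wise target model: score each pixel by similarity to
# stored foreground/background feature vectors. Names and sizes are assumed.
import torch

def pixel_similarity_map(feats, fg_bank, bg_bank):
    """feats: (C, H, W); fg_bank/bg_bank: (K, C) stored target/background
    features. Returns per-pixel foreground evidence in [0, 1]."""
    C, H, W = feats.shape
    f = torch.nn.functional.normalize(feats.reshape(C, -1), dim=0)   # (C, HW)
    fg = torch.nn.functional.normalize(fg_bank, dim=1) @ f           # (K, HW)
    bg = torch.nn.functional.normalize(bg_bank, dim=1) @ f
    # max similarity per pixel, softmax over {foreground, background}
    sim = torch.stack((fg.max(0).values, bg.max(0).values))          # (2, HW)
    return torch.softmax(sim, dim=0)[0].reshape(H, W)

p_fg = pixel_similarity_map(torch.randn(64, 32, 32),
                            torch.randn(16, 64), torch.randn(16, 64))
print(p_fg.shape)  # torch.Size([32, 32])
```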
arXiv Detail & Related papers (2021-12-22T12:48:51Z)
- Learning Dynamic Compact Memory Embedding for Deformable Visual Object Tracking [82.34356879078955]
We propose a compact memory embedding to enhance the discriminative power of segmentation-based deformable visual tracking.
Our method outperforms the leading segmentation-based trackers, i.e., D3S and SiamMask, on the DAVIS 2017 benchmark.
arXiv Detail & Related papers (2021-11-23T03:07:12Z)
- Multi-Object Tracking and Segmentation with a Space-Time Memory Network [12.043574473965318]
We propose a method for multi-object tracking and segmentation based on a novel memory-based mechanism to associate tracklets.
The proposed tracker, MeNToS, particularly addresses the long-term data association problem.
arXiv Detail & Related papers (2021-10-21T17:13:17Z)
- Prototypical Cross-Attention Networks for Multiple Object Tracking and Segmentation [95.74244714914052]
Multiple object tracking and segmentation requires detecting, tracking, and segmenting objects belonging to a set of given classes.
We propose Prototypical Cross-Attention Network (PCAN), capable of leveraging rich spatio-temporal information online (a sketch of the idea follows this entry).
PCAN outperforms current video instance tracking and segmentation competition winners on the YouTube-VIS and BDD100K datasets.
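A minimal sketch of prototypical cross-attention under stated assumptions: the space-time memory is condensed into K prototypes with a few soft k-means (EM) steps, and the current frame cross-attends to the prototypes rather than to every memorized pixel. K, the iteration count, and all shapes are illustrative, not PCAN's released code.

```python
# Condense a space-time memory into K prototypes, then cross-attend to them,
# keeping the online cost O(M*K) instead of O(M*N). Names are assumed.
import torch

def prototypes(memory, K=8, iters=3):
    """memory: (N, C) flattened space-time features -> (K, C) prototypes."""
    protos = memory[torch.randperm(memory.shape[0])[:K]]    # init from samples
    for _ in range(iters):
        # E-step: soft assignment of each memory feature to each prototype
        assign = torch.softmax(memory @ protos.t(), dim=1)  # (N, K)
        # M-step: prototypes become assignment-weighted means
        protos = (assign.t() @ memory) / (assign.sum(0).unsqueeze(1) + 1e-6)
    return protos

def cross_attend(query, protos):
    """query: (M, C) current-frame features attend to (K, C) prototypes."""
    attn = torch.softmax(query @ protos.t() / protos.shape[1] ** 0.5, dim=1)
    return attn @ protos                                    # (M, C)

mem = torch.randn(4 * 32 * 32, 64)       # 4 memorized frames, 32x32, C=64
q = torch.randn(32 * 32, 64)
out = cross_attend(q, prototypes(mem))   # (1024, 64)
print(out.shape)
```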
arXiv Detail & Related papers (2021-06-22T17:57:24Z)
- TrTr: Visual Tracking with Transformer [29.415900191169587]
We propose a novel tracker network based on a powerful attention mechanism, the Transformer encoder-decoder architecture.
We design classification and regression heads on the Transformer output to localize the target with shape-agnostic anchors (a sketch follows this entry).
Our method performs favorably against state-of-the-art algorithms.
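An illustrative sketch of such an encoder-decoder tracking head; the dimensions, token counts, and (l, t, r, b) regression target are assumptions, not the TrTr implementation:

```python
# Transformer encoder-decoder head: template tokens go to the encoder, search
# tokens to the decoder; each search location predicts a score and a
# shape-agnostic box offset instead of matching predefined anchor shapes.
import torch
import torch.nn as nn

class TransformerTrackerHead(nn.Module):
    def __init__(self, d=256, heads=8, layers=2):
        super().__init__()
        self.tf = nn.Transformer(d_model=d, nhead=heads,
                                 num_encoder_layers=layers,
                                 num_decoder_layers=layers,
                                 batch_first=True)
        self.cls = nn.Linear(d, 1)   # target / background per location
        self.reg = nn.Linear(d, 4)   # (l, t, r, b) distances to box edges

    def forward(self, template_feats, search_feats):
        h = self.tf(template_feats, search_feats)   # (B, N_search, d)
        return self.cls(h).squeeze(-1), self.reg(h)

head = TransformerTrackerHead()
z = torch.randn(1, 64, 256)       # 8x8 template tokens
x = torch.randn(1, 256, 256)      # 16x16 search-region tokens
scores, boxes = head(z, x)
print(scores.shape, boxes.shape)  # (1, 256) (1, 256, 4)
```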
arXiv Detail & Related papers (2021-05-09T02:32:28Z)
- LiDAR-based Recurrent 3D Semantic Segmentation with Temporal Memory Alignment [0.0]
We propose a recurrent segmentation architecture (RNN) that takes a single range-image frame as input.
An alignment strategy, which we call Temporal Memory Alignment, uses ego motion to temporally align the memory between consecutive frames in feature space (a sketch follows this entry).
We demonstrate the benefits of the presented approach on two large-scale datasets and compare it to several state-of-the-art methods.
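A minimal sketch of such an alignment step, assuming the ego-motion-induced per-pixel displacement field is already given; the derivation of that flow from the actual pose and range geometry is omitted, and names are illustrative:

```python
# Warp the previous frame's hidden memory into the current frame's range-image
# coordinates using an ego-motion flow field, so recurrent fusion compares
# features of the same physical points. Not the paper's code.
import torch
import torch.nn.functional as F

def align_memory(memory, ego_flow):
    """memory: (1, C, H, W) previous hidden state.
    ego_flow: (1, H, W, 2) per-pixel displacement in pixels induced by the
    known ego motion between the two frames (an assumed, given input)."""
    _, _, H, W = memory.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).float().unsqueeze(0)  # (1, H, W, 2)
    tgt = base + ego_flow
    # normalize pixel coordinates to [-1, 1] for grid_sample
    tgt[..., 0] = 2 * tgt[..., 0] / (W - 1) - 1
    tgt[..., 1] = 2 * tgt[..., 1] / (H - 1) - 1
    return F.grid_sample(memory, tgt, align_corners=True)

mem = torch.randn(1, 32, 64, 512)       # C=32 feature memory, 64x512 range image
flow = torch.zeros(1, 64, 512, 2)
flow[..., 0] = 3.0                      # e.g. a 3-pixel horizontal (yaw) shift
aligned = align_memory(mem, flow)       # ready to fuse with current features
print(aligned.shape)                    # torch.Size([1, 32, 64, 512])
```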
arXiv Detail & Related papers (2021-03-03T09:01:45Z)
- Spatiotemporal Graph Neural Network based Mask Reconstruction for Video Object Segmentation [70.97625552643493]
This paper addresses the task of segmenting class-agnostic objects in the semi-supervised setting.
We propose a novel graph neural network (TG-Net), which captures local contexts by utilizing all proposals.
arXiv Detail & Related papers (2020-12-10T07:57:44Z)
- DS-Net: Dynamic Spatiotemporal Network for Video Salient Object Detection [78.04869214450963]
We propose a novel dynamic spatiotemporal network (DS-Net) for more effective fusion of temporal and spatial information.
We show that the proposed method achieves performance superior to state-of-the-art algorithms.
arXiv Detail & Related papers (2020-12-09T06:42:30Z)
- Towards Accurate Pixel-wise Object Tracking by Attention Retrieval [50.06436600343181]
We propose an attention retrieval network (ARN) to impose soft spatial constraints on backbone features (a sketch follows this entry).
We set a new state of the art on the recent pixel-wise object tracking benchmark VOT2020 while running at 40 fps.
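A speculative sketch of one way to impose such a soft spatial constraint; the summary gives no ARN internals, so the template descriptor and sigmoid gating here are assumptions, not the paper's method:

```python
# Re-weight backbone features with an attention map retrieved by comparing
# them against a target template, suppressing background responses before
# mask prediction. Names and the gating function are assumed.
import torch

def soft_spatial_constraint(feats, template):
    """feats: (C, H, W) backbone features; template: (C,) target descriptor."""
    C, H, W = feats.shape
    attn = torch.sigmoid((feats.reshape(C, -1).t() @ template) / C ** 0.5)
    return (feats.reshape(C, -1) * attn).reshape(C, H, W)   # gated (C, H, W)

out = soft_spatial_constraint(torch.randn(64, 32, 32), torch.randn(64))
print(out.shape)  # torch.Size([64, 32, 32])
```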
arXiv Detail & Related papers (2020-08-06T16:25:23Z)