Towards Accurate Pixel-wise Object Tracking by Attention Retrieval
- URL: http://arxiv.org/abs/2008.02745v3
- Date: Tue, 8 Sep 2020 02:06:33 GMT
- Title: Towards Accurate Pixel-wise Object Tracking by Attention Retrieval
- Authors: Zhipeng Zhang, Bing Li, Weiming Hu, Houwen Peng
- Abstract summary: We propose an attention retrieval network (ARN) to perform soft spatial constraints on backbone features.
We set a new state-of-the-art on the recent pixel-wise object tracking benchmark VOT2020 while running at 40 fps.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The encoding of the target in object tracking has recently moved
from coarse bounding boxes to fine-grained segmentation maps. Revisiting de
facto real-time approaches that are capable of predicting a mask during
tracking, we observed that they usually fork a light branch from the backbone
network for segmentation. Although efficient, directly fusing backbone features
without considering the negative influence of background clutter tends to
introduce false-negative predictions, degrading segmentation accuracy. To
mitigate this problem, we propose an attention retrieval network (ARN) that
imposes soft spatial constraints on backbone features. We first build a
look-up table (LUT) from the ground-truth mask in the starting frame, and then
retrieve the LUT to obtain an attention map for spatial constraints. Moreover,
we introduce a multi-resolution multi-stage segmentation network (MMS) to
further weaken the influence of background clutter by reusing the predicted
mask to filter backbone features. Our approach sets a new state-of-the-art on
the recent pixel-wise object tracking benchmark VOT2020 while running at 40
fps. Notably, the proposed model surpasses SiamMask by 11.7/4.2/5.5 points on
VOT2020, DAVIS2016, and DAVIS2017, respectively. We will release our code at
https://github.com/researchmm/TracKit.
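The LUT-and-retrieve idea from the abstract could be sketched as follows. This is a heavily simplified, hypothetical illustration: the summary does not describe the actual codebook or quantizer, so `build_lut`, `retrieve_attention`, and the scalar quantization over mean channel activations are illustrative stand-ins, not the paper's method.

```python
import numpy as np

def build_lut(features, mask, n_codes=16):
    """Build a look-up table of per-code foreground probabilities.

    features: (H, W, C) backbone features of the starting frame.
    mask: (H, W) binary ground-truth mask of the starting frame.
    """
    lo, hi = features.min(), features.max()
    # Toy scalar quantizer over mean channel activation (a stand-in
    # for whatever codebook the real method uses).
    bins = np.linspace(lo, hi, n_codes)
    keys = np.digitize(features.mean(axis=-1), bins)
    lut = np.zeros(n_codes + 2)
    counts = np.zeros(n_codes + 2)
    np.add.at(lut, keys, mask.astype(float))  # foreground hits per code
    np.add.at(counts, keys, 1.0)              # total pixels per code
    return lut / np.maximum(counts, 1.0), (lo, hi, n_codes)

def retrieve_attention(features, lut, quant):
    """Retrieve a soft attention map in [0, 1] for a new frame."""
    lo, hi, n_codes = quant
    keys = np.digitize(features.mean(axis=-1), np.linspace(lo, hi, n_codes))
    return lut[keys]  # (H, W) attention map

# Toy usage: foreground pixels have high activations, background low.
feats0 = np.zeros((8, 8, 4)); feats0[2:6, 2:6] = 1.0
mask0 = np.zeros((8, 8)); mask0[2:6, 2:6] = 1
lut, quant = build_lut(feats0, mask0)
attn = retrieve_attention(feats0, lut, quant)
```

The retrieved attention map could then be multiplied element-wise with backbone features before the segmentation head, implementing the "soft spatial constraint" the abstract describes.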
Related papers
- LAC-Net: Linear-Fusion Attention-Guided Convolutional Network for Accurate Robotic Grasping Under the Occlusion [79.22197702626542]
This paper introduces a framework that explores amodal segmentation for robotic grasping in cluttered scenes.
We propose a Linear-fusion Attention-guided Convolutional Network (LAC-Net)
The results on different datasets show that our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-08-06T14:50:48Z)
- Visual Multi-Object Tracking with Re-Identification and Occlusion Handling using Labeled Random Finite Sets [10.618186767487993]
This paper proposes an online visual multi-object tracking (MOT) algorithm that resolves object appearance-reappearance and occlusion.
Our solution is based on the labeled random finite set (LRFS) filtering approach.
We propose a fuzzy detection model that takes into consideration the overlapping areas between tracks and their sizes.
arXiv Detail & Related papers (2024-07-11T21:15:21Z)
- Robust Visual Tracking by Segmentation [103.87369380021441]
Estimating the target extent poses a fundamental challenge in visual object tracking.
We propose a segmentation-centric tracking pipeline that produces a highly accurate segmentation mask.
Our tracker is able to better learn a target representation that clearly differentiates the target in the scene from background content.
arXiv Detail & Related papers (2022-03-21T17:59:19Z)
- Object Propagation via Inter-Frame Attentions for Temporally Stable Video Instance Segmentation [51.68840525174265]
Video instance segmentation aims to detect, segment, and track objects in a video.
Current approaches extend image-level segmentation algorithms to the temporal domain.
We propose a video instance segmentation method that alleviates the problem due to missing detections.
arXiv Detail & Related papers (2021-11-15T04:15:57Z)
- Spatiotemporal Graph Neural Network based Mask Reconstruction for Video Object Segmentation [70.97625552643493]
This paper addresses the task of segmenting class-agnostic objects in a semi-supervised setting.
We propose a novel graph neural network (TG-Net) which captures local contexts by utilizing all proposals.
arXiv Detail & Related papers (2020-12-10T07:57:44Z)
- Learning Spatio-Appearance Memory Network for High-Performance Visual Tracking [79.80401607146987]
Existing object trackers usually learn a bounding-box based template to match visual targets across frames, which cannot accurately learn a pixel-wise representation.
This paper presents a novel segmentation-based tracking architecture, equipped with a spatio-temporal memory network to learn accurate spatio-temporal correspondence.
arXiv Detail & Related papers (2020-09-21T08:12:02Z)