SiamMask: A Framework for Fast Online Object Tracking and Segmentation
- URL: http://arxiv.org/abs/2207.02088v1
- Date: Tue, 5 Jul 2022 14:47:17 GMT
- Title: SiamMask: A Framework for Fast Online Object Tracking and Segmentation
- Authors: Weiming Hu, Qiang Wang, Li Zhang, Luca Bertinetto, Philip H.S. Torr
- Abstract summary: SiamMask is a framework to perform both visual object tracking and video object segmentation, in real-time, with the same simple method.
We show that it is possible to extend the framework to handle multiple object tracking and segmentation by simply re-using the multi-task model.
It yields real-time state-of-the-art results on visual object tracking benchmarks, while at the same time demonstrating competitive performance at high speed on video object segmentation benchmarks.
- Score: 96.61632757952292
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper we introduce SiamMask, a framework to perform both visual
object tracking and video object segmentation, in real-time, with the same
simple method. We improve the offline training procedure of popular
fully-convolutional Siamese approaches by augmenting their losses with a binary
segmentation task. Once the offline training is completed, SiamMask only
requires a single bounding box for initialization and can simultaneously carry
out visual object tracking and segmentation at high frame-rates. Moreover, we
show that it is possible to extend the framework to handle multiple object
tracking and segmentation by simply re-using the multi-task model in a cascaded
fashion. Experimental results show that our approach has high processing
efficiency, at around 55 frames per second. It yields real-time
state-of-the-art results on visual object tracking benchmarks, while at the
same time demonstrating competitive performance at high speed on video
object segmentation benchmarks.
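The training change at the heart of the method is compact enough to sketch. Below is a minimal PyTorch-style illustration (not the authors' released code) of a fully-convolutional Siamese tracker's score and box losses augmented with the binary segmentation term; the loss weights and tensor layouts are assumptions for illustration.

```python
# Minimal sketch (not the authors' released code) of SiamMask's multi-task
# objective: the usual Siamese score/box losses augmented with a binary
# segmentation term. Loss weights and tensor layouts are illustrative.
import torch
import torch.nn.functional as F

def siammask_loss(score_logits, score_labels, box_pred, box_target,
                  mask_logits, mask_target, lambda_box=1.0, lambda_mask=32.0):
    """score_logits/labels: (N, 1) response-map logits and {0, 1} floats;
    box_pred/target: (N, 4) offsets; mask_logits/target: (N, H, W) floats."""
    # Target-vs-background classification of the Siamese response map.
    l_score = F.binary_cross_entropy_with_logits(score_logits, score_labels)

    pos = score_labels.squeeze(1) > 0.5  # positive candidate windows
    if pos.any():
        # Box regression and the added pixel-wise segmentation task are
        # computed only on candidates that actually contain the target.
        l_box = F.smooth_l1_loss(box_pred[pos], box_target[pos])
        l_mask = F.binary_cross_entropy_with_logits(mask_logits[pos],
                                                    mask_target[pos])
    else:  # keep the graph connected when a batch has no positives
        l_box = box_pred.sum() * 0.0
        l_mask = mask_logits.sum() * 0.0

    return l_score + lambda_box * l_box + lambda_mask * l_mask
```

At test time, the mask branch output of the best-scoring candidate yields the segmentation, from which a (rotated) bounding box can be derived for tracking benchmarks.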
Related papers
- Learning Motion and Temporal Cues for Unsupervised Video Object Segmentation [49.113131249753714]
We propose an efficient algorithm, termed MTNet, which concurrently exploits motion and temporal cues.
MTNet is devised by effectively merging appearance and motion features during the feature extraction process within encoders.
We employ a cascade of decoders across all feature levels to optimally exploit the derived features.
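As a rough illustration of the per-level appearance-motion fusion the summary describes (the concrete fusion operator, concat plus 1x1 convolution, and all names below are assumptions, not MTNet's actual modules):

```python
# Rough sketch of per-level appearance/motion feature fusion of the kind
# the MTNet summary describes; the fusion operator and all names are
# assumptions, not the paper's actual modules.
import torch
import torch.nn as nn

class LevelFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Mix the two streams and project back to the level's channel width.
        self.project = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, appearance_feat, motion_feat):
        fused = torch.cat([appearance_feat, motion_feat], dim=1)
        return self.project(fused)

# One fusion block per encoder level, so every decoder stage in the
# cascade sees features that already combine both cues.
levels = [64, 128, 256]
fusers = nn.ModuleList(LevelFusion(c) for c in levels)
```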
arXiv Detail & Related papers (2025-01-14T03:15:46Z)
- Contrastive Lift: 3D Object Instance Segmentation by Slow-Fast Contrastive Fusion [110.84357383258818]
We propose a novel approach to lift 2D segments to 3D and fuse them by means of a neural field representation.
The core of our approach is a slow-fast clustering objective function, which is scalable and well-suited for scenes with a large number of objects.
Our approach outperforms the state-of-the-art on challenging scenes from the ScanNet, Hypersim, and Replica datasets.
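The slow-fast idea can be loosely sketched as a slowly updated (EMA) copy of the embedding field providing stable clustering targets for the fast field; the specifics below are assumptions in that spirit, not the paper's actual objective.

```python
# Very loose sketch of a slow-fast objective: an EMA ("slow") copy of the
# embedding model provides stable targets for the "fast" one. Names and
# details are assumptions, not the paper's actual loss.
import torch

@torch.no_grad()
def ema_update(slow_model, fast_model, momentum=0.99):
    # The slow field trails the fast one, giving stable cluster targets.
    for ps, pf in zip(slow_model.parameters(), fast_model.parameters()):
        ps.mul_(momentum).add_(pf, alpha=1.0 - momentum)

def slow_fast_loss(fast_emb, slow_emb, instance_ids):
    """fast_emb/slow_emb: (N, D) pixel embeddings; instance_ids: (N,)."""
    loss = 0.0
    for i in instance_ids.unique():
        m = instance_ids == i
        # Pull each fast embedding toward the slow-field mean of its instance.
        target = slow_emb[m].mean(dim=0).detach()
        loss = loss + ((fast_emb[m] - target) ** 2).sum(dim=1).mean()
    return loss / instance_ids.unique().numel()
```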
arXiv Detail & Related papers (2023-06-07T17:57:45Z)
- Robust Online Video Instance Segmentation with Track Queries [15.834703258232002]
We propose a fully online transformer-based video instance segmentation model that performs comparably to top offline methods on the YouTube-VIS 2019 benchmark.
We show that, when combined with a strong enough image segmentation architecture, track queries can exhibit impressive accuracy while not being constrained to short videos.
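For readers unfamiliar with track queries, the general mechanism (decoder queries matched to objects at frame t are fed back as inputs at frame t+1, as in query-propagation trackers) can be sketched as follows; this is a generic illustration, not this paper's exact model.

```python
# Bare-bones sketch of the track-query mechanism: queries that represented
# objects in one frame are re-used as decoder inputs in the next frame so
# identities persist. All names and shapes are assumptions.
import torch
import torch.nn as nn

d_model, num_new = 256, 10
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=2,
)
new_queries = torch.randn(num_new, d_model)  # learned in a real model

# Stand-in per-frame encoder features: (batch=1, tokens, d_model).
video_features = [torch.randn(1, 64, d_model) for _ in range(3)]

track_queries = torch.zeros(0, d_model)  # no tracks before the first frame
for frame_feat in video_features:
    queries = torch.cat([track_queries, new_queries], dim=0).unsqueeze(0)
    output = decoder(queries, frame_feat)
    # In the real method, only queries whose predictions match an object
    # survive as next frame's track queries; here we keep all of them.
    track_queries = output.squeeze(0)
```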
arXiv Detail & Related papers (2022-11-16T18:50:14Z)
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
- Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object (VOS)
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance on the DAVIS benchmark in both speed and accuracy, without complicated bells and whistles, running at 0.14 seconds per frame with a J&F measure of 75.9%.
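Dynamic template matching of the kind the summary mentions can be illustrated as matching frame features against a template pool that evolves with confident predictions; the update rule and all names below are assumptions, not the paper's exact mechanism.

```python
# Illustrative sketch of time-evolving template matching: cosine-match the
# current frame against a template pool, then blend confident observations
# back into the pool. Details are assumptions, not the paper's mechanism.
import torch
import torch.nn.functional as F

def match_templates(frame_feat, templates):
    """Cosine-match frame features (D, H, W) against templates (K, D)."""
    f = F.normalize(frame_feat.flatten(1), dim=0)   # (D, HW), unit columns
    t = F.normalize(templates, dim=1)               # (K, D), unit rows
    scores = t @ f                                  # (K, HW) similarities
    return scores.max(dim=0).values.view(frame_feat.shape[1:])

def update_templates(templates, frame_feat, response, thresh=0.8, momentum=0.9):
    """Evolve the pool toward the best-matching location when confident."""
    conf, idx = response.flatten().max(dim=0)
    if conf > thresh:
        new_feat = frame_feat.flatten(1)[:, idx]    # (D,) observed feature
        # Blend the new observation into the last (oldest) template slot.
        templates[-1] = momentum * templates[-1] + (1 - momentum) * new_feat
    return templates
```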
arXiv Detail & Related papers (2020-07-11T05:44:16Z)
- Learning Fast and Robust Target Models for Video Object Segmentation [83.3382606349118]
Video object segmentation (VOS) is a highly challenging problem since the initial mask, defining the target object, is only given at test-time.
Most previous approaches fine-tune segmentation networks on the first frame, resulting in impractical frame-rates and risk of overfitting.
We propose a novel VOS architecture consisting of two network components.
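In this design, one component is a lightweight discriminative target model fit on first-frame features in a few optimization steps at test time, and the other is an offline-trained segmentation network that refines its coarse score map into the final mask. A hedged sketch of the first component follows; its linear form and training details are assumptions for illustration.

```python
# Hedged sketch of the lightweight test-time component: a tiny linear
# target model fit on first-frame features in a few cheap steps, whose
# coarse score map a larger offline-trained network would then refine.
import torch
import torch.nn as nn
import torch.nn.functional as F

def fit_target_model(first_frame_feat, first_frame_mask, steps=20, lr=1e-1):
    """first_frame_feat: (D, H, W); first_frame_mask: (H, W) in {0, 1}."""
    d = first_frame_feat.shape[0]
    target_model = nn.Conv2d(d, 1, kernel_size=1)   # the light component
    opt = torch.optim.Adam(target_model.parameters(), lr=lr)
    x = first_frame_feat.unsqueeze(0)
    y = first_frame_mask.float().view(1, 1, *first_frame_mask.shape)
    for _ in range(steps):  # a few steps instead of full fine-tuning
        opt.zero_grad()
        loss = F.binary_cross_entropy_with_logits(target_model(x), y)
        loss.backward()
        opt.step()
    return target_model  # its coarse scores feed the refinement network
```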
arXiv Detail & Related papers (2020-02-27T21:58:06Z)