LSMVOS: Long-Short-Term Similarity Matching for Video Object Segmentation
- URL: http://arxiv.org/abs/2009.00771v1
- Date: Wed, 2 Sep 2020 01:32:05 GMT
- Title: LSMVOS: Long-Short-Term Similarity Matching for Video Object Segmentation
- Authors: Zhang Xuerui, Yuan Xia
- Abstract summary: Semi-supervised video object segmentation refers to segmenting the object in subsequent frames given the object label in the first frame.
This paper explores a new propagation method, using a short-term matching module to extract information from the previous frame and apply it during propagation.
By combining the long-term matching module with the short-term matching module, the whole network achieves efficient video object segmentation without online fine-tuning.
- Score: 3.3518869877513895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Objective: Semi-supervised video object segmentation refers to segmenting the
object in subsequent frames given the object label in the first frame. Existing
algorithms are mostly based on matching and propagation strategies, which often
make use of the previous frame's mask or optical flow. This paper explores a new
propagation method that uses a short-term matching module to extract information
from the previous frame and apply it during propagation, and proposes the
Long-Short-Term similarity matching network for video object segmentation (LSMVOS).
Method: Pixel-level matching and correlation are conducted in a long-term matching
module (against the first frame) and a short-term matching module (against the
previous frame), yielding a global similarity map and a local similarity map,
together with the feature map of the current frame and the mask of the previous
frame. After two refinement networks, the final result is produced by a
segmentation network. Results: In experiments on the DAVIS 2016 and 2017 datasets,
the method achieves a favorable average of region similarity and contour accuracy
without online fine-tuning, reaching 86.5% on single-object (DAVIS 2016) and 77.4%
on multi-object (DAVIS 2017) segmentation, while segmenting 21 frames per second.
Conclusion: The proposed short-term matching module extracts information from the
previous frame more effectively than the mask alone. By combining the long-term
matching module with the short-term matching module, the whole network achieves
efficient video object segmentation without online fine-tuning.
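To make the matching step concrete, below is a minimal sketch, not the authors' code, of long- and short-term pixel-level similarity matching in PyTorch. The feature shapes, the cosine-similarity choice, and the per-pixel max reduction are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def similarity_map(curr_feat, ref_feat):
    """Pixel-level similarity between current- and reference-frame features.
    curr_feat, ref_feat: (C, H, W) maps from a shared encoder (assumed).
    Returns (H*W, H*W): similarity of every current pixel to every reference pixel."""
    C, H, W = curr_feat.shape
    q = F.normalize(curr_feat.reshape(C, H * W), dim=0)  # unit-norm current features
    k = F.normalize(ref_feat.reshape(C, H * W), dim=0)   # unit-norm reference features
    return q.t() @ k                                     # cosine similarities

def global_local_maps(curr_feat, first_feat, prev_feat):
    """Long-term matching against the annotated first frame yields a global
    similarity map; short-term matching against the previous frame yields a
    local one. Both are reduced here to a per-pixel best match, shaped (1, H, W)."""
    _, H, W = curr_feat.shape
    global_sim = similarity_map(curr_feat, first_feat).max(dim=1).values
    local_sim = similarity_map(curr_feat, prev_feat).max(dim=1).values
    return global_sim.view(1, H, W), local_sim.view(1, H, W)

# Toy example with hypothetical sizes: 64 channels on a 30x30 feature grid.
f_curr, f_first, f_prev = (torch.randn(64, 30, 30) for _ in range(3))
g_map, l_map = global_local_maps(f_curr, f_first, f_prev)
print(g_map.shape, l_map.shape)  # torch.Size([1, 30, 30]) torch.Size([1, 30, 30])
```

In the paper's pipeline, these maps are presumably fused (e.g., concatenated) with the current-frame features and the previous-frame mask before the two refinement networks and the segmentation network; a real short-term module would also restrict matching to a local window around each pixel, which the flat matching above ignores.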
Related papers
- Efficient Long-Short Temporal Attention Network for Unsupervised Video Object Segmentation [23.645412918420906]
Unsupervised Video Object Segmentation (VOS) aims at identifying the contours of primary foreground objects in videos without any prior knowledge.
Previous methods do not fully use spatial-temporal context and fail to tackle this challenging task in real time.
This motivates us to develop an efficient Long-Short Temporal Attention network (termed LSTA) for unsupervised VOS task from a holistic view.
arXiv Detail & Related papers (2023-09-21T01:09:46Z)
- Learning to Associate Every Segment for Video Panoptic Segmentation [123.03617367709303]
We learn coarse segment-level matching and fine pixel-level matching together.
We show that our per-frame computation model can achieve new state-of-the-art results on the Cityscapes-VPS and VIPER datasets.
arXiv Detail & Related papers (2021-06-17T13:06:24Z)
- Guided Interactive Video Object Segmentation Using Reliability-Based Attention Maps [55.94785248905853]
We propose a novel guided interactive segmentation (GIS) algorithm for video objects to improve the segmentation accuracy and reduce the interaction time.
We develop the intersection-aware propagation module to propagate segmentation results to neighboring frames.
Experimental results demonstrate that the proposed algorithm provides more accurate segmentation results at a faster speed than conventional algorithms.
arXiv Detail & Related papers (2021-04-21T07:08:57Z)
- Video Instance Segmentation with a Propose-Reduce Paradigm [68.59137660342326]
Video instance segmentation (VIS) aims to segment and associate all instances of predefined classes for each frame in videos.
Prior methods usually obtain segmentation for a frame or clip first, and then merge the incomplete results by tracking or matching.
We propose a new paradigm, Propose-Reduce, which generates complete sequences for input videos in a single step.
arXiv Detail & Related papers (2021-03-25T10:58:36Z)
- Local Memory Attention for Fast Video Semantic Segmentation [157.7618884769969]
We propose a novel neural network module that transforms an existing single-frame semantic segmentation model into a video semantic segmentation pipeline.
Our approach aggregates a rich representation of the semantic information in past frames into a memory module (a minimal sketch of such a memory read appears after this list).
We observe improvements in segmentation performance on Cityscapes of 1.7% and 2.1% mIoU, while increasing the inference time of ERFNet by only 1.5 ms.
arXiv Detail & Related papers (2021-01-05T18:57:09Z)
- Interactive Video Object Segmentation Using Global and Local Transfer Modules [51.93009196085043]
We develop a deep neural network that consists of an annotation network (A-Net) and a transfer network (T-Net).
Given user scribbles on a frame, A-Net yields a segmentation result based on an encoder-decoder architecture.
We train the entire network in two stages, by emulating user scribbles and employing an auxiliary loss.
arXiv Detail & Related papers (2020-07-16T06:49:07Z)
- Revisiting Sequence-to-Sequence Video Object Segmentation with Multi-Task Loss and Skip-Memory [4.343892430915579]
Video Object Segmentation (VOS) is an active research area in the visual domain.
Current approaches lose objects in longer sequences, especially when the object is small or briefly occluded.
We build upon a sequence-to-sequence approach that employs an encoder-decoder architecture together with a memory module for exploiting the sequential data.
arXiv Detail & Related papers (2020-04-25T15:38:09Z)
- CRNet: Cross-Reference Networks for Few-Shot Segmentation [59.85183776573642]
Few-shot segmentation aims to learn a segmentation model that can be generalized to novel classes with only a few training images.
With a cross-reference mechanism, our network can better find the co-occurrent objects in the two images.
Experiments on the PASCAL VOC 2012 dataset show that our network achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-03-24T04:55:43Z)
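For the Local Memory Attention entry above, here is the promised sketch: a hypothetical single-head dot-product attention read over a memory of past-frame features. The shapes, the scaling factor, and the flat memory layout are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def memory_read(query, mem_keys, mem_values):
    """Attention read over a memory of past-frame features.
    query:      (N, C) current-frame pixel features (N = H*W)
    mem_keys:   (M, C) keys for memorized past-frame pixels
    mem_values: (M, C) values carrying past semantic information
    Returns (N, C): past-frame context aggregated per current pixel."""
    scale = mem_keys.shape[1] ** 0.5                    # standard dot-product scaling
    attn = F.softmax(query @ mem_keys.t() / scale, dim=1)
    return attn @ mem_values

# Hypothetical sizes: 900 current pixels, memory holding 1800 past pixels.
q = torch.randn(900, 64)
k, v = torch.randn(1800, 64), torch.randn(1800, 64)
ctx = memory_read(q, k, v)  # (900, 64); fused with current features downstream
```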