Hybrid-S2S: Video Object Segmentation with Recurrent Networks and
Correspondence Matching
- URL: http://arxiv.org/abs/2010.05069v2
- Date: Sat, 7 Nov 2020 09:33:51 GMT
- Title: Hybrid-S2S: Video Object Segmentation with Recurrent Networks and
Correspondence Matching
- Authors: Fatemeh Azimi and Stanislav Frolov and Federico Raue and Joern Hees
and Andreas Dengel
- Abstract summary: One-shot Video Object Segmentation (VOS) is the task of tracking an object of interest within a video sequence.
We study an RNN-based architecture and address some of these issues by proposing a hybrid sequence-to-sequence architecture named HS2S.
Our experiments show that augmenting the RNN with correspondence matching is a highly effective solution to reduce the drift problem.
- Score: 3.9053553775979086
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One-shot Video Object Segmentation (VOS) is the task of pixel-wise tracking
an object of interest within a video sequence, where the segmentation mask of
the first frame is given at inference time. In recent years, Recurrent Neural
Networks (RNNs) have been widely used for VOS tasks, but they often suffer from
limitations such as drift and error propagation. In this work, we study an
RNN-based architecture and address some of these issues by proposing a hybrid
sequence-to-sequence architecture named HS2S, utilizing a dual mask propagation
strategy that allows incorporating the information obtained from correspondence
matching. Our experiments show that augmenting the RNN with correspondence
matching is a highly effective solution to reduce the drift problem. The
additional information helps the model to predict more accurate masks and makes
it robust against error propagation. We evaluate our HS2S model on the
DAVIS2017 dataset as well as YouTube-VOS. On the latter, we achieve an
improvement of 11.2pp in the overall segmentation accuracy over RNN-based
state-of-the-art methods in VOS. We analyze our model's behavior in challenging
cases such as occlusion and long sequences and show that our hybrid
architecture significantly enhances the segmentation quality in these difficult
scenarios.
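The dual mask propagation strategy can be read as two parallel paths fused before decoding: a recurrent path that carries the prediction forward frame by frame, and a matching path that transfers the first-frame mask to the current frame via feature correspondence. Below is a minimal PyTorch sketch of that idea, not the authors' implementation; the stand-in encoder, the simplified recurrent cell, the softmax temperature, and the fusion head are all illustrative assumptions.

```python
# Hedged sketch of a dual-propagation VOS model in the spirit of HS2S.
# All module choices here are assumptions, not the paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvRNNCell(nn.Module):
    """Minimal convolutional RNN cell, standing in for the ConvLSTM
    typically used in sequence-to-sequence VOS models."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Conv2d(2 * dim, dim, 3, padding=1)

    def forward(self, x, h):
        return torch.tanh(self.gate(torch.cat([x, h], dim=1)))

def match_reference(feat_t, feat_ref, mask_ref, tau=0.07):
    """Soft correspondence matching: transfer the reference mask to
    frame t via feature similarity (the second propagation path)."""
    B, C, H, W = feat_t.shape
    q = F.normalize(feat_t.flatten(2), dim=1)                  # (B, C, HW)
    k = F.normalize(feat_ref.flatten(2), dim=1)                # (B, C, HW)
    attn = torch.softmax(q.transpose(1, 2) @ k / tau, dim=-1)  # (B, HW, HW)
    m = mask_ref.flatten(2).transpose(1, 2)                    # (B, HW, 1)
    return (attn @ m).transpose(1, 2).reshape(B, 1, H, W)

class DualPropagationVOS(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.backbone = nn.Conv2d(3, dim, 3, padding=1)  # stand-in encoder
        self.rnn = ConvRNNCell(dim)
        self.head = nn.Conv2d(dim + 1, 1, 3, padding=1)  # fuses both paths

    def forward(self, frames, mask0):
        # frames: (B, T, 3, H, W); mask0: (B, 1, H, W) given at inference.
        feats = [self.backbone(frames[:, t]) for t in range(frames.shape[1])]
        h = torch.zeros_like(feats[0])
        masks = []
        for t in range(1, len(feats)):
            h = self.rnn(feats[t], h)                            # recurrent path
            matched = match_reference(feats[t], feats[0], mask0) # matching path
            masks.append(torch.sigmoid(self.head(torch.cat([h, matched], dim=1))))
        return torch.stack(masks, dim=1)
```

Because the matching path re-anchors every frame to the ground-truth mask of frame 0, errors accumulated in the recurrent hidden state cannot compound unchecked, which is the intuition behind the reduced drift reported above.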
Related papers
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
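The SIGMA summary above describes spreading tube features evenly across a limited number of learnable clusters; that balanced soft assignment is what a Sinkhorn-Knopp projection produces. A generic sketch follows, not the SIGMA code; the temperature eps and the iteration count are illustrative assumptions.

```python
# Generic Sinkhorn-Knopp projection of feature-to-cluster scores onto
# (approximately) balanced soft assignments. Hyperparameters are assumptions.
import torch

@torch.no_grad()
def sinkhorn(scores, eps=0.05, n_iters=3):
    """scores: (N, K) similarities of N tube features to K prototypes.
    Returns (N, K) soft assignments that use every cluster about equally."""
    Q = torch.exp(scores / eps).T   # (K, N)
    Q /= Q.sum()                    # joint distribution over (cluster, feature)
    K, N = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)  # balance clusters: each row sums to 1
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)  # normalize per feature
        Q /= N
    return (Q * N).T                # (N, K): rows are assignment distributions
```

Given tube features f of shape (N, D) and prototypes P of shape (K, D), sinkhorn(f @ P.T) yields assignment targets in which each prototype is used roughly N/K times.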
- Unsupervised Flow-Aligned Sequence-to-Sequence Learning for Video Restoration [85.3323211054274]
How to properly model the inter-frame relation within the video sequence is an important but unsolved challenge for video restoration (VR).
In this work, we propose an unsupervised flow-aligned sequence-to-sequence model (S2SVR) to address this problem.
S2SVR shows superior performance in multiple VR tasks, including video deblurring, video super-resolution, and compressed video quality enhancement.
arXiv Detail & Related papers (2022-05-20T14:14:48Z)
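"Flow-aligned" in S2SVR refers to warping neighboring frames toward a reference frame with estimated optical flow before sequence modeling. The helper below shows the standard backward-warping step such models build on; it is a generic sketch, not code from the paper.

```python
# Standard backward warping of a frame with a dense optical flow field.
import torch
import torch.nn.functional as F

def flow_warp(frame, flow):
    """frame: (B, C, H, W); flow: (B, 2, H, W) in pixels, (dx, dy) order.
    Returns the frame sampled at locations displaced by the flow."""
    B, _, H, W = frame.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(frame.device)  # (2, H, W)
    coords = grid.unsqueeze(0) + flow                             # (B, 2, H, W)
    # Normalize coordinates to [-1, 1] as expected by grid_sample.
    coords_x = 2.0 * coords[:, 0] / (W - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid_n = torch.stack((coords_x, coords_y), dim=-1)            # (B, H, W, 2)
    return F.grid_sample(frame, grid_n, align_corners=True)
```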
- 1st Place Solution for YouTubeVOS Challenge 2021: Video Instance Segmentation [0.39146761527401414]
Video Instance Segmentation (VIS) is a multi-task problem performing detection, segmentation, and tracking simultaneously.
We propose two modules, named Temporally Correlated Instance Segmentation (TCIS) and Bidirectional Tracking (BiTrack).
Combined with a bag of tricks, these modules significantly boost network performance over the baseline.
arXiv Detail & Related papers (2021-06-12T00:20:38Z)
- Scene Understanding for Autonomous Driving [0.0]
We study the behaviour of different configurations of RetinaNet, Faster R-CNN and Mask R-CNN available in Detectron2.
We observe a significant improvement in performance after fine-tuning these models on the datasets of interest.
We run inference in unusual situations using out-of-context datasets and present interesting results.
arXiv Detail & Related papers (2021-05-11T09:50:05Z)
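The fine-tuning study above corresponds to a very short Detectron2 recipe. The sketch below is illustrative rather than the paper's configuration; the dataset name driving_train, the class count, and the solver settings are hypothetical.

```python
# Hedged Detectron2 fine-tuning sketch: start from COCO weights and
# fine-tune Mask R-CNN on a (hypothetical) registered driving dataset.
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")  # COCO-pretrained
cfg.DATASETS.TRAIN = ("driving_train",)  # hypothetical registered dataset
cfg.DATASETS.TEST = ()
cfg.SOLVER.IMS_PER_BATCH = 2             # assumed schedule values
cfg.SOLVER.BASE_LR = 2.5e-4
cfg.SOLVER.MAX_ITER = 5000
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 8      # assumption: 8 driving classes

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```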
- Deep Cellular Recurrent Network for Efficient Analysis of Time-Series Data with Spatial Information [52.635997570873194]
This work proposes a novel deep cellular recurrent neural network (DCRNN) architecture to process complex multi-dimensional time series data with spatial information.
The proposed architecture achieves state-of-the-art performance while using substantially fewer trainable parameters than comparable methods in the literature.
arXiv Detail & Related papers (2021-01-12T20:08:18Z)
- Spatiotemporal Graph Neural Network based Mask Reconstruction for Video Object Segmentation [70.97625552643493]
This paper addresses the task of segmenting class-agnostic objects in a semi-supervised setting.
We propose a novel graph neural network (TG-Net) which captures the local contexts by utilizing all proposals.
arXiv Detail & Related papers (2020-12-10T07:57:44Z)
- Recent Developments Combining Ensemble Smoother and Deep Generative Networks for Facies History Matching [58.720142291102135]
This research project focuses on the use of autoencoder networks to construct a continuous parameterization for facies models.
We benchmark seven different formulations, including VAE, generative adversarial network (GAN), Wasserstein GAN, variational auto-encoding GAN, principal component analysis (PCA) with cycle GAN, PCA with transfer style network and VAE with style loss.
arXiv Detail & Related papers (2020-05-08T21:32:42Z)
- Zero-Shot Video Object Segmentation via Attentive Graph Neural Networks [150.5425122989146]
This work proposes a novel attentive graph neural network (AGNN) for zero-shot video object segmentation (ZVOS).
AGNN builds a fully connected graph to efficiently represent frames as nodes, and relations between arbitrary frame pairs as edges.
Experimental results on three video segmentation datasets show that AGNN sets a new state-of-the-art in each case.
arXiv Detail & Related papers (2020-01-19T10:45:27Z)
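A minimal reading of the AGNN design, with each frame flattened to a single feature vector for brevity: every frame pair gets a learned edge weight, and each node aggregates attention-weighted messages from all other frames before a GRU-style update. The layer below is an illustrative sketch, not the paper's exact architecture, which operates on 2D feature maps.

```python
# Illustrative message passing over a fully connected frame graph,
# in the spirit of AGNN; the layer sizes and edge function are assumptions.
import torch
import torch.nn as nn

class FrameGraphLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.edge = nn.Linear(2 * dim, 1)   # scalar relation per frame pair
        self.msg = nn.Linear(dim, dim)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, h):                   # h: (T, D), one vector per frame
        T, D = h.shape
        pairs = torch.cat([h.unsqueeze(1).expand(T, T, D),
                           h.unsqueeze(0).expand(T, T, D)], dim=-1)
        w = torch.softmax(self.edge(pairs).squeeze(-1), dim=1)  # (T, T) edges
        m = w @ self.msg(h)                 # attention-weighted message sum
        return self.update(m, h)            # GRU-style node update
```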
- Depth-wise Decomposition for Accelerating Separable Convolutions in Efficient Convolutional Neural Networks [36.64158994999578]
Deep convolutional neural networks (CNNs) have been established as the primary methods for many computer vision tasks.
Recently, depth-wise separable convolution has been proposed for image recognition tasks on computationally limited platforms.
We propose a novel decomposition approach based on SVD, namely depth-wise decomposition, for expanding regular convolutions into depthwise separable convolutions.
arXiv Detail & Related papers (2019-10-21T15:37:53Z)
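The SVD-based expansion above has a compact reading: for each input channel, the (C_out x K x K) slice of a regular kernel is approximated by the outer product of a pointwise vector and a single K x K spatial filter, i.e., a rank-1 truncated SVD per channel. A hedged sketch of that reading, not the paper's exact algorithm:

```python
# Rank-1 SVD factorization of a regular conv kernel into depthwise +
# pointwise factors -- a sketch of the idea, not the paper's method.
import torch

def depthwise_decompose(W):
    """W: (C_out, C_in, K, K) regular conv kernel.
    Returns a depthwise kernel (C_in, 1, K, K) and a pointwise
    kernel (C_out, C_in, 1, 1) from a rank-1 SVD per input channel."""
    C_out, C_in, K, _ = W.shape
    depthwise = torch.empty(C_in, 1, K, K)
    pointwise = torch.empty(C_out, C_in, 1, 1)
    for c in range(C_in):
        M = W[:, c].reshape(C_out, K * K)   # kernel slice for input channel c
        U, S, Vh = torch.linalg.svd(M, full_matrices=False)
        depthwise[c, 0] = (Vh[0] * S[0].sqrt()).reshape(K, K)  # spatial factor
        pointwise[:, c, 0, 0] = U[:, 0] * S[0].sqrt()          # channel mixer
    return depthwise, pointwise
```

The two factors then run as F.conv2d(x, depthwise, padding=K//2, groups=C_in) followed by F.conv2d(., pointwise), approximating the original convolution with far fewer multiply-accumulates.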
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.