Make One-Shot Video Object Segmentation Efficient Again
- URL: http://arxiv.org/abs/2012.01866v1
- Date: Thu, 3 Dec 2020 12:21:23 GMT
- Title: Make One-Shot Video Object Segmentation Efficient Again
- Authors: Tim Meinhardt and Laura Leal-Taixe
- Abstract summary: Video object segmentation (VOS) describes the task of segmenting a set of objects in each frame of a video.
e-OSVOS decouples the object detection task and predicts only local segmentation masks by applying a modified version of Mask R-CNN.
e-OSVOS provides state-of-the-art results on DAVIS 2016, DAVIS 2017, and YouTube-VOS for one-shot fine-tuning methods.
- Score: 7.7415390727490445
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video object segmentation (VOS) describes the task of segmenting a set of
objects in each frame of a video. In the semi-supervised setting, the first
mask of each object is provided at test time. Following the one-shot principle,
fine-tuning VOS methods train a segmentation model separately on each given
object mask. However, the VOS community has recently deemed such test-time
optimization and its impact on the test runtime infeasible. To mitigate the
inefficiencies of previous fine-tuning approaches, we present efficient
One-Shot Video Object Segmentation (e-OSVOS). In contrast to most VOS
approaches, e-OSVOS decouples the object detection task and predicts only local
segmentation masks by applying a modified version of Mask R-CNN. The one-shot
test runtime and performance are optimized without a laborious and handcrafted
hyperparameter search. To this end, we meta learn the model initialization and
learning rates for the test time optimization. To achieve optimal learning
behavior, we predict individual learning rates at a neuron level. Furthermore,
we apply an online adaptation to address the common performance degradation
throughout a sequence by continuously fine-tuning the model on previous mask
predictions supported by a frame-to-frame bounding box propagation. e-OSVOS
provides state-of-the-art results on DAVIS 2016, DAVIS 2017, and YouTube-VOS
for one-shot fine-tuning methods while reducing the test runtime substantially.
Code is available at https://github.com/dvl-tum/e-osvos.
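The linked repository contains the actual implementation. As a rough, hypothetical sketch of the central idea, one-shot test-time fine-tuning with meta-learned, neuron-level learning rates could look as follows (module and function names are illustrative, not taken from the repository):
```python
import torch
import torch.nn as nn

# Hypothetical sketch: one-shot fine-tuning with meta-learned, neuron-level
# learning rates. `SegHead` stands in for the mask head of a Mask R-CNN-style
# model; the real architecture and meta-training loop live in the repository.

class SegHead(nn.Module):
    def __init__(self, in_ch=64):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 32, 3, padding=1)
        self.out = nn.Conv2d(32, 1, 1)

    def forward(self, x):
        return self.out(torch.relu(self.conv(x)))

def make_neuron_lrs(model, init_lr=1e-3):
    # One learning rate per output neuron (first weight dimension),
    # broadcast over the remaining dimensions of each parameter.
    return {
        name: torch.full((p.shape[0],) + (1,) * (p.dim() - 1), init_lr)
        for name, p in model.named_parameters()
    }

def inner_step(model, lrs, feats, first_frame_mask):
    # One step of test-time optimization on the given first-frame mask.
    loss = nn.functional.binary_cross_entropy_with_logits(
        model(feats), first_frame_mask)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    with torch.no_grad():
        for (name, p), g in zip(model.named_parameters(), grads):
            p -= lrs[name] * g  # neuron-level learning rates
    return loss.item()

model = SegHead()
lrs = make_neuron_lrs(model)        # meta-learned in the real method
feats = torch.randn(1, 64, 32, 32)  # backbone features (stand-in)
mask = (torch.rand(1, 1, 32, 32) > 0.5).float()
for _ in range(5):                  # a few one-shot fine-tuning steps
    inner_step(model, lrs, feats, mask)
```
In the actual method, the learning-rate tensors and the model initialization are themselves optimized in an outer meta-training loop, so that a few inner steps suffice at test time.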
Related papers
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modeling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
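As a hedged illustration of the balanced-assignment idea (not the paper's code), a Sinkhorn-Knopp iteration can spread feature-to-cluster scores into approximately even soft assignments:
```python
import torch

def sinkhorn(scores, n_iters=3, eps=0.05):
    # Sinkhorn-Knopp sketch: turn a (num_features x num_clusters) score
    # matrix into soft assignments whose cluster totals are roughly uniform,
    # i.e. features are distributed evenly across clusters.
    q = torch.exp(scores / eps)
    for _ in range(n_iters):
        q /= q.sum(dim=0, keepdim=True)  # normalize over features
        q /= q.shape[1]
        q /= q.sum(dim=1, keepdim=True)  # normalize over clusters
        q /= q.shape[0]
    return q * q.shape[0]                # each row sums to ~1

feats = torch.nn.functional.normalize(torch.randn(512, 64), dim=1)
prototypes = torch.nn.functional.normalize(torch.randn(16, 64), dim=1)
assignments = sinkhorn(feats @ prototypes.T)
print(assignments.sum(dim=0))  # roughly 512 / 16 = 32 features per cluster
```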
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
- 1st Place Solution for 5th LSVOS Challenge: Referring Video Object Segmentation [65.45702890457046]
We integrate the strengths of leading RVOS models to build an effective paradigm.
To improve the consistency and quality of masks, we propose a Two-Stage Multi-Model Fusion strategy.
Our method achieves 75.7% J&F on the Ref-YouTube-VOS validation set and 70% J&F on the test set, ranking 1st in track 3 of the 5th Large-scale Video Object Segmentation Challenge (ICCV 2023).
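The exact two-stage strategy is not detailed here; purely as a generic sketch of multi-model mask fusion (the weighting and threshold are assumptions):
```python
import torch

def fuse_masks(prob_maps, weights=None, thresh=0.5):
    # Generic multi-model mask fusion: weighted average of per-model
    # foreground probabilities, then thresholding. The paper's two-stage
    # strategy is more involved; this only illustrates the fusion idea.
    stack = torch.stack(prob_maps)                 # (num_models, H, W)
    if weights is None:
        weights = torch.ones(stack.shape[0])
    w = (weights / weights.sum()).view(-1, 1, 1)
    return ((stack * w).sum(dim=0) > thresh).float()

masks = [torch.rand(48, 48) for _ in range(3)]     # three model outputs
fused = fuse_masks(masks, weights=torch.tensor([2.0, 1.0, 1.0]))
```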
arXiv Detail & Related papers (2024-01-01T04:24:48Z)
- Box Supervised Video Segmentation Proposal Network [3.384080569028146]
We propose a box-supervised video object segmentation proposal network, which takes advantage of intrinsic video properties.
The proposed method outperforms the state-of-the-art self-supervised methods by 16.4% and 6.9% on the evaluated benchmarks.
We provide extensive tests and ablations on the datasets, demonstrating the robustness of our method.
arXiv Detail & Related papers (2022-02-14T20:38:28Z)
- FAMINet: Learning Real-time Semi-supervised Video Object Segmentation with Steepest Optimized Optical Flow [21.45623125216448]
Semi-supervised video object segmentation (VOS) aims to segment a few moving objects in a video sequence, where these objects are specified by the annotation of the first frame.
The optical flow has been considered in many existing semi-supervised VOS methods to improve the segmentation accuracy.
To address this problem, this study proposes FAMINet, which consists of a feature extraction network (F), an appearance network (A), a motion network (M), and an integration network (I).
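A minimal sketch of how such a four-network composition might be wired (all modules are stand-ins; the steepest-descent optimization of the optical flow is omitted):
```python
import torch
import torch.nn as nn

# Sketch of a FAMINet-style composition: feature extraction (F),
# appearance (A), motion (M), and integration (I) networks. All modules
# here are stand-ins, not the paper's architecture.

class FAMINetSketch(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        self.F = nn.Conv2d(3, ch, 3, padding=1)   # shared features
        self.A = nn.Conv2d(ch, 1, 1)              # appearance score
        self.M = nn.Conv2d(2 * ch, 1, 1)          # motion score
        self.I = nn.Conv2d(2, 1, 1)               # integration

    def forward(self, frame, prev_frame):
        f_cur, f_prev = self.F(frame), self.F(prev_frame)
        a = self.A(f_cur)
        m = self.M(torch.cat([f_cur, f_prev], dim=1))
        return torch.sigmoid(self.I(torch.cat([a, m], dim=1)))

net = FAMINetSketch()
mask = net(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
```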
arXiv Detail & Related papers (2021-11-20T07:24:33Z)
- Learning Dynamic Network Using a Reuse Gate Function in Semi-supervised Video Object Segmentation [27.559093073097483]
Current approaches for semi-supervised video object segmentation (Semi-VOS) propagate information from previous frames to generate a segmentation mask for the current frame.
We exploit this observation by using temporal information to quickly identify frames with minimal change.
We propose a novel dynamic network that estimates the change across frames and decides which path to take: running the full network or reusing the previous frame's features.
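A minimal sketch of such a gate, assuming a cheap frame-difference change estimate in place of the learned gate function:
```python
import torch
import torch.nn as nn

class ReuseGate(nn.Module):
    # Sketch of a gate that estimates frame-to-frame change and decides
    # whether to run the full segmentation network or reuse cached results.
    # The change estimator and threshold are illustrative assumptions.
    def __init__(self, thresh=0.1):
        super().__init__()
        self.thresh = thresh

    def forward(self, frame, prev_frame):
        change = (frame - prev_frame).abs().mean()  # cheap change estimate
        return change > self.thresh                 # True -> full network

gate = ReuseGate()
full_net = nn.Conv2d(3, 1, 3, padding=1)  # stand-in for the heavy path
for t in range(5):
    frame = torch.rand(1, 3, 64, 64)
    if t == 0 or gate(frame, prev):
        cached = full_net(frame)          # expensive path: recompute
    out = cached                          # cheap path reuses the cache
    prev = frame
```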
arXiv Detail & Related papers (2020-12-21T19:40:17Z)
- Spatiotemporal Graph Neural Network based Mask Reconstruction for Video Object Segmentation [70.97625552643493]
This paper addresses the task of segmenting class-agnostic objects in the semi-supervised setting.
We propose a novel graph neural network (TG-Net) which captures local contexts by utilizing all proposals.
arXiv Detail & Related papers (2020-12-10T07:57:44Z)
- Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object (VOS)
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance in both speed and accuracy on the DAVIS benchmark without complicated bells and whistles, with a speed of 0.14 seconds per frame and a J&F measure of 75.9%.
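As a generic illustration of the matching step (not the paper's time-evolving template mechanism), per-pixel cosine similarity against a small template bank might look like this:
```python
import torch
import torch.nn.functional as F

def match_template(feat_map, templates):
    # Cosine-similarity matching between pixel embeddings and a bank of
    # target templates; a time-evolving bank would append new templates as
    # the sequence progresses. Purely illustrative of the matching step.
    f = F.normalize(feat_map.flatten(1), dim=0)  # (C, H*W) pixel embeddings
    t = F.normalize(templates, dim=1)            # (K, C) template bank
    sim = t @ f                                  # (K, H*W) similarities
    return sim.max(dim=0).values.view(feat_map.shape[1:])

feat_map = torch.randn(32, 24, 24)  # (C, H, W) frame features (stand-in)
bank = torch.randn(5, 32)           # 5 target templates
score = match_template(feat_map, bank)  # per-pixel target score map
```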
arXiv Detail & Related papers (2020-07-11T05:44:16Z)
- Learning What to Learn for Video Object Segmentation [157.4154825304324]
We introduce an end-to-end trainable VOS architecture that integrates a differentiable few-shot learning module.
This internal learner is designed to predict a powerful parametric model of the target.
We set a new state-of-the-art on the large-scale YouTube-VOS 2018 dataset by achieving an overall score of 81.5.
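One common way to make such an internal learner differentiable is a closed-form least-squares fit; the sketch below is a generic illustration under that assumption, not the paper's optimizer:
```python
import torch

def ridge_learner(feats, labels, lam=1e-2):
    # Differentiable internal learner (sketch): fit a linear target model w
    # to first-frame features and labels in closed form (ridge regression),
    # so gradients can flow through the solve during offline training.
    # Solves (X^T X + lam * I) w = X^T y.
    X = feats.flatten(1).T              # (H*W, C) pixel features
    y = labels.flatten().unsqueeze(1)   # (H*W, 1) pixel labels
    A = X.T @ X + lam * torch.eye(X.shape[1])
    return torch.linalg.solve(A, X.T @ y)  # (C, 1) target model

feats = torch.randn(32, 24, 24, requires_grad=True)
labels = (torch.rand(24, 24) > 0.5).float()
w = ridge_learner(feats, labels)
scores = (feats.flatten(1).T @ w).view(24, 24)  # apply the learned model
```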
arXiv Detail & Related papers (2020-03-25T17:58:43Z)
- Learning Fast and Robust Target Models for Video Object Segmentation [83.3382606349118]
Video object segmentation (VOS) is a highly challenging problem since the initial mask, defining the target object, is only given at test-time.
Most previous approaches fine-tune segmentation networks on the first frame, resulting in impractical frame-rates and risk of overfitting.
We propose a novel VOS architecture consisting of two network components.
arXiv Detail & Related papers (2020-02-27T21:58:06Z)
- Directional Deep Embedding and Appearance Learning for Fast Video Object Segmentation [11.10636117512819]
We propose a directional deep embedding and appearance learning (DEmbed) method, which is free of the online fine-tuning process.
Our method achieves a state-of-the-art VOS performance without using online fine-tuning.
arXiv Detail & Related papers (2020-02-17T01:51:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.