Learning Video Object Segmentation from Unlabeled Videos
- URL: http://arxiv.org/abs/2003.05020v1
- Date: Tue, 10 Mar 2020 22:12:15 GMT
- Title: Learning Video Object Segmentation from Unlabeled Videos
- Authors: Xiankai Lu, Wenguan Wang, Jianbing Shen, Yu-Wing Tai, David Crandall,
and Steven C. H. Hoi
- Abstract summary: We propose a new method for video object segmentation (VOS) that addresses object pattern learning from unlabeled videos.
We introduce a unified unsupervised/weakly supervised learning framework, called MuG, that comprehensively captures properties of VOS at multiple granularities.
- Score: 158.18207922363783
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a new method for video object segmentation (VOS) that addresses
object pattern learning from unlabeled videos, unlike most existing methods
which rely heavily on extensive annotated data. We introduce a unified
unsupervised/weakly supervised learning framework, called MuG, that
comprehensively captures intrinsic properties of VOS at multiple granularities.
Our approach can help advance understanding of visual patterns in VOS and
significantly reduce the annotation burden. With a carefully designed architecture
and strong representation learning ability, our learned model can be applied to
diverse VOS settings, including object-level zero-shot VOS, instance-level
zero-shot VOS, and one-shot VOS. Experiments demonstrate promising performance
in these settings, as well as the potential of MuG in leveraging unlabeled data
to further improve the segmentation accuracy.
Related papers
- Point-VOS: Pointing Up Video Object Segmentation [16.359861197595986]
Current state-of-the-art Video Object Segmentation (VOS) methods rely on dense per-object mask annotations both during training and testing.
We propose a novel Point-VOS task with a spatio-temporally sparse point-wise annotation scheme that substantially reduces the annotation effort.
We show that our data can be used to improve models that connect vision and language, by evaluating it on the Video Narrative Grounding (VNG) task.
arXiv Detail & Related papers (2024-02-08T18:52:23Z)
- Learning the What and How of Annotation in Video Object Segmentation [11.012995995497029]
Video Object Segmentation (VOS) is crucial for several applications, from video editing to video data generation.
Traditional way of annotating objects requires humans to draw detailed segmentation masks on the target objects at each video frame.
We propose EVA-VOS, a human-in-the-loop annotation framework for video object segmentation.
arXiv Detail & Related papers (2023-11-08T00:56:31Z) - Learning Referring Video Object Segmentation from Weak Annotation [78.45828085350936]
Referring video object segmentation (RVOS) is a task that aims to segment the target object in all video frames based on a sentence describing the object.
We propose a new annotation scheme that reduces the annotation effort by 8 times, while providing sufficient supervision for RVOS.
Our scheme only requires a mask for the frame where the object first appears and bounding boxes for the rest of the frames.
arXiv Detail & Related papers (2023-08-04T06:50:52Z) - Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- Self-Supervised Video Representation Learning with Motion-Contrastive Perception [13.860736711747284]
We propose the Motion-Contrastive Perception Network (MCPNet), which consists of two branches: Motion Information Perception (MIP) and Contrastive Instance Perception (CIP).
Our method outperforms current state-of-the-art visual-only self-supervised approaches.
arXiv Detail & Related papers (2022-04-10T05:34:46Z)
- In-N-Out Generative Learning for Dense Unsupervised Video Segmentation [89.21483504654282]
In this paper, we focus on the unsupervised Video Object Segmentation (VOS) task, which learns visual correspondence from unlabeled videos.
We propose the In-aNd-Out (INO) generative learning from a purely generative perspective, which captures both high-level and fine-grained semantics.
Our INO outperforms previous state-of-the-art methods by significant margins.
arXiv Detail & Related papers (2022-03-29T07:56:21Z)
- Reducing the Annotation Effort for Video Object Segmentation Datasets [50.893073670389164]
Densely labeling every frame with pixel masks does not scale to large datasets.
We use a deep convolutional network to automatically create pseudo-labels on a pixel level from much cheaper bounding box annotations.
We obtain the new TAO-VOS benchmark, which we make publicly available at www.vision.rwth-aachen.de/page/taovos.
arXiv Detail & Related papers (2020-11-02T17:34:45Z)
- Learning What to Learn for Video Object Segmentation [157.4154825304324]
We introduce an end-to-end trainable VOS architecture that integrates a differentiable few-shot learning module.
This internal learner is designed to predict a powerful parametric model of the target.
We set a new state-of-the-art on the large-scale YouTube-VOS 2018 dataset by achieving an overall score of 81.5.
arXiv Detail & Related papers (2020-03-25T17:58:43Z)
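The "Learning What to Learn" entry above describes an internal few-shot learner that predicts a parametric model of the target from the first annotated frame. As a rough illustration only (not the paper's actual architecture, which uses a differentiable optimization module inside a deep network), a hypothetical linear version of such a target model can be fit by ridge regression on first-frame features and the given mask, then applied to later frames:

```python
import numpy as np

def fit_target_model(feats, mask, reg=1.0):
    """Toy 'internal learner': fit linear weights w so that feats @ w
    approximates the first-frame mask (a simplified, illustrative stand-in
    for the few-shot target-model learner described in the paper)."""
    # feats: (num_pixels, d) per-pixel feature vectors; mask: (num_pixels,) in {0, 1}
    d = feats.shape[1]
    A = feats.T @ feats + reg * np.eye(d)  # ridge-regularized normal equations
    return np.linalg.solve(A, feats.T @ mask)

def predict_mask(feats, w, thresh=0.5):
    """Apply the learned parametric target model to a new frame's features."""
    return (feats @ w > thresh).astype(np.uint8)

# Toy example: 100 "pixels" with 2-D features; the first 50 are the target
# and are shifted along feature dimension 0.
rng = np.random.default_rng(0)
target = (np.arange(100) < 50).astype(float)
f0 = rng.normal(size=(100, 2)) + np.array([2.0, 0.0]) * target[:, None]
w = fit_target_model(f0, target)
pred = predict_mask(f0, w)
```

In the actual method the learner is differentiable and trained end-to-end, so the features themselves are optimized to make this inner fitting step effective; the sketch only conveys the "fit a parametric target model from one frame, reuse it on the rest" idea.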
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the automatically generated information and is not responsible for any consequences arising from its use.