Two-shot Video Object Segmentation
- URL: http://arxiv.org/abs/2303.12078v1
- Date: Tue, 21 Mar 2023 17:59:56 GMT
- Title: Two-shot Video Object Segmentation
- Authors: Kun Yan, Xiao Li, Fangyun Wei, Jinglu Wang, Chenbin Zhang, Ping Wang,
Yan Lu
- Abstract summary: We train a video object segmentation model on sparsely annotated videos.
We generate pseudo labels for unlabeled frames and optimize the model on the combination of labeled and pseudo-labeled data.
For the first time, we present a general way to train VOS models on two-shot VOS datasets.
- Score: 35.48207692959968
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Previous works on video object segmentation (VOS) are trained on
densely annotated videos. Nevertheless, acquiring pixel-level annotations is
expensive and time-consuming. In this work, we demonstrate the feasibility of
training a satisfactory VOS model on sparsely annotated videos: we require only
two labeled frames per training video while performance is sustained. We term
this novel training paradigm two-shot video object segmentation, or two-shot
VOS for short. The underlying idea is to generate pseudo labels for unlabeled
frames during training and to optimize the model on the combination of labeled
and pseudo-labeled data. Our approach is extremely simple and can be applied to
the majority of existing frameworks. We first pre-train a VOS model on sparsely
annotated videos in a semi-supervised manner, with the first frame always being
a labeled one. Then, we adopt the pre-trained VOS model to generate pseudo
labels for all unlabeled frames, which are subsequently stored in a
pseudo-label bank. Finally, we retrain a VOS model on both labeled and
pseudo-labeled data without any restrictions on the first frame. For the first
time, we present a general way to train VOS models on two-shot VOS datasets.
Using only 7.3% and 2.9% of the labeled data of the YouTube-VOS and DAVIS
benchmarks, our approach achieves results comparable to counterparts trained on
the fully labeled sets. Code and models are available at
https://github.com/yk-pku/Two-shot-Video-Object-Segmentation.
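To make the training recipe above concrete, the following is a minimal PyTorch
sketch of the pipeline: semi-supervised pre-training with a labeled reference
frame, building a pseudo-label bank for the unlabeled frames, and retraining on
the combined data. The toy network, tensor shapes, binarization threshold, and
loop lengths are illustrative assumptions standing in for a real memory-based
VOS model and dataset; only the overall flow follows the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVOSModel(nn.Module):
    """Stand-in for a real VOS model: predicts a mask for a query frame
    given a reference frame and its mask."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(7, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, ref_img, ref_mask, query_img):
        x = torch.cat([ref_img, ref_mask, query_img], dim=1)
        return self.net(x)  # mask logits for the query frame

def train_step(model, opt, ref_img, ref_mask, query_img, query_mask):
    loss = F.binary_cross_entropy_with_logits(
        model(ref_img, ref_mask, query_img), query_mask)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# A toy "two-shot" video: only frames 0 and 5 carry ground-truth masks.
T, C, H, W = 8, 3, 64, 64
video = torch.rand(T, C, H, W)
labeled = {0: (torch.rand(1, H, W) > 0.5).float(),
           5: (torch.rand(1, H, W) > 0.5).float()}

# Step 1: semi-supervised pre-training; the reference is always a labeled frame.
model = TinyVOSModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(5):
    train_step(model, opt,
               video[0:1], labeled[0][None], video[5:6], labeled[5][None])

# Step 2: run the pre-trained model on every unlabeled frame and store the
# binarized predictions, together with the real labels, in a pseudo-label bank.
pseudo_bank = dict(labeled)
with torch.no_grad():
    for t in range(T):
        if t not in labeled:
            logits = model(video[0:1], labeled[0][None], video[t:t + 1])
            pseudo_bank[t] = (torch.sigmoid(logits)[0] > 0.5).float()

# Step 3: retrain from scratch on labeled + pseudo-labeled frames, with no
# restriction on which frame serves as the reference.
model2 = TinyVOSModel()
opt2 = torch.optim.Adam(model2.parameters(), lr=1e-3)
for _ in range(5):
    ref_t, tgt_t = torch.randint(0, T, (2,)).tolist()
    train_step(model2, opt2,
               video[ref_t:ref_t + 1], pseudo_bank[ref_t][None],
               video[tgt_t:tgt_t + 1], pseudo_bank[tgt_t][None])
```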
Related papers
- One-shot Training for Video Object Segmentation [11.52321103793505]
Video Object Segmentation (VOS) aims to track objects across frames in a video and segment them based on the initial annotated frame of the target objects.
Previous VOS works typically rely on fully annotated videos for training.
We propose a general one-shot training framework for VOS, requiring only a single labeled frame per training video.
arXiv Detail & Related papers (2024-05-22T21:37:08Z)
- Multi-View Video-Based Learning: Leveraging Weak Labels for Frame-Level Perception [1.5741307755393597]
We propose a novel learning framework to train a video-based action recognition model with weak labels for frame-level perception.
For training the model using the weak labels, we propose a novel latent loss function.
We also propose a model that uses the view-specific latent embeddings for downstream frame-level action recognition and detection tasks.
arXiv Detail & Related papers (2024-03-18T09:47:41Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion.
We expose two limitations of the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
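For readers unfamiliar with the paradigm named in the entry above, the following
is a generic sketch of shallow temporal fusion: each frame is encoded
independently by a frozen image tower, and the frame embeddings are pooled into
a single video embedding that can be matched against a text embedding. The
stand-in encoder, pooling choice, and dimensions are assumptions made for
illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenImageEncoder(nn.Module):
    """Stand-in for a pretrained image tower (e.g. a CLIP-style ViT)."""
    def __init__(self, dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim),
        )
        for p in self.parameters():      # frozen: no gradients flow here
            p.requires_grad_(False)

    def forward(self, frames):           # (B*T, 3, H, W) -> (B*T, dim)
        return self.backbone(frames)

def encode_video(encoder, video, pool="mean"):
    """video: (B, T, 3, H, W) -> (B, dim) via per-frame encoding + pooling."""
    B, T = video.shape[:2]
    frame_emb = encoder(video.flatten(0, 1)).view(B, T, -1)
    # "Shallow" fusion: no cross-frame attention, just a cheap temporal pool.
    # Memory still grows linearly with T, which is the frame-count bottleneck
    # the entry above points out.
    return frame_emb.mean(dim=1) if pool == "mean" else frame_emb.max(dim=1).values

encoder = FrozenImageEncoder()
video = torch.rand(2, 16, 3, 112, 112)           # batch of 2 videos, 16 frames
text_emb = torch.rand(2, 256)                    # stand-in text-tower output
video_emb = encode_video(encoder, video)
sim = F.cosine_similarity(video_emb, text_emb)   # video-text alignment scores
print(sim.shape)                                 # torch.Size([2])
```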
- Learning the What and How of Annotation in Video Object Segmentation [11.012995995497029]
Video Object Segmentation (VOS) is crucial for several applications, from video editing to video data generation.
The traditional way of annotating objects requires humans to draw detailed segmentation masks on the target objects in each video frame.
We propose EVA-VOS, a human-in-the-loop annotation framework for video object segmentation.
arXiv Detail & Related papers (2023-11-08T00:56:31Z)
- Self-supervised and Weakly Supervised Contrastive Learning for Frame-wise Action Representations [26.09611987412578]
We introduce a new framework of contrastive action representation learning (CARL) to learn frame-wise action representation in a self-supervised or weakly-supervised manner.
Specifically, we introduce a simple but effective video encoder that considers both spatial and temporal context.
Our method outperforms the previous state of the art by a large margin on downstream fine-grained action classification while offering faster inference.
arXiv Detail & Related papers (2022-12-06T16:42:22Z)
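As a rough illustration of frame-wise contrastive learning as described in the
entry above, the sketch below encodes two augmented views of a video per frame
and pulls corresponding frame embeddings together with a plain InfoNCE loss,
with all other frames in the batch acting as negatives. This is a generic
formulation for illustration, not CARL's exact encoder or loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameEncoder(nn.Module):
    """Toy spatio-temporal encoder: per-frame CNN features plus a temporal
    convolution so each frame embedding sees its neighbouring frames."""
    def __init__(self, dim=128):
        super().__init__()
        self.spatial = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim),
        )
        self.temporal = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

    def forward(self, video):                     # (B, T, 3, H, W)
        B, T = video.shape[:2]
        f = self.spatial(video.flatten(0, 1)).view(B, T, -1)
        f = self.temporal(f.transpose(1, 2)).transpose(1, 2)
        return F.normalize(f, dim=-1)             # (B, T, dim), unit norm

def frame_infonce(z1, z2, temperature=0.1):
    """z1, z2: (B, T, D) embeddings of two views; frame t of a video in view 1
    should match frame t of the same video in view 2."""
    B, T, D = z1.shape
    a, b = z1.reshape(B * T, D), z2.reshape(B * T, D)
    logits = a @ b.t() / temperature              # (B*T, B*T) similarities
    targets = torch.arange(B * T)                 # diagonal = positive pairs
    return F.cross_entropy(logits, targets)

encoder = FrameEncoder()
video = torch.rand(4, 8, 3, 64, 64)
view1 = video + 0.05 * torch.randn_like(video)    # stand-in augmentations
view2 = video.flip(-1) + 0.05 * torch.randn_like(video)
loss = frame_infonce(encoder(view1), encoder(view2))
loss.backward()                                   # trainable end to end
print(float(loss))
```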
- Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training pave the way for a new route for visual recognition tasks.
We present Efficient Video Learning, an efficient framework for directly training high-quality video recognition models with frozen CLIP features.
arXiv Detail & Related papers (2022-08-06T17:38:25Z)
- Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification with negligible computational overhead.
arXiv Detail & Related papers (2021-04-02T18:59:09Z)
- Reducing the Annotation Effort for Video Object Segmentation Datasets [50.893073670389164]
Densely labeling every frame with pixel masks does not scale to large datasets.
We use a deep convolutional network to automatically create pseudo-labels on a pixel level from much cheaper bounding box annotations.
We obtain the new TAO-VOS benchmark, which we make publicly available at www.vision.rwth-aachen.de/page/taovos.
arXiv Detail & Related papers (2020-11-02T17:34:45Z)
- Learning Video Object Segmentation from Unlabeled Videos [158.18207922363783]
We propose a new method for video object segmentation (VOS) that addresses object pattern learning from unlabeled videos.
We introduce a unified unsupervised/weakly supervised learning framework, called MuG, that comprehensively captures properties of VOS at multiple granularities.
arXiv Detail & Related papers (2020-03-10T22:12:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.