Reducing the Annotation Effort for Video Object Segmentation Datasets
- URL: http://arxiv.org/abs/2011.01142v1
- Date: Mon, 2 Nov 2020 17:34:45 GMT
- Title: Reducing the Annotation Effort for Video Object Segmentation Datasets
- Authors: Paul Voigtlaender and Lishu Luo and Chun Yuan and Yong Jiang and
Bastian Leibe
- Abstract summary: Densely labeling every frame with pixel masks does not scale to large datasets.
We use a deep convolutional network to automatically create pseudo-labels on a pixel level from much cheaper bounding box annotations.
We obtain the new TAO-VOS benchmark, which we make publicly available at www.vision.rwth-aachen.de/page/taovos.
- Score: 50.893073670389164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For further progress in video object segmentation (VOS), larger, more
diverse, and more challenging datasets will be necessary. However, densely
labeling every frame with pixel masks does not scale to large datasets. We use
a deep convolutional network to automatically create pseudo-labels on a pixel
level from much cheaper bounding box annotations and investigate how far such
pseudo-labels can carry us for training state-of-the-art VOS approaches. A very
encouraging result of our study is that adding a manually annotated mask in
only a single video frame for each object is sufficient to generate
pseudo-labels which can be used to train a VOS method to reach almost the same
performance level as when training with fully segmented videos. We use this
workflow to create pixel pseudo-labels for the training set of the challenging
tracking dataset TAO, and we manually annotate a subset of the validation set.
Together, we obtain the new TAO-VOS benchmark, which we make publicly available
at www.vision.rwth-aachen.de/page/taovos. While the performance of
state-of-the-art methods on existing datasets starts to saturate, TAO-VOS
remains very challenging for current algorithms and reveals their shortcomings.
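To make the box-to-mask step concrete, the sketch below shows one plausible way to turn box annotations into pixel pseudo-labels with a convolutional network, as the abstract describes: crop each box, predict a mask for the crop, and paste the thresholded prediction back into a full-resolution mask. This is a minimal illustration; the tiny encoder-decoder, the 256x256 crop size, and the 0.5 threshold are assumptions, not the paper's actual network or training setup.

```python
# Sketch: pixel pseudo-labels from bounding-box annotations.
# Hypothetical toy network; the paper's actual model differs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoxToMaskNet(nn.Module):
    """Predicts foreground-mask logits for an image crop around one box."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Conv2d(64, 1, 1)  # one-channel mask logits

    def forward(self, crop):  # crop: (B, 3, H, W)
        logits = self.decoder(self.encoder(crop))
        # upsample the logits back to the crop resolution
        return F.interpolate(logits, size=crop.shape[-2:],
                             mode="bilinear", align_corners=False)

def pseudo_labels_for_frame(net, frame, boxes, crop_size=256):
    """frame: (3, H, W) tensor; boxes: list of (x1, y1, x2, y2) ints.
    Returns one binary full-resolution pseudo-mask per box."""
    masks = []
    for (x1, y1, x2, y2) in boxes:
        crop = frame[:, y1:y2, x1:x2].unsqueeze(0)
        crop = F.interpolate(crop, size=(crop_size, crop_size),
                             mode="bilinear", align_corners=False)
        with torch.no_grad():
            logits = net(crop)
        # resize the prediction to the box and paste it into an empty mask
        box_logits = F.interpolate(logits, size=(y2 - y1, x2 - x1),
                                   mode="bilinear", align_corners=False)
        full = torch.zeros(frame.shape[-2:])
        full[y1:y2, x1:x2] = (box_logits.sigmoid()[0, 0] > 0.5).float()
        masks.append(full)
    return masks
```

As the abstract notes, pairing such a pipeline with a single manually annotated mask per object already yields pseudo-labels good enough to nearly match training on fully segmented videos.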
Related papers
- Learning Referring Video Object Segmentation from Weak Annotation [78.45828085350936]
Referring video object segmentation (RVOS) is a task that aims to segment the target object in all video frames based on a sentence describing the object.
We propose a new annotation scheme that reduces the annotation effort by 8 times, while providing sufficient supervision for RVOS.
Our scheme only requires a mask for the frame where the object first appears and bounding boxes for the rest of the frames.
arXiv Detail & Related papers (2023-08-04T06:50:52Z)
- Two-shot Video Object Segmentation [35.48207692959968]
We train a video object segmentation model on sparsely annotated videos.
We generate pseudo labels for unlabeled frames and optimize the model on the combination of labeled and pseudo-labeled data.
For the first time, we present a general way to train VOS models on two-shot VOS datasets (see the self-training sketch after this list).
arXiv Detail & Related papers (2023-03-21T17:59:56Z)
- Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in Videos [159.02703673838639]
We introduce a method for generating segmentation masks from per-frame bounding box annotations in videos.
We use our resulting accurate masks for weakly supervised training of video object segmentation (VOS) networks.
The additional data provides substantially better generalization performance, leading to state-of-the-art results in both the VOS domain and the more challenging tracking domain.
arXiv Detail & Related papers (2021-01-06T18:56:24Z)
- Semantics through Time: Semi-supervised Segmentation of Aerial Videos with Iterative Label Propagation [16.478668565965243]
This paper makes an important step towards automatic annotation by introducing SegProp.
SegProp is a novel iterative flow-based method, with a direct connection to spectral clustering in space and time.
We introduce Ruralscapes, a new dataset with high resolution (4K) images and manually-annotated dense labels every 50 frames.
SegProp automatically annotates the remaining unlabeled 98% of frames with an accuracy exceeding 90% (see the propagation sketch after this list).
arXiv Detail & Related papers (2020-10-02T15:15:50Z)
- Labelling unlabelled videos from scratch with multi-modal self-supervision [82.60652426371936]
Unsupervised labelling of a video dataset does not come for free from strong feature encoders.
We propose a novel clustering method that allows pseudo-labelling of a video dataset without any human annotations.
An extensive analysis shows that the resulting clusters have high semantic overlap to ground truth human labels.
arXiv Detail & Related papers (2020-06-24T12:28:17Z)
- Naive-Student: Leveraging Semi-Supervised Learning in Video Sequences for Urban Scene Segmentation [57.68890534164427]
In this work, we ask whether we can leverage semi-supervised learning on unlabeled video sequences and extra images to improve performance on urban scene segmentation.
We simply predict pseudo-labels for the unlabeled data and train subsequent models with both human-annotated and pseudo-labeled data.
Our Naive-Student model, trained with this simple yet effective iterative semi-supervised scheme (see the self-training sketch after this list), attains state-of-the-art results on all three Cityscapes benchmarks.
arXiv Detail & Related papers (2020-05-20T18:00:05Z)
- Learning Video Object Segmentation from Unlabeled Videos [158.18207922363783]
We propose a new method for video object segmentation (VOS) that addresses object pattern learning from unlabeled videos.
We introduce a unified unsupervised/weakly supervised learning framework, called MuG, that comprehensively captures properties of VOS at multiple granularities.
arXiv Detail & Related papers (2020-03-10T22:12:15Z)
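Two entries above, Two-shot Video Object Segmentation and Naive-Student, describe the same basic pseudo-labeling loop: train on the sparse human labels, predict labels for the unlabeled frames, and retrain on the union, possibly for several rounds. A minimal sketch of that loop follows; the train/predict callables, the confidence score, and the threshold are illustrative placeholders, not APIs from either paper.

```python
# Generic pseudo-label self-training loop, as summarized in the
# Two-shot VOS and Naive-Student entries above. The train/predict
# callables and the confidence threshold are hypothetical placeholders.
from typing import Callable, List, Tuple, TypeVar

Frame = TypeVar("Frame")
Mask = TypeVar("Mask")

def self_training(
    train: Callable[[List[Tuple[Frame, Mask]]], None],  # fits the model in place
    predict: Callable[[Frame], Tuple[Mask, float]],     # returns (mask, confidence)
    labeled: List[Tuple[Frame, Mask]],
    unlabeled: List[Frame],
    rounds: int = 3,
    threshold: float = 0.9,
) -> None:
    train(labeled)                          # 1. fit on the human labels
    for _ in range(rounds):
        pseudo = []
        for frame in unlabeled:             # 2. pseudo-label unlabeled frames
            mask, confidence = predict(frame)
            if confidence >= threshold:     #    keep only confident predictions
                pseudo.append((frame, mask))
        train(labeled + pseudo)             # 3. retrain on the union
```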
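SegProp's core idea, propagating sparse manual labels through the unlabeled frames in between, can be pictured as forward-warping a labeled mask with dense optical flow. The sketch below shows only that warping step; the actual method iterates, propagates from both neighboring labeled frames, and connects to spectral clustering, as the summary notes.

```python
# Simplified flow-based label propagation in the spirit of SegProp.
# Real SegProp iterates and aggregates propagations from both
# surrounding labeled frames; this shows only one forward pass.
import numpy as np

def warp_mask(mask: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Forward-warp a binary mask (H, W) one frame ahead using a dense
    optical-flow field (H, W, 2) of (dx, dy) pixel displacements."""
    h, w = mask.shape
    out = np.zeros_like(mask)
    ys, xs = np.nonzero(mask)
    nx = np.clip((xs + flow[ys, xs, 0]).round().astype(int), 0, w - 1)
    ny = np.clip((ys + flow[ys, xs, 1]).round().astype(int), 0, h - 1)
    out[ny, nx] = 1
    return out

def propagate(first_mask, flows):
    """first_mask: mask for the annotated frame; flows[t]: flow from
    frame t to t+1. Returns a pseudo-mask for every following frame."""
    masks = [first_mask]
    for flow in flows:
        masks.append(warp_mask(masks[-1], flow))
    return masks
```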
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the list (including all information) and is not responsible for any consequences.