Naive-Student: Leveraging Semi-Supervised Learning in Video Sequences
for Urban Scene Segmentation
- URL: http://arxiv.org/abs/2005.10266v4
- Date: Mon, 20 Jul 2020 03:40:38 GMT
- Title: Naive-Student: Leveraging Semi-Supervised Learning in Video Sequences
for Urban Scene Segmentation
- Authors: Liang-Chieh Chen, Raphael Gontijo Lopes, Bowen Cheng, Maxwell D.
Collins, Ekin D. Cubuk, Barret Zoph, Hartwig Adam, Jonathon Shlens
- Abstract summary: In this work, we ask if we may leverage semi-supervised learning in unlabeled video sequences and extra images to improve the performance on urban scene segmentation.
We simply predict pseudo-labels for the unlabeled data and train subsequent models with both human-annotated and pseudo-labeled data.
Our Naive-Student model, trained with such simple yet effective iterative semi-supervised learning, attains state-of-the-art results on all three Cityscapes benchmarks.
- Score: 57.68890534164427
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Supervised learning in large discriminative models is a mainstay for modern
computer vision. Such an approach necessitates investing in large-scale
human-annotated datasets for achieving state-of-the-art results. In turn, the
efficacy of supervised learning may be limited by the size of the human
annotated dataset. This limitation is particularly notable for image
segmentation tasks, where the expense of human annotation is especially large,
yet large amounts of unlabeled data may exist. In this work, we ask if we may
leverage semi-supervised learning in unlabeled video sequences and extra images
to improve the performance on urban scene segmentation, simultaneously tackling
semantic, instance, and panoptic segmentation. The goal of this work is to
avoid the construction of sophisticated, learned architectures specific to
label propagation (e.g., patch matching and optical flow). Instead, we simply
predict pseudo-labels for the unlabeled data and train subsequent models with
both human-annotated and pseudo-labeled data. The procedure is iterated
several times. As a result, our Naive-Student model, trained with such simple
yet effective iterative semi-supervised learning, attains state-of-the-art
results on all three Cityscapes benchmarks, reaching 67.8% PQ, 42.6% AP, and
85.2% mIOU on the test set. We view this work as a notable
step towards building a simple procedure to harness unlabeled video sequences
and extra images to surpass state-of-the-art performance on core computer
vision tasks.
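To make the procedure concrete, here is a minimal sketch of the iterative pseudo-labeling loop the abstract describes. The toy network, random tensors, and hyperparameters are illustrative assumptions standing in for the paper's actual panoptic segmentation models and Cityscapes data, not the authors' implementation.

```python
# Hedged sketch of iterative pseudo-label self-training. The tiny ConvNet and
# dummy tensors below are placeholders; the paper uses large segmentation
# networks trained on Cityscapes images and unlabeled video frames.
import torch
import torch.nn as nn

def make_model(num_classes=19):
    # Stand-in for a real segmentation network (19 = Cityscapes classes).
    return nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, num_classes, 1),
    )

def train(model, images, labels, steps=10, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(images), labels).backward()
        opt.step()
    return model

# Dummy stand-ins for human-annotated crops and unlabeled video frames.
x_lab = torch.randn(4, 3, 32, 32)
y_lab = torch.randint(0, 19, (4, 32, 32))
x_unlab = torch.randn(8, 3, 32, 32)

# Iteration 0: a teacher trained on human annotations only.
teacher = train(make_model(), x_lab, y_lab)

# Iterate several times: pseudo-label, retrain a fresh student, promote it.
for _ in range(3):
    with torch.no_grad():
        pseudo = teacher(x_unlab).argmax(dim=1)  # hard pseudo-labels
    student = train(make_model(),
                    torch.cat([x_lab, x_unlab]),
                    torch.cat([y_lab, pseudo]))  # human + pseudo labels
    teacher = student  # the student becomes the next teacher
```

The point of the loop is that no learned label-propagation machinery (patch matching, optical flow) is involved: the only coupling between labeled and unlabeled data is the teacher's own predictions.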
Related papers
- HomE: Homography-Equivariant Video Representation Learning [62.89516761473129]
We propose a novel method for representation learning of multi-view videos.
Our method learns an implicit mapping between different views, culminating in a representation space that maintains the homography relationship between neighboring views.
On action classification, our method obtains 96.4% 3-fold accuracy on the UCF101 dataset, outperforming most state-of-the-art self-supervised learning methods.
arXiv Detail & Related papers (2023-06-02T15:37:43Z)
- iBoot: Image-bootstrapped Self-Supervised Video Representation Learning [45.845595749486215]
Video self-supervised learning (SSL) suffers from added challenges: video datasets are typically not as large as image datasets.
We propose to utilize a strong image-based model, pre-trained with self- or language supervision, in a video representation learning framework.
The proposed algorithm is shown to learn much more efficiently in fewer epochs and with smaller batches.
arXiv Detail & Related papers (2022-06-16T17:42:48Z)
- A Pixel-Level Meta-Learner for Weakly Supervised Few-Shot Semantic Segmentation [40.27705176115985]
Few-shot semantic segmentation addresses the learning task in which only a few images with ground truth pixel-level labels are available for the novel classes of interest.
We propose a novel meta-learning framework, which predicts pseudo pixel-level segmentation masks from a limited amount of data and their semantic labels.
Our proposed learning model can be viewed as a pixel-level meta-learner.
arXiv Detail & Related papers (2021-11-02T08:28:11Z)
- Hierarchical Self-Supervised Learning for Medical Image Segmentation Based on Multi-Domain Data Aggregation [23.616336382437275]
We propose Hierarchical Self-Supervised Learning (HSSL) for medical image segmentation.
We first aggregate a dataset from several medical challenges, then pre-train the network in a self-supervised manner, and finally fine-tune on labeled data.
Compared to learning from scratch, our new method yields better performance on various tasks.
arXiv Detail & Related papers (2021-07-10T18:17:57Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use a contrastive loss with video clips as the instances and learn visual representations by discriminating instances from each other (a minimal sketch of this clip-level contrastive objective appears after this list).
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Learning to Track Instances without Video Annotations [85.9865889886669]
We introduce a novel semi-supervised framework by learning instance tracking networks with only a labeled image dataset and unlabeled video sequences.
We show that even when only trained with images, the learned feature representation is robust to instance appearance variations.
In addition, we integrate this module into single-stage instance segmentation and pose estimation frameworks.
arXiv Detail & Related papers (2021-04-01T06:47:41Z)
- Visual Distant Supervision for Scene Graph Generation [66.10579690929623]
Scene graph models usually require supervised learning on large quantities of labeled data with intensive human annotation.
We propose visual distant supervision, a novel paradigm of visual relation learning, which can train scene graph models without any human-labeled data.
Comprehensive experimental results show that our distantly supervised model outperforms strong weakly supervised and semi-supervised baselines.
arXiv Detail & Related papers (2021-03-29T06:35:24Z)
- Improving Semantic Segmentation via Self-Training [75.07114899941095]
We show that we can obtain state-of-the-art results using a semi-supervised approach, specifically a self-training paradigm.
We first train a teacher model on labeled data, and then generate pseudo labels on a large set of unlabeled data.
Our robust training framework can digest human-annotated and pseudo labels jointly and achieve top performance on the Cityscapes, CamVid and KITTI datasets (a sketch of one way to combine human and pseudo labels in a single loss appears after this list).
arXiv Detail & Related papers (2020-04-30T17:09:17Z)
- Semi-supervised few-shot learning for medical image segmentation [21.349705243254423]
Recent work aiming to alleviate the need for large annotated datasets has developed training strategies under the few-shot learning paradigm.
We propose a novel few-shot learning framework for semantic segmentation, where unlabeled images are also made available at each episode.
We show that including unlabeled surrogate tasks in the episodic training leads to more powerful feature representations.
arXiv Detail & Related papers (2020-03-18T20:37:18Z)
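For the ASCNet entry above, here is a minimal sketch of the clip-level contrastive (InfoNCE-style) objective its summary attributes to existing methods: each clip is an instance, two augmented views of the same clip form the positive pair, and the other clips in the batch act as negatives. The embedding dimension, batch size, and temperature are assumptions for illustration.

```python
# Illustrative InfoNCE-style instance discrimination over video clips.
# Shapes and the 0.1 temperature are assumptions, not taken from the paper.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (N, D) embeddings of two augmented views of the same N clips."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature   # (N, N) pairwise similarities
    targets = torch.arange(z1.size(0))   # matching views sit on the diagonal
    return F.cross_entropy(logits, targets)

# Two augmented views of a batch of 8 clips, each encoded to a 128-d vector.
loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
```

ASCNet's stated observation is then about these positive pairs (the diagonal entries): their consistency is what drives robust video representations.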
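For the self-training entry above, here is a minimal sketch of one way a loss can digest human-annotated and pseudo labels jointly. The confidence threshold and the down-weighting of the pseudo-label term are common design choices assumed here for illustration; they are not details taken from that paper.

```python
# Illustrative joint loss over human and pseudo labels. The 0.9 confidence
# threshold and 0.5 pseudo-label weight are assumptions, not the paper's.
import torch
import torch.nn.functional as F

def joint_loss(model, x_human, y_human, x_unlab, teacher,
               threshold=0.9, pseudo_weight=0.5):
    # Supervised term on human annotations.
    sup = F.cross_entropy(model(x_human), y_human)

    # Pseudo-label term: keep only pixels the teacher predicts confidently.
    with torch.no_grad():
        probs = teacher(x_unlab).softmax(dim=1)
        conf, pseudo = probs.max(dim=1)
    unsup = F.cross_entropy(model(x_unlab), pseudo, reduction="none")
    mask = (conf >= threshold).float()
    unsup = (unsup * mask).sum() / mask.sum().clamp(min=1.0)

    return sup + pseudo_weight * unsup

# Toy usage with 1x1-conv stand-ins for the student and teacher networks.
loss = joint_loss(torch.nn.Conv2d(3, 19, 1),
                  torch.randn(2, 3, 16, 16),
                  torch.randint(0, 19, (2, 16, 16)),
                  torch.randn(4, 3, 16, 16),
                  torch.nn.Conv2d(3, 19, 1))
```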