WeClick: Weakly-Supervised Video Semantic Segmentation with Click
Annotations
- URL: http://arxiv.org/abs/2107.03088v1
- Date: Wed, 7 Jul 2021 09:12:46 GMT
- Title: WeClick: Weakly-Supervised Video Semantic Segmentation with Click
Annotations
- Authors: Peidong Liu, Zibin He, Xiyu Yan, Yong Jiang, Shutao Xia, Feng Zheng,
Maowei Hu
- Abstract summary: We propose an effective weakly-supervised video semantic segmentation pipeline with click annotations, called WeClick.
Since detailed semantic information is not captured by clicks, directly training with click labels leads to poor segmentation predictions.
WeClick outperforms state-of-the-art methods, improves performance by 10.24% mIoU over the baseline, and achieves real-time execution.
- Score: 64.52412111417019
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Compared with tedious per-pixel mask annotation, it is much easier to
annotate data with clicks, which cost only a few seconds per image.
However, learning a video semantic segmentation model from clicks has not
been explored before. In this work, we propose an effective weakly-supervised
video semantic segmentation pipeline with click annotations, called WeClick,
which saves laborious annotation effort by segmenting an instance of a
semantic class with only a single click. Since detailed semantic information is
not captured by clicks, directly training with click labels leads to poor
segmentation predictions. To mitigate this problem, we design a novel memory
flow knowledge distillation strategy to exploit temporal information (named
memory flow) in abundant unlabeled video frames, by distilling the neighboring
predictions to the target frame via estimated motion. Moreover, we adopt
vanilla knowledge distillation for model compression. In this way, WeClick
learns compact video semantic segmentation models from low-cost click
annotations during training, yet delivers accurate, real-time models at
inference time. Experimental results on Cityscapes and CamVid show that
WeClick outperforms state-of-the-art methods, improves performance by
10.24% mIoU over the baseline, and runs in real time.
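To make the two distillation terms concrete, below is a minimal PyTorch-style sketch of (i) the memory-flow distillation idea, which warps a neighboring frame's prediction to the target frame with an estimated flow and uses it as a soft target, and (ii) vanilla knowledge distillation from a larger teacher. The function names, tensor shapes, warping details, and temperatures are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def warp(logits, flow):
    """Warp per-pixel logits from a neighboring frame to the target frame
    using an estimated flow field of shape (B, 2, H, W). Illustrative only."""
    _, _, h, w = logits.shape
    # Build a sampling grid displaced by the flow, normalized to [-1, 1].
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(logits.device)   # (2, H, W)
    coords = grid.unsqueeze(0) + flow                               # (B, 2, H, W)
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)         # (B, H, W, 2)
    return F.grid_sample(logits, sample_grid, align_corners=True)

def memory_flow_kd_loss(student_logits_t, student_logits_neighbor, flow, T=1.0):
    """Distill the warped neighboring-frame prediction into the target frame.
    Both inputs are raw logits (B, C, H, W); flow maps target -> neighbor."""
    with torch.no_grad():
        soft_target = F.softmax(warp(student_logits_neighbor, flow) / T, dim=1)
    log_pred = F.log_softmax(student_logits_t / T, dim=1)
    return F.kl_div(log_pred, soft_target, reduction="batchmean") * T * T

def vanilla_kd_loss(student_logits, teacher_logits, T=4.0):
    """Standard knowledge distillation from a large teacher to the compact student."""
    soft_target = F.softmax(teacher_logits.detach() / T, dim=1)
    log_pred = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_pred, soft_target, reduction="batchmean") * T * T
```

In training, such terms would typically be combined with a supervised loss computed only on the clicked pixels; the relative weighting is a hyperparameter not specified here.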
Related papers
- RClicks: Realistic Click Simulation for Benchmarking Interactive Segmentation [37.44155289954746]
We conduct a large crowdsourcing study of click patterns in an interactive segmentation scenario and collect 475K real-user clicks.
Using our model and dataset, we propose the RClicks benchmark for a comprehensive comparison of existing interactive segmentation methods on realistic clicks.
According to our benchmark, interactive segmentation models may perform worse in real-world usage than reported on the baseline benchmark, and most methods are not robust.
arXiv Detail & Related papers (2024-10-15T15:55:00Z)
- IFSENet: Harnessing Sparse Iterations for Interactive Few-shot Segmentation Excellence [2.822194296769473]
Few-shot segmentation techniques reduce the number of images required to learn to segment a new class.
Interactive segmentation techniques focus only on incrementally improving the segmentation of one object at a time.
We combine the two concepts to drastically reduce the effort required to train segmentation models for novel classes.
arXiv Detail & Related papers (2024-03-22T10:15:53Z)
- Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations [79.87044240860466]
We propose a novel approach that incorporates temporal consistency in dense self-supervised learning.
Our approach, which we call time-tuning, starts from image-pretrained models and fine-tunes them with a novel self-supervised temporal-alignment clustering loss on unlabeled videos.
Time-tuning improves the state-of-the-art by 8-10% for unsupervised semantic segmentation on videos and matches it for images.
arXiv Detail & Related papers (2023-08-22T21:28:58Z)
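As a rough illustration of the temporal-alignment clustering idea in the time-tuning entry above (not the authors' exact loss), the cluster assignments of one frame can serve as soft targets for the temporally corresponding features of the next frame; the prototype count, temperature, and alignment assumption below are arbitrary.

```python
import torch
import torch.nn.functional as F

def temporal_alignment_clustering_loss(feat_t, feat_t1, prototypes, temp=0.1):
    """feat_t, feat_t1: L2-normalized dense features (B, N, D) for two frames whose
    N locations are assumed temporally aligned; prototypes: (K, D) cluster centers."""
    logits_t = feat_t @ prototypes.t() / temp        # (B, N, K)
    logits_t1 = feat_t1 @ prototypes.t() / temp
    with torch.no_grad():
        # Assignments of frame t act as stop-gradient soft targets for frame t+1.
        targets = F.softmax(logits_t, dim=-1)
    log_probs = F.log_softmax(logits_t1, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()
```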
- Less than Few: Self-Shot Video Instance Segmentation [50.637278655763616]
We propose to automatically learn to find appropriate support videos given a query.
We tackle, for the first time, video instance segmentation in a self-shot (and few-shot) setting.
We provide strong baseline performances that utilize a novel transformer-based model.
arXiv Detail & Related papers (2022-04-19T13:14:43Z)
- Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead.
arXiv Detail & Related papers (2021-04-02T18:59:09Z)
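The collaborative-memory idea above can be illustrated with a toy module that pools the features of all clips sampled from a video in one iteration and feeds the pooled memory back to each clip's classifier; the averaging memory, dimensions, and fusion layer below are simplifying assumptions rather than the paper's mechanism.

```python
import torch
import torch.nn as nn

class CollaborativeMemoryHead(nn.Module):
    """Toy head: each clip's feature is fused with a memory aggregated over
    all clips of the same video sampled in the current iteration."""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, clip_feats):
        # clip_feats: (num_clips, B, D) features of several clips per video.
        memory = clip_feats.mean(dim=0, keepdim=True)        # (1, B, D) shared memory
        memory = memory.expand_as(clip_feats)                # broadcast to every clip
        fused = torch.relu(self.fuse(torch.cat([clip_feats, memory], dim=-1)))
        return self.classifier(fused).mean(dim=0)            # video-level logits
```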
- Composable Augmentation Encoding for Video Representation Learning [94.2358972764708]
We focus on contrastive methods for self-supervised video representation learning.
A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives.
We propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations.
We show that our method encodes valuable information about the specified spatial or temporal augmentations and, in doing so, achieves state-of-the-art performance on a number of video benchmarks.
arXiv Detail & Related papers (2021-04-01T16:48:53Z)
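A rough sketch of the augmentation-aware idea above: the contrastive projection is conditioned on an encoding of the augmentation parameters that relate the two views. The parameter encoding, dimensions, and two-head design are hypothetical choices, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AugmentationAwareProjector(nn.Module):
    """Toy projector: the query branch additionally sees an encoding of the
    augmentation parameters (e.g. crop box, temporal offset) of the key view."""
    def __init__(self, feat_dim=2048, aug_dim=8, proj_dim=128):
        super().__init__()
        self.aug_encoder = nn.Sequential(nn.Linear(aug_dim, 64), nn.ReLU(),
                                         nn.Linear(64, 64))
        self.query_proj = nn.Linear(feat_dim + 64, proj_dim)
        self.key_proj = nn.Linear(feat_dim, proj_dim)

    def forward(self, query_feat, key_feat, aug_params):
        aug_code = self.aug_encoder(aug_params)                          # (B, 64)
        q = F.normalize(self.query_proj(torch.cat([query_feat, aug_code], -1)), dim=-1)
        k = F.normalize(self.key_proj(key_feat), dim=-1)
        return q, k   # fed into a standard contrastive (InfoNCE) loss
```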
- Semi-Supervised Action Recognition with Temporal Contrastive Learning [50.08957096801457]
We learn a two-pathway temporal contrastive model using unlabeled videos at two different speeds.
We considerably outperform video extensions of sophisticated state-of-the-art semi-supervised image recognition methods.
arXiv Detail & Related papers (2021-02-04T17:28:35Z)
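The two-speed temporal contrastive entry above can be illustrated with a generic symmetric InfoNCE loss that contrasts embeddings of the same video sampled at two playback speeds; the encoders, batching, and temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def two_speed_contrastive_loss(z_slow, z_fast, temp=0.07):
    """z_slow, z_fast: (B, D) embeddings of the same B videos at two playback
    speeds. Matching indices are positives; all other pairs are negatives."""
    z_slow = F.normalize(z_slow, dim=-1)
    z_fast = F.normalize(z_fast, dim=-1)
    logits = z_slow @ z_fast.t() / temp                   # (B, B) similarity matrix
    targets = torch.arange(z_slow.size(0), device=z_slow.device)
    # Symmetric InfoNCE: each speed must retrieve its counterpart.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```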
- Motion2Vec: Semi-Supervised Representation Learning from Surgical Videos [23.153335327822685]
We learn a motion-centric representation of surgical video demonstrations by grouping them into action segments/sub-goals/options.
We present Motion2Vec, an algorithm that learns a deep embedding feature space from video observations.
We demonstrate the use of this representation to imitate surgical suturing motions from publicly available videos of the JIGSAWS dataset.
arXiv Detail & Related papers (2020-05-31T15:46:01Z)
- Dropout Prediction over Weeks in MOOCs by Learning Representations of Clicks and Videos [6.030785848148107]
We develop a method to learn representations for videos and the correlation between videos and clicks.
The results indicate that modeling videos and their correlation with clicks brings statistically significant improvements in predicting dropout.
arXiv Detail & Related papers (2020-02-05T19:10:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.