Reliability-Hierarchical Memory Network for Scribble-Supervised Video
Object Segmentation
- URL: http://arxiv.org/abs/2303.14384v1
- Date: Sat, 25 Mar 2023 07:21:40 GMT
- Title: Reliability-Hierarchical Memory Network for Scribble-Supervised Video
Object Segmentation
- Authors: Zikun Zhou, Kaige Mao, Wenjie Pei, Hongpeng Wang, Yaowei Wang, Zhenyu
He
- Abstract summary: This paper aims to solve the video object segmentation (VOS) task in a scribble-supervised manner.
We propose a scribble-supervised learning mechanism that enables our model to learn to predict dense results.
- Score: 25.59883486325534
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper aims to solve the video object segmentation (VOS) task in a
scribble-supervised manner, in which VOS models are not only trained by the
sparse scribble annotations but also initialized with the sparse target
scribbles for inference. Thus, the annotation burdens for both training and
initialization can be substantially lightened. The difficulties of
scribble-supervised VOS lie in two aspects. On the one hand, it requires the
powerful ability to learn from the sparse scribble annotations during training.
On the other hand, it demands strong reasoning capability during inference
given only a sparse initial target scribble. In this work, we propose a
Reliability-Hierarchical Memory Network (RHMNet) that predicts the target mask
via a step-wise expansion strategy with respect to the memory reliability
level. Specifically, RHMNet first uses only the high-reliability memory to
locate the high-reliability region belonging to the target, i.e., the region
highly similar to the initial target scribble. It then expands this
high-reliability region to the entire target, conditioned on the region itself
and the memories at all reliability levels. In addition, we propose a
scribble-supervised learning mechanism that enables our model to learn to
predict dense results. It mines pixel-level relations within each frame and
frame-level relations across the sequence to take full advantage
of the scribble annotations in sequence training samples. The favorable
performance on two popular benchmarks demonstrates that our method is
promising.
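The two-stage, reliability-hierarchical prediction described in the abstract can be sketched as follows. This is a minimal illustration only: the paper's expansion step is a learned, conditioned module, whereas here a simple cosine-similarity memory read with hypothetical thresholds (`tau_high`, `tau_full`) stands in for it; all function names are assumptions, not the authors' API.

```python
import numpy as np

def max_cosine_similarity(features, memory):
    """Max cosine similarity of each pixel feature to any memory feature.

    features: (N, D) pixel features; memory: (M, D) memory features.
    """
    f = features / np.linalg.norm(features, axis=-1, keepdims=True)
    m = memory / np.linalg.norm(memory, axis=-1, keepdims=True)
    return (f @ m.T).max(axis=-1)

def segment_frame(features, high_rel_memory, all_memory,
                  tau_high=0.8, tau_full=0.5):
    """Illustrative two-stage, reliability-hierarchical mask prediction.

    Stage 1: match against the high-reliability memory only, to locate
    the region most confidently belonging to the target.
    Stage 2: expand toward the entire target using memories at all
    reliability levels (a crude stand-in for the learned, conditioned
    expansion in RHMNet).
    """
    # Stage 1: high-reliability localization (seed region).
    seed = max_cosine_similarity(features, high_rel_memory) > tau_high
    # Stage 2: expansion with a looser threshold over all memory levels.
    expanded = max_cosine_similarity(features, all_memory) > tau_full
    # Pixels accepted by either stage form the final mask.
    return seed | expanded
```

In a full pipeline the memory would be updated frame by frame, with newly predicted regions inserted at a reliability level reflecting how they were obtained (initial scribble vs. stage-1 vs. stage-2 predictions).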
Related papers
- A Fresh Take on Stale Embeddings: Improving Dense Retriever Training with Corrector Networks [81.2624272756733]
In dense retrieval, deep encoders provide embeddings for both inputs and targets.
We train a small parametric corrector network that adjusts stale cached target embeddings.
Our approach matches state-of-the-art results even when no target embedding updates are made during training.
arXiv Detail & Related papers (2024-09-03T13:29:13Z)
- Improving Weakly-Supervised Object Localization Using Adversarial Erasing and Pseudo Label [7.400926717561454]
This paper investigates a framework for weakly-supervised object localization.
It aims to train a neural network capable of predicting both the object class and its location using only images and their image-level class labels.
arXiv Detail & Related papers (2024-04-15T06:02:09Z)
- Semi-supervised Semantic Segmentation Meets Masked Modeling: Fine-grained Locality Learning Matters in Consistency Regularization [31.333862320143968]
Semi-supervised semantic segmentation aims to utilize limited labeled images and abundant unlabeled images to achieve label-efficient learning.
We propose a novel framework called MaskMatch, which enables fine-grained locality learning to achieve better dense segmentation.
arXiv Detail & Related papers (2023-12-14T03:28:53Z)
- 2D Feature Distillation for Weakly- and Semi-Supervised 3D Semantic Segmentation [92.17700318483745]
We propose an image-guidance network (IGNet) which builds upon the idea of distilling high level feature information from a domain adapted synthetically trained 2D semantic segmentation network.
IGNet achieves state-of-the-art results for weakly-supervised LiDAR semantic segmentation on ScribbleKITTI, boasting up to 98% relative performance to fully supervised training with only 8% labeled points.
arXiv Detail & Related papers (2023-11-27T07:57:29Z)
- Learning Referring Video Object Segmentation from Weak Annotation [78.45828085350936]
Referring video object segmentation (RVOS) is a task that aims to segment the target object in all video frames based on a sentence describing the object.
We propose a new annotation scheme that reduces the annotation effort by 8 times, while providing sufficient supervision for RVOS.
Our scheme only requires a mask for the frame where the object first appears and bounding boxes for the rest of the frames.
arXiv Detail & Related papers (2023-08-04T06:50:52Z)
- Learning to Learn Better for Video Object Segmentation [94.5753973590207]
We propose a novel framework that emphasizes Learning to Learn Better (LLB) target features for SVOS.
We design the discriminative label generation module (DLGM) and the adaptive fusion module to address these issues.
Our proposed LLB method achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-12-05T09:10:34Z)
- Boundary-aware Self-supervised Learning for Video Scene Segmentation [20.713635723315527]
Video scene segmentation is a task of temporally localizing scene boundaries in a video.
We introduce three novel boundary-aware pretext tasks: Shot-Scene Matching, Contextual Group Matching and Pseudo-boundary Prediction.
We achieve the new state-of-the-art on the MovieNet-SSeg benchmark.
arXiv Detail & Related papers (2022-01-14T02:14:07Z)
- Learning Position and Target Consistency for Memory-based Video Object Segmentation [39.787966275016906]
We propose a position and target consistency learning framework for memory-based video object segmentation.
It applies the memory mechanism to retrieve pixels globally, and meanwhile learns position consistency for more reliable segmentation.
Experiments show that our LCM achieves state-of-the-art performance on both the DAVIS and YouTube-VOS benchmarks.
arXiv Detail & Related papers (2021-04-09T12:22:37Z)
- Learning What to Learn for Video Object Segmentation [157.4154825304324]
We introduce an end-to-end trainable VOS architecture that integrates a differentiable few-shot learning module.
This internal learner is designed to predict a powerful parametric model of the target.
We set a new state-of-the-art on the large-scale YouTube-VOS 2018 dataset by achieving an overall score of 81.5.
arXiv Detail & Related papers (2020-03-25T17:58:43Z)
- Towards Using Count-level Weak Supervision for Crowd Counting [55.58468947486247]
This paper studies weakly-supervised crowd counting, which learns a model from only a small amount of location-level annotations (fully supervised) and a large amount of count-level annotations (weakly supervised).
We devise a simple yet effective training strategy, namely Multiple Auxiliary Tasks Training (MATT), to construct regularizers for restricting the freedom of the generated density maps.
arXiv Detail & Related papers (2020-02-29T02:58:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.