Self-Supervised Video Object Segmentation via Cutout Prediction and
Tagging
- URL: http://arxiv.org/abs/2204.10846v1
- Date: Fri, 22 Apr 2022 17:53:27 GMT
- Title: Self-Supervised Video Object Segmentation via Cutout Prediction and
Tagging
- Authors: Jyoti Kini and Fahad Shahbaz Khan and Salman Khan and Mubarak Shah
- Abstract summary: We propose a novel self-supervised Video Object Segmentation (VOS) approach that strives to achieve better object-background discriminability.
Our approach is based on a discriminative learning loss formulation that takes into account both object and background information.
Our proposed approach, CT-VOS, achieves state-of-the-art results on two challenging benchmarks: DAVIS-2017 and YouTube-VOS.
- Score: 117.73967303377381
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We propose a novel self-supervised Video Object Segmentation (VOS) approach
that strives to achieve better object-background discriminability for accurate
object segmentation. Distinct from previous self-supervised VOS methods, our
approach is based on a discriminative learning loss formulation that takes into
account both object and background information to ensure object-background
discriminability, rather than using only object appearance. The discriminative
learning loss comprises cutout-based reconstruction (a cutout region is a part of
a frame whose pixels are replaced with a constant value) and tag prediction loss
terms. The cutout-based reconstruction term utilizes a simple cutout scheme to
learn the pixel-wise correspondence between the current and previous frames in
order to reconstruct the original current frame containing the added cutout
region. The introduced cutout patch guides the model to focus on the less
significant features of the object of interest as much as on the most significant
ones, thereby implicitly equipping the model to handle occlusion-based scenarios.
Next, the tag prediction term encourages object-background separability by
grouping the similar tags of pixels in the cutout region while separating them
from the tags of the remaining reconstructed frame pixels. Additionally, we introduce a zoom-in scheme that
addresses the problem of small object segmentation by capturing fine structural
information at multiple scales. Our proposed approach, termed CT-VOS, achieves
state-of-the-art results on two challenging benchmarks: DAVIS-2017 and
YouTube-VOS. A detailed ablation showcases the importance of the proposed loss
formulation to effectively capture object-background discriminability and the
impact of our zoom-in scheme to accurately segment small-sized objects.
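To make the loss formulation concrete, the following is a minimal, hypothetical PyTorch sketch of the two loss terms and the zoom-in cropping described in the abstract. The function names, the L1 reconstruction objective, the pull/push tag grouping, and all hyperparameters are illustrative assumptions, not the authors' released implementation.

    # Illustrative sketch only (assumed PyTorch implementation; not the authors' code).
    import torch
    import torch.nn.functional as F

    def apply_cutout(frame, box, fill_value=0.0):
        # frame: (C, H, W); box: (y0, y1, x0, x1). Pixels inside the box are
        # replaced with a constant value, as described for the cutout region.
        corrupted = frame.clone()
        y0, y1, x0, x1 = box
        corrupted[:, y0:y1, x0:x1] = fill_value
        mask = torch.zeros(frame.shape[1:], dtype=torch.bool)
        mask[y0:y1, x0:x1] = True
        return corrupted, mask

    def reconstruction_loss(reconstructed, original):
        # Cutout-based reconstruction term: penalize the difference between the
        # frame reconstructed via correspondence with the previous frame and the
        # original current frame.
        return F.l1_loss(reconstructed, original)

    def tag_prediction_loss(tags, cutout_mask, margin=1.0):
        # Pull the per-pixel tags inside the cutout region toward a common
        # centroid and push tags of the remaining pixels away from it (a simple
        # pull/push grouping; the paper's exact formulation may differ).
        # tags: (D, H, W) per-pixel embeddings; cutout_mask: (H, W) bool.
        d = tags.shape[0]
        flat = tags.reshape(d, -1).t()        # (H*W, D)
        mask = cutout_mask.reshape(-1)
        inside, outside = flat[mask], flat[~mask]
        center = inside.mean(dim=0, keepdim=True)
        pull = (inside - center).pow(2).sum(dim=1).mean()
        push = F.relu(margin - (outside - center).pow(2).sum(dim=1)).mean()
        return pull + push

    def zoom_in_crops(frame, boxes, out_size=(384, 384)):
        # Zoom-in scheme (illustrative): crop candidate regions and upsample
        # them so small objects are processed at a finer scale.
        crops = [F.interpolate(frame[:, y0:y1, x0:x1].unsqueeze(0),
                               size=out_size, mode="bilinear",
                               align_corners=False)
                 for y0, y1, x0, x1 in boxes]
        return torch.cat(crops, dim=0)

In training, the two loss terms would presumably be combined (e.g., as a weighted sum) with the correspondence-based reconstruction network; the margin, fill value, and crop size above are placeholder hyperparameters.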
Related papers
- Pixel-Level Domain Adaptation: A New Perspective for Enhancing Weakly Supervised Semantic Segmentation [13.948425538725138]
We propose a Pixel-Level Domain Adaptation (PLDA) method to encourage the model in learning pixel-wise domain-invariant features.
We experimentally demonstrate the effectiveness of our approach under a wide range of settings.
arXiv Detail & Related papers (2024-08-04T14:14:54Z) - Spatial Structure Constraints for Weakly Supervised Semantic
Segmentation [100.0316479167605]
A class activation map (CAM) can only locate the most discriminative part of objects.
We propose spatial structure constraints (SSC) for weakly supervised semantic segmentation to alleviate the unwanted object over-activation of attention expansion.
Our approach achieves 72.7% and 47.0% mIoU on the PASCAL VOC 2012 and COCO datasets, respectively.
arXiv Detail & Related papers (2024-01-20T05:25:25Z) - Inter-object Discriminative Graph Modeling for Indoor Scene Recognition [5.712940060321454]
We propose to leverage discriminative object knowledge to enhance scene feature representations.
We construct a Discriminative Graph Network (DGN) in which pixel-level scene features are defined as nodes.
With the proposed IODP and DGN, we obtain state-of-the-art results on several widely used scene datasets.
arXiv Detail & Related papers (2023-11-10T08:07:16Z) - Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z) - Sharp Eyes: A Salient Object Detector Working The Same Way as Human
Visual Characteristics [3.222802562733787]
We propose a sharp eyes network (SENet) that first separates the object from the scene, and then finely segments it.
The proposed method aims to utilize the expanded objects to guide the network to obtain complete predictions.
arXiv Detail & Related papers (2023-01-18T11:00:45Z) - Anti-Adversarially Manipulated Attributions for Weakly Supervised
Semantic Segmentation and Object Localization [31.69344455448125]
We present an attribution map of an image that is manipulated to increase the classification score produced by a classifier before the final softmax or sigmoid layer.
This manipulation is realized in an anti-adversarial manner, so that the original image is perturbed along pixel gradients in directions opposite to those used in an adversarial attack.
In addition, we introduce a new regularization procedure that inhibits the incorrect attribution of regions unrelated to the target object and the excessive concentration of attributions on a small region of the target object.
arXiv Detail & Related papers (2022-04-11T06:18:02Z) - High-resolution Iterative Feedback Network for Camouflaged Object
Detection [128.893782016078]
Spotting camouflaged objects that are visually assimilated into the background is tricky for object detection algorithms.
We aim to extract the high-resolution texture details to avoid the detail degradation that causes blurred vision in edges and boundaries.
We introduce a novel HitNet to refine the low-resolution representations by high-resolution features in an iterative feedback manner.
arXiv Detail & Related papers (2022-03-22T11:20:21Z) - Unsupervised Part Discovery from Contrastive Reconstruction [90.88501867321573]
The goal of self-supervised visual representation learning is to learn strong, transferable image representations.
We propose an unsupervised approach to object part discovery and segmentation.
Our method yields semantic parts consistent across fine-grained but visually distinct categories.
arXiv Detail & Related papers (2021-11-11T17:59:42Z) - Self-supervised Segmentation via Background Inpainting [96.10971980098196]
We introduce a self-supervised detection and segmentation approach that can work with single images captured by a potentially moving camera.
We exploit a self-supervised loss function to train a proposal-based segmentation network.
We apply our method to human detection and segmentation in images that visually depart from those of standard benchmarks and outperform existing self-supervised methods.
arXiv Detail & Related papers (2020-11-11T08:34:40Z) - A Weakly-Supervised Semantic Segmentation Approach based on the Centroid
Loss: Application to Quality Control and Inspection [6.101839518775968]
We propose and assess a new weakly-supervised semantic segmentation approach making use of a novel loss function.
The performance of the approach is evaluated against datasets from two different industry-related case studies.
arXiv Detail & Related papers (2020-10-26T09:08:21Z)