Differentiable Soft-Masked Attention
- URL: http://arxiv.org/abs/2206.00182v1
- Date: Wed, 1 Jun 2022 02:05:13 GMT
- Title: Differentiable Soft-Masked Attention
- Authors: Ali Athar, Jonathon Luiten, Alexander Hermans, Deva Ramanan, Bastian Leibe
- Abstract summary: "Differentiable Soft-Masked Attention" is used for the task of Weakly-Supervised Video Object Segmentation (VOS).
We develop a transformer-based network for VOS which only requires a single annotated image frame for training, but can also benefit from cycle consistency training on a video with just one annotated frame.
- Score: 115.5770357189209
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have become prevalent in computer vision due to their
performance and flexibility in modelling complex operations. Of particular
significance is the 'cross-attention' operation, which allows a vector
representation (e.g. of an object in an image) to be learned by attending to an
arbitrarily sized set of input features. Recently, "Masked Attention" was
proposed in which a given object representation only attends to those image
pixel features for which the segmentation mask of that object is active. This
specialization of attention proved beneficial for various image and video
segmentation tasks. In this paper, we propose another specialization of
attention which enables attending over 'soft-masks' (those with continuous mask
probabilities instead of binary values), and is also differentiable through
these mask probabilities, thus allowing the mask used for attention to be
learned within the network without requiring direct loss supervision. This can
be useful for several applications. Specifically, we employ our "Differentiable
Soft-Masked Attention" for the task of Weakly-Supervised Video Object
Segmentation (VOS), where we develop a transformer-based network for VOS which
only requires a single annotated image frame for training, but can also benefit
from cycle consistency training on a video with just one annotated frame.
Although there is no loss for masks in unlabeled frames, the network is still
able to segment objects in those frames due to our novel attention formulation.
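The abstract describes soft-masked attention only in words. Below is a minimal, single-query sketch in plain Python that contrasts binary masked attention with one possible soft-masked variant that is differentiable in the mask probabilities. Assumptions: the log-mask logit bias, the 0.5 binarization threshold, the all-masked fallback, the `eps` clamp, and every function name (`attend`, `hard_masked_attention`, `soft_masked_attention`, `grad_wrt_mask`) are hypothetical illustrations, not the formulation used in the paper.

```python
import math


def attend(query, keys, values, bias):
    """Single-query scaled dot-product attention with an additive logit bias."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) + b
              for key, b in zip(keys, bias)]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    weights = [e / total for e in exps]
    out = [sum(w * v[i] for w, v in zip(weights, values))
           for i in range(len(values[0]))]
    return out, weights


def hard_masked_attention(query, keys, values, mask_probs, threshold=0.5):
    """Binary masking: pixels at or below the threshold get a -inf logit bias."""
    bias = [0.0 if m > threshold else -math.inf for m in mask_probs]
    if all(b == -math.inf for b in bias):
        bias = [0.0] * len(bias)  # assumed fallback: attend everywhere
    return attend(query, keys, values, bias)


def soft_masked_attention(query, keys, values, mask_probs, eps=1e-6):
    """Soft masking (hypothetical): bias each logit by log(mask probability).

    The weights become proportional to m_j * exp(z_j), so they vary smoothly
    with the mask, and a 0/1 mask approximately recovers the binary case
    (the eps clamp turns log(0) into a large negative number, not -inf).
    """
    bias = [math.log(max(m, eps)) for m in mask_probs]
    return attend(query, keys, values, bias)


def grad_wrt_mask(query, keys, values, mask_probs):
    """Analytic d(out)/d(m_j) = (w_j / m_j) * (v_j - out), valid for m_j > eps."""
    out, weights = soft_masked_attention(query, keys, values, mask_probs)
    return [[w / m * (v[i] - out[i]) for i in range(len(out))]
            for w, m, v in zip(weights, mask_probs, values)]


if __name__ == "__main__":
    q = [1.0, 0.0]
    K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three "pixel" keys
    V = [[1.0], [2.0], [3.0]]                 # matching values
    m = [0.9, 0.2, 0.6]                       # continuous mask probabilities
    print(hard_masked_attention(q, K, V, m)[0])  # [2.0]
    print(soft_masked_attention(q, K, V, m)[0])
    print(grad_wrt_mask(q, K, V, m))
```

Design note: with a hard mask, pixel 1 receives exactly zero weight and no gradient flows to its mask value. With the log-mask bias, every mask probability affects the output through a closed-form gradient, which is the kind of property needed to learn the mask inside the network without a direct mask loss. The paper's actual soft-masking formulation may differ from this sketch.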
Related papers
- Variance-insensitive and Target-preserving Mask Refinement for Interactive Image Segmentation [68.16510297109872]
Point-based interactive image segmentation can ease the burden of mask annotation in applications such as semantic segmentation and image editing.
We introduce a novel method, Variance-Insensitive and Target-Preserving Mask Refinement to enhance segmentation quality with fewer user inputs.
Experiments on GrabCut, Berkeley, SBD, and DAVIS datasets demonstrate our method's state-of-the-art performance in interactive image segmentation.
Date: 2023-12-22T02:31:31Z
- Segment (Almost) Nothing: Prompt-Agnostic Adversarial Attacks on Segmentation Models [61.46999584579775]
General purpose segmentation models are able to generate (semantic) segmentation masks from a variety of prompts.
In particular, input images are pre-processed by an image encoder to obtain embedding vectors which are later used for mask predictions.
We show that even imperceptible perturbations of radius $\epsilon=1/255$ are often sufficient to drastically modify the masks predicted with point, box and text prompts.
Date: 2023-11-24T12:57:34Z
- Siamese Masked Autoencoders [76.35448665609998]
We present Siamese Masked Autoencoders (SiamMAE) for learning visual correspondence from videos.
SiamMAE operates on pairs of randomly sampled video frames and asymmetrically masks them.
It outperforms state-of-the-art self-supervised methods on video object segmentation, pose keypoint propagation, and semantic part propagation tasks.
Date: 2023-05-23T17:59:46Z
- GANSeg: Learning to Segment by Unsupervised Hierarchical Image Generation [16.900404701997502]
We propose a GAN-based approach that generates images conditioned on latent masks.
We show that such mask-conditioned image generation can be learned faithfully when conditioning the masks in a hierarchical manner.
It also lets us generate image-mask pairs for training a segmentation network, which outperforms the state-of-the-art unsupervised segmentation methods on established benchmarks.
Date: 2021-12-02T07:57:56Z
- Learning To Segment Dominant Object Motion From Watching Videos [72.57852930273256]
We envision a simple framework for dominant moving object segmentation that neither requires annotated data to train nor relies on saliency priors or pre-trained optical flow maps.
Inspired by a layered image representation, we introduce a technique to group pixel regions according to their affine parametric motion.
This enables our network to learn segmentation of the dominant foreground object using only RGB image pairs as input for both training and inference.
Date: 2021-11-28T14:51:00Z
- Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling [61.03262873980619]
Open-vocabulary instance segmentation aims at segmenting novel classes without mask annotations.
We propose a cross-modal pseudo-labeling framework, which generates training pseudo masks by aligning word semantics in captions with visual features of object masks in images.
Our framework is capable of labeling novel classes in captions via their word semantics to self-train a student model.
Date: 2021-11-24T18:50:47Z
- Instance Semantic Segmentation Benefits from Generative Adversarial Networks [13.295723883560122]
We formulate mask prediction as a game between generative adversarial networks (GANs).
A segmentation network generates the masks, and a discriminator network decides on the quality of the masks.
We report results on cellphone recycling, autonomous driving, large-scale object detection, and medical gland segmentation.
Date: 2020-10-26T17:47:30Z