Related papers: Differentiable Soft-Masked Attention

Differentiable Soft-Masked Attention

URL: http://arxiv.org/abs/2206.00182v1
Date: Wed, 1 Jun 2022 02:05:13 GMT
Title: Differentiable Soft-Masked Attention
Authors: Ali Athar, Jonathon Luiten, Alexander Hermans, Deva Ramanan, Bastian Leibe
Abstract summary: "Differentiable Soft-Masked Attention" is used for the task of WeaklySupervised Video Object. We develop a transformer-based network for training, but can also benefit from cycle consistency training on a video with just one annotated frame.
Score: 115.5770357189209
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transformers have become prevalent in computer vision due to their performance and flexibility in modelling complex operations. Of particular significance is the 'cross-attention' operation, which allows a vector representation (e.g. of an object in an image) to be learned by attending to an arbitrarily sized set of input features. Recently, "Masked Attention" was proposed in which a given object representation only attends to those image pixel features for which the segmentation mask of that object is active. This specialization of attention proved beneficial for various image and video segmentation tasks. In this paper, we propose another specialization of attention which enables attending over `soft-masks' (those with continuous mask probabilities instead of binary values), and is also differentiable through these mask probabilities, thus allowing the mask used for attention to be learned within the network without requiring direct loss supervision. This can be useful for several applications. Specifically, we employ our "Differentiable Soft-Masked Attention" for the task of Weakly-Supervised Video Object Segmentation (VOS), where we develop a transformer-based network for VOS which only requires a single annotated image frame for training, but can also benefit from cycle consistency training on a video with just one annotated frame. Although there is no loss for masks in unlabeled frames, the network is still able to segment objects in those frames due to our novel attention formulation.

Related papers

SeeDiff: Off-the-Shelf Seeded Mask Generation from Diffusion Models [6.0870128457015715]
We show that cross-attention alone provides very coarse object localization, which however can provide initial seeds.<n>We also observe that a simple-text-guided synthetic image often has a uniform background, which is easier to find correspondences.<n>Our proposed method, dubbed SeeDiff, generates high-quality masks off-the-shelf from Stable Diffusion.
arXiv Detail & Related papers (2025-07-26T05:44:00Z)
HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model [6.641903410779405]
We propose the Hierarchical Mask Tokenizer (HiMTok), which represents segmentation masks with up to 32 tokens. HiMTok allows for compact and coarse-to-fine mask representations, aligning well with the next-token-prediction paradigm. We develop a 3-stage training recipe for progressive learning of segmentation and visual capabilities, featuring a hierarchical mask loss for effective coarse-to-fine learning.
arXiv Detail & Related papers (2025-03-17T10:29:08Z)
SMITE: Segment Me In TimE [35.56475607621353]
We show how to segment an object in a video by employing a pre-trained text to image diffusion model and an additional tracking mechanism. We demonstrate that our approach can effectively manage various segmentation scenarios and outperforms state-of-the-art alternatives.
arXiv Detail & Related papers (2024-10-24T08:38:20Z)
LAC-Net: Linear-Fusion Attention-Guided Convolutional Network for Accurate Robotic Grasping Under the Occlusion [79.22197702626542]
This paper introduces a framework that explores amodal segmentation for robotic grasping in cluttered scenes. We propose a Linear-fusion Attention-guided Convolutional Network (LAC-Net) The results on different datasets show that our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-08-06T14:50:48Z)
Segment (Almost) Nothing: Prompt-Agnostic Adversarial Attacks on Segmentation Models [61.46999584579775]
General purpose segmentation models are able to generate (semantic) segmentation masks from a variety of prompts. In particular, input images are pre-processed by an image encoder to obtain embedding vectors which are later used for mask predictions. We show that even imperceptible perturbations of radius $epsilon=1/255$ are often sufficient to drastically modify the masks predicted with point, box and text prompts.
arXiv Detail & Related papers (2023-11-24T12:57:34Z)
Siamese Masked Autoencoders [76.35448665609998]
We present Siamese Masked Autoencoders (SiamMAE) for learning visual correspondence from videos. SiamMAE operates on pairs of randomly sampled video frames and asymmetrically masks them. It outperforms state-of-the-art self-supervised methods on video object segmentation, pose keypoint propagation, and semantic part propagation tasks.
arXiv Detail & Related papers (2023-05-23T17:59:46Z)
GANSeg: Learning to Segment by Unsupervised Hierarchical Image Generation [16.900404701997502]
We propose a GAN-based approach that generates images conditioned on latent masks. We show that such mask-conditioned image generation can be learned faithfully when conditioning the masks in a hierarchical manner. It also lets us generate image-mask pairs for training a segmentation network, which outperforms the state-of-the-art unsupervised segmentation methods on established benchmarks.
arXiv Detail & Related papers (2021-12-02T07:57:56Z)
Learning To Segment Dominant Object Motion From Watching Videos [72.57852930273256]
We envision a simple framework for dominant moving object segmentation that neither requires annotated data to train nor relies on saliency priors or pre-trained optical flow maps. Inspired by a layered image representation, we introduce a technique to group pixel regions according to their affine parametric motion. This enables our network to learn segmentation of the dominant foreground object using only RGB image pairs as input for both training and inference.
arXiv Detail & Related papers (2021-11-28T14:51:00Z)
Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling [61.03262873980619]
Open-vocabulary instance segmentation aims at segmenting novel classes without mask annotations. We propose a cross-modal pseudo-labeling framework, which generates training pseudo masks by aligning word semantics in captions with visual features of object masks in images. Our framework is capable of labeling novel classes in captions via their word semantics to self-train a student model.
arXiv Detail & Related papers (2021-11-24T18:50:47Z)
Instance Semantic Segmentation Benefits from Generative Adversarial Networks [13.295723883560122]
We define the problem of predicting masks as a GANs game framework. A segmentation network generates the masks, and a discriminator network decides on the quality of the masks. We report on cellphone recycling, autonomous driving, large-scale object detection, and medical glands.
arXiv Detail & Related papers (2020-10-26T17:47:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.