AttEntropy: Segmenting Unknown Objects in Complex Scenes using the
Spatial Attention Entropy of Semantic Segmentation Transformers
- URL: http://arxiv.org/abs/2212.14397v1
- Date: Thu, 29 Dec 2022 18:07:56 GMT
- Title: AttEntropy: Segmenting Unknown Objects in Complex Scenes using the
Spatial Attention Entropy of Semantic Segmentation Transformers
- Authors: Krzysztof Lis, Matthias Rottmann, Sina Honari, Pascal Fua, Mathieu
Salzmann
- Abstract summary: We study the spatial attentions of different backbone layers of semantic segmentation transformers.
We exploit this by extracting heatmaps that can be used to segment unknown objects within diverse backgrounds.
Our method is training-free and its computational overhead negligible.
- Score: 99.22536338338011
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformers have emerged as powerful tools for many computer vision
tasks. It has been shown that their features and class tokens can be used for
salient object segmentation. However, the properties of segmentation
transformers remain largely unstudied. In this work we conduct an in-depth
study of the spatial attentions of different backbone layers of semantic
segmentation transformers and uncover interesting properties.
The spatial attentions of a patch intersecting with an object tend to
concentrate within the object, whereas the attentions of larger, more uniform
image areas rather follow a diffusive behavior. In other words, vision
transformers trained to segment a fixed set of object classes generalize to
objects well beyond this set. We exploit this by extracting heatmaps that can
be used to segment unknown objects within diverse backgrounds, such as
obstacles in traffic scenes.
Our method is training-free and its computational overhead negligible. We use
off-the-shelf transformers trained for street-scene segmentation to process
other scene types.
Related papers
- Open-World Semantic Segmentation Including Class Similarity [31.799000996671975]
This paper tackles open-world semantic segmentation, i.e., the variant of interpreting image data in which objects occur that have not been seen during training.
We propose a novel approach that performs accurate closed-world semantic segmentation and can identify new categories without requiring any additional training data.
arXiv Detail & Related papers (2024-03-12T11:11:19Z) - Optical Flow boosts Unsupervised Localization and Segmentation [22.625511865323183]
We propose a new loss term formulation that uses optical flow in unlabeled videos to encourage self-supervised ViT features to become closer to each other.
We use the proposed loss function to finetune vision transformers that were originally trained on static images.
arXiv Detail & Related papers (2023-07-25T16:45:35Z) - ExpPoint-MAE: Better interpretability and performance for self-supervised point cloud transformers [7.725095281624494]
We evaluate the effectiveness of Masked Autoencoding as a pretraining scheme, and explore Momentum Contrast as an alternative.
We observe that the transformer learns to attend to semantically meaningful regions, indicating that pretraining leads to a better understanding of the underlying geometry.
arXiv Detail & Related papers (2023-06-19T09:38:21Z) - Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features visible to those of the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z) - Learning Explicit Object-Centric Representations with Vision
Transformers [81.38804205212425]
We build on the self-supervision task of masked autoencoding and explore its effectiveness for learning object-centric representations with transformers.
We show that the model efficiently learns to decompose simple scenes as measured by segmentation metrics on several multi-object benchmarks.
arXiv Detail & Related papers (2022-10-25T16:39:49Z) - Position Prediction as an Effective Pretraining Strategy [20.925906203643883]
We propose a novel, but surprisingly simple alternative to content reconstruction-- that of predicting locations from content, without providing positional information for it.
Our approach brings improvements over strong supervised training baselines and is comparable to modern unsupervised/self-supervised pretraining methods.
arXiv Detail & Related papers (2022-07-15T17:10:48Z) - Learning To Segment Dominant Object Motion From Watching Videos [72.57852930273256]
We envision a simple framework for dominant moving object segmentation that neither requires annotated data to train nor relies on saliency priors or pre-trained optical flow maps.
Inspired by a layered image representation, we introduce a technique to group pixel regions according to their affine parametric motion.
This enables our network to learn segmentation of the dominant foreground object using only RGB image pairs as input for both training and inference.
arXiv Detail & Related papers (2021-11-28T14:51:00Z) - Segmenter: Transformer for Semantic Segmentation [79.9887988699159]
We introduce Segmenter, a transformer model for semantic segmentation.
We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation.
It outperforms the state of the art on the challenging ADE20K dataset and performs on-par on Pascal Context and Cityscapes.
arXiv Detail & Related papers (2021-05-12T13:01:44Z) - CrossTransformers: spatially-aware few-shot transfer [92.33252608837947]
Given new tasks with very little data, modern vision systems degrade remarkably quickly.
We show how the neural network representations which underpin modern vision systems are subject to supervision collapse.
We propose self-supervised learning to encourage general-purpose features that transfer better.
arXiv Detail & Related papers (2020-07-22T15:37:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.