AttEntropy: Segmenting Unknown Objects in Complex Scenes using the
Spatial Attention Entropy of Semantic Segmentation Transformers
- URL: http://arxiv.org/abs/2212.14397v1
- Date: Thu, 29 Dec 2022 18:07:56 GMT
- Title: AttEntropy: Segmenting Unknown Objects in Complex Scenes using the
Spatial Attention Entropy of Semantic Segmentation Transformers
- Authors: Krzysztof Lis, Matthias Rottmann, Sina Honari, Pascal Fua, Mathieu
Salzmann
- Abstract summary: We study the spatial attentions of different backbone layers of semantic segmentation transformers.
We exploit this by extracting heatmaps that can be used to segment unknown objects within diverse backgrounds.
Our method is training-free and its computational overhead is negligible.
- Score: 99.22536338338011
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformers have emerged as powerful tools for many computer vision
tasks. It has been shown that their features and class tokens can be used for
salient object segmentation. However, the properties of segmentation
transformers remain largely unstudied. In this work we conduct an in-depth
study of the spatial attentions of different backbone layers of semantic
segmentation transformers and uncover interesting properties.
The spatial attention of a patch that intersects an object tends to
concentrate within that object, whereas the attention of larger, more uniform
image areas follows a diffusive behavior. Notably, this concentration is not
limited to the classes seen during training: vision transformers trained to
segment a fixed set of object classes generalize to objects well beyond this
set. We exploit this by extracting heatmaps that can be used to segment
unknown objects within diverse backgrounds, such as obstacles in traffic
scenes.
Our method is training-free and its computational overhead is negligible. We use
off-the-shelf transformers trained for street-scene segmentation to process
other scene types.
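A minimal sketch of the kind of entropy heatmap the abstract describes: for each patch, compute the Shannon entropy of its spatial attention distribution, so patches whose attention concentrates (which the paper associates with objects) score low and diffuse background patches score high. The layer selection, normalization, and multi-layer aggregation of the actual method are not reproduced here; all names are illustrative.

```python
import torch

def attention_entropy(attn):
    """Per-patch Shannon entropy of spatial attention.

    attn: (heads, N, N) post-softmax attention of one backbone layer,
          where row i is patch i's distribution over all N patches.
    Returns an (N,) map; low entropy = concentrated attention, which
    the paper links to patches lying on objects, known or unknown.
    """
    p = attn.clamp_min(1e-8)              # guard against log(0)
    ent = -(p * p.log()).sum(dim=-1)      # (heads, N)
    return ent.mean(dim=0)                # average over heads

# Toy usage on a 32x32 patch grid with random attention.
heads, n = 8, 32 * 32
attn = torch.softmax(torch.randn(heads, n, n), dim=-1)
heatmap = attention_entropy(attn).view(32, 32)   # reshape to the patch grid
```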
Related papers
- Self-supervised Object-Centric Learning for Videos [39.02148880719576]
We propose the first fully unsupervised method for segmenting multiple objects in real-world sequences.
Our object-centric learning framework spatially binds objects to slots on each frame and then relates these slots across frames.
Our method can successfully segment multiple instances of complex and high-variety classes in YouTube videos.
arXiv Detail & Related papers (2023-10-10T18:03:41Z)
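The slot binding mentioned above is typically realized with a slot-attention-style update, where slots compete for input features. The toy iteration below (after Locatello et al.'s slot attention, heavily simplified: no learned projections, MLP, or GRU update) illustrates the core mechanism; it is not this paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def bind_slots(inputs, slots, iters=3):
    """Simplified slot-attention binding: slots compete for features.

    inputs: (B, N, D) per-location features of one frame.
    slots:  (B, K, D) initial slot vectors.
    """
    d = inputs.shape[-1]
    for _ in range(iters):
        logits = slots @ inputs.transpose(1, 2) / d ** 0.5  # (B, K, N)
        attn = F.softmax(logits, dim=1)           # softmax over slots: competition
        attn = attn / attn.sum(dim=-1, keepdim=True)
        slots = attn @ inputs                     # weighted mean of claimed features
    return slots

slots = bind_slots(torch.randn(2, 1024, 64), torch.randn(2, 7, 64))  # (2, 7, 64)
```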
- Optical Flow boosts Unsupervised Localization and Segmentation [22.625511865323183]
We propose a new loss term that uses optical flow in unlabeled videos to encourage the self-supervised ViT features of corresponding regions to move closer together.
We use the proposed loss function to finetune vision transformers that were originally trained on static images.
arXiv Detail & Related papers (2023-07-25T16:45:35Z)
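A hedged sketch of such a flow-based consistency term: features at positions that optical flow puts in correspondence are pulled together. This illustrates the idea rather than the paper's exact loss; the flow convention (channel 0 = x displacement in pixels) is an assumption.

```python
import torch
import torch.nn.functional as F

def flow_consistency_loss(feat_t, feat_t1, flow):
    """Pull features of flow-corresponding pixels closer.

    feat_t, feat_t1: (B, C, H, W) feature maps of frames t and t+1.
    flow: (B, 2, H, W) forward flow in pixels, channel 0 = x, 1 = y.
    """
    b, _, h, w = flow.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(flow)   # (2, H, W) pixel coords
    tgt = grid.unsqueeze(0) + flow                          # landing position in t+1
    # Normalize to [-1, 1]; grid_sample expects (B, H, W, 2) as (x, y).
    sample = torch.stack((2.0 * tgt[:, 0] / (w - 1) - 1.0,
                          2.0 * tgt[:, 1] / (h - 1) - 1.0), dim=-1)
    warped = F.grid_sample(feat_t1, sample, align_corners=True)
    # Cosine distance between features that the flow puts in correspondence.
    return 1.0 - F.cosine_similarity(feat_t, warped, dim=1).mean()

loss = flow_consistency_loss(torch.randn(2, 384, 24, 24),
                             torch.randn(2, 384, 24, 24),
                             torch.randn(2, 2, 24, 24))
```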
- Transformer-Based Visual Segmentation: A Survey [118.01564082499948]
Visual segmentation seeks to partition images, video frames, or point clouds into multiple segments or groups.
Transformers are a type of neural network based on self-attention originally designed for natural language processing.
Transformers offer robust, unified, and even simpler solutions for various segmentation tasks.
arXiv Detail & Related papers (2023-04-19T17:59:02Z)
- Learning Explicit Object-Centric Representations with Vision Transformers [81.38804205212425]
We build on the self-supervision task of masked autoencoding and explore its effectiveness for learning object-centric representations with transformers.
We show that the model efficiently learns to decompose simple scenes as measured by segmentation metrics on several multi-object benchmarks.
arXiv Detail & Related papers (2022-10-25T16:39:49Z)
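The masked-autoencoding pretext task referenced above boils down to hiding most patches and reconstructing them. Below is a minimal sketch of MAE-style random masking (per-sample random permutation); the object-centric decoding of the paper itself is not shown.

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """MAE-style random patch masking.

    patches: (B, N, D) patch embeddings.
    Returns the visible subset and a boolean mask
    (True = hidden, i.e. to be reconstructed).
    """
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))
    keep = torch.rand(b, n).argsort(dim=1)[:, :n_keep]    # random visible indices
    visible = torch.gather(patches, 1,
                           keep.unsqueeze(-1).expand(-1, -1, d))
    mask = torch.ones(b, n, dtype=torch.bool)
    mask.scatter_(1, keep, False)                          # visible patches unmasked
    return visible, mask

visible, mask = random_masking(torch.randn(4, 196, 768))   # keep 49 of 196 patches
```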
- Deep Spectral Methods: A Surprisingly Strong Baseline for Unsupervised Semantic Segmentation and Localization [98.46318529630109]
We take inspiration from traditional spectral segmentation methods by reframing image decomposition as a graph partitioning problem.
We find that the eigenvectors of the resulting graph Laplacian already decompose an image into meaningful segments, and can be readily used to localize objects in a scene.
By clustering the features associated with these segments across a dataset, we can obtain well-delineated, nameable regions.
arXiv Detail & Related papers (2022-05-16T17:47:44Z)
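A toy version of this recipe, assuming dense ViT features as graph nodes: build a cosine affinity matrix, form the graph Laplacian, and read segments off the low-frequency eigenvectors. The normalization details and the color/positional affinities of the paper are omitted.

```python
import torch
import torch.nn.functional as F

def spectral_segments(feats, k=3):
    """Soft segments from Laplacian eigenvectors of a feature graph.

    feats: (N, C) dense features, one row per patch.
    Returns (N, k) eigenvectors; thresholding the second one
    (the Fiedler vector) yields a foreground/background split.
    """
    f = F.normalize(feats, dim=1)
    w = (f @ f.t()).clamp_min(0)           # cosine affinity, negative edges dropped
    lap = torch.diag(w.sum(dim=1)) - w     # unnormalized graph Laplacian
    evals, evecs = torch.linalg.eigh(lap)  # eigenvalues in ascending order
    return evecs[:, :k]

feats = torch.randn(32 * 32, 384)          # e.g. ViT-S patch features
evecs = spectral_segments(feats)
fg = evecs[:, 1] > evecs[:, 1].median()    # crude two-way partition
```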
- Deep ViT Features as Dense Visual Descriptors [12.83702462166513]
We leverage deep features extracted from a pre-trained Vision Transformer (ViT) as dense visual descriptors.
These descriptors facilitate a variety of applications, including co-segmentation, part co-segmentation and correspondences.
arXiv Detail & Related papers (2021-12-10T20:15:03Z)
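One of the listed applications, correspondences, reduces to nearest-neighbor matching of the dense descriptors. The sketch below assumes per-patch descriptors for both images have already been extracted (the paper uses keys of late attention layers) and keeps only mutual nearest neighbors.

```python
import torch
import torch.nn.functional as F

def mutual_nn_correspondences(desc_a, desc_b):
    """Mutual-nearest-neighbor matches under cosine similarity.

    desc_a: (Na, C), desc_b: (Nb, C) per-patch descriptors.
    Returns (M, 2) index pairs (i, j) with desc_a[i] <-> desc_b[j].
    """
    a = F.normalize(desc_a, dim=1)
    b = F.normalize(desc_b, dim=1)
    sim = a @ b.t()                        # (Na, Nb) cosine similarities
    nn_ab = sim.argmax(dim=1)              # best match in b for each a
    nn_ba = sim.argmax(dim=0)              # best match in a for each b
    i = torch.arange(a.shape[0])
    mutual = nn_ba[nn_ab] == i             # a -> b -> a round trip agrees
    return torch.stack((i[mutual], nn_ab[mutual]), dim=1)

pairs = mutual_nn_correspondences(torch.randn(784, 384), torch.randn(784, 384))
```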
- Segmenter: Transformer for Semantic Segmentation [79.9887988699159]
We introduce Segmenter, a transformer model for semantic segmentation.
We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation.
It outperforms the state of the art on the challenging ADE20K dataset and performs on par with it on Pascal Context and Cityscapes.
arXiv Detail & Related papers (2021-05-12T13:01:44Z)
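Segmenter's decoding step can be caricatured as scoring every patch token against a per-class embedding. The sketch below shows this linear flavor; the paper's stronger variant is a mask transformer that processes class embeddings jointly with patch tokens, which is not reproduced here.

```python
import torch

def decode_masks(patch_tokens, class_embeddings, grid_hw):
    """Linear Segmenter-style mask decoding.

    patch_tokens: (B, N, C) encoder outputs, N = H*W patches.
    class_embeddings: (K, C), one learned embedding per class.
    Returns per-class mask logits of shape (B, K, H, W).
    """
    h, w = grid_hw
    k = class_embeddings.shape[0]
    logits = patch_tokens @ class_embeddings.t()      # (B, N, K)
    return logits.transpose(1, 2).reshape(-1, k, h, w)

tokens = torch.randn(2, 32 * 32, 192)                 # toy ViT-Tiny-sized tokens
cls_emb = torch.randn(19, 192)                        # e.g. 19 Cityscapes classes
masks = decode_masks(tokens, cls_emb, (32, 32))       # (2, 19, 32, 32)
```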
- DyStaB: Unsupervised Object Segmentation via Dynamic-Static Bootstrapping [72.84991726271024]
We describe an unsupervised method to detect and segment portions of images of live scenes that move as a coherent whole.
Our method first partitions the motion field by minimizing the mutual information between segments.
It uses the segments to learn object models that can be used for detection in a static image.
arXiv Detail & Related papers (2020-08-16T22:05:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.