Self-attention on Multi-Shifted Windows for Scene Segmentation
- URL: http://arxiv.org/abs/2207.04403v1
- Date: Sun, 10 Jul 2022 07:36:36 GMT
- Title: Self-attention on Multi-Shifted Windows for Scene Segmentation
- Authors: Litao Yu, Zhibin Li, Jian Zhang, Qiang Wu
- Abstract summary: We explore the effective use of self-attention within multi-scale image windows to learn descriptive visual features.
We propose three different strategies to aggregate these feature maps to decode the feature representation for dense prediction.
Our models achieve very promising performance on four public scene segmentation datasets.
- Score: 14.47974086177051
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene segmentation in images is a fundamental yet challenging problem in
visual content understanding, which is to learn a model to assign every image
pixel to a categorical label. One of the challenges for this learning task is
to consider the spatial and semantic relationships to obtain descriptive
feature representations, so learning the feature maps from multiple scales is a
common practice in scene segmentation. In this paper, we explore the effective
use of self-attention within multi-scale image windows to learn descriptive
visual features, then propose three different strategies to aggregate these
feature maps to decode the feature representation for dense prediction. Our
design is based on the recently proposed Swin Transformer models, which totally
discards convolution operations. With the simple yet effective multi-scale
feature learning and aggregation, our models achieve very promising performance
on four public scene segmentation datasets, PASCAL VOC2012, COCO-Stuff 10K,
ADE20K and Cityscapes.
Related papers
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used for mitigating the greedy needs of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z) - AIMS: All-Inclusive Multi-Level Segmentation [93.5041381700744]
We propose a new task, All-Inclusive Multi-Level (AIMS), which segments visual regions into three levels: part, entity, and relation.
We also build a unified AIMS model through multi-dataset multi-task training to address the two major challenges of annotation inconsistency and task correlation.
arXiv Detail & Related papers (2023-05-28T16:28:49Z) - CSP: Self-Supervised Contrastive Spatial Pre-Training for
Geospatial-Visual Representations [90.50864830038202]
We present Contrastive Spatial Pre-Training (CSP), a self-supervised learning framework for geo-tagged images.
We use a dual-encoder to separately encode the images and their corresponding geo-locations, and use contrastive objectives to learn effective location representations from images.
CSP significantly boosts the model performance with 10-34% relative improvement with various labeled training data sampling ratios.
arXiv Detail & Related papers (2023-05-01T23:11:18Z) - SGMNet: Scene Graph Matching Network for Few-Shot Remote Sensing Scene
Classification [14.016637774748677]
Few-Shot Remote Sensing Scene Classification (FSRSSC) is an important task, which aims to recognize novel scene classes with few examples.
We propose a novel scene graph matching-based meta-learning framework for FSRSSC, called SGMNet.
We conduct extensive experiments on UCMerced LandUse, WHU19, AID, and NWPU-RESISC45 datasets.
arXiv Detail & Related papers (2021-10-09T07:43:40Z) - Scaling Up Visual and Vision-Language Representation Learning With Noisy
Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z) - Learning to Focus: Cascaded Feature Matching Network for Few-shot Image
Recognition [38.49419948988415]
Deep networks can learn to accurately recognize objects of a category by training on a large number of images.
A meta-learning challenge known as a low-shot image recognition task comes when only a few images with annotations are available for learning a recognition model for one category.
Our method, called Cascaded Feature Matching Network (CFMN), is proposed to solve this problem.
Experiments for few-shot learning on two standard datasets, emphminiImageNet and Omniglot, have confirmed the effectiveness of our method.
arXiv Detail & Related papers (2021-01-13T11:37:28Z) - Group-Wise Semantic Mining for Weakly Supervised Semantic Segmentation [49.90178055521207]
This work addresses weakly supervised semantic segmentation (WSSS), with the goal of bridging the gap between image-level annotations and pixel-level segmentation.
We formulate WSSS as a novel group-wise learning task that explicitly models semantic dependencies in a group of images to estimate more reliable pseudo ground-truths.
In particular, we devise a graph neural network (GNN) for group-wise semantic mining, wherein input images are represented as graph nodes.
arXiv Detail & Related papers (2020-12-09T12:40:13Z) - Multi-layer Feature Aggregation for Deep Scene Parsing Models [19.198074549944568]
In this paper, we explore the effective use of multi-layer feature outputs of the deep parsing networks for spatial-semantic consistency.
The proposed module can auto-select the intermediate visual features to correlate the spatial and semantic information.
Experiments on four public scene parsing datasets prove that the deep parsing network equipped with the proposed feature aggregation module can achieve very promising results.
arXiv Detail & Related papers (2020-11-04T23:07:07Z) - Learning to Segment from Scribbles using Multi-scale Adversarial
Attention Gates [16.28285034098361]
Weakly-supervised learning can train models by relying on weaker forms of annotation, such as scribbles.
We train a multi-scale GAN to generate realistic segmentation masks at multiple resolutions, while we use scribbles to learn their correct position in the image.
Central to the model's success is a novel attention gating mechanism, which we condition with adversarial signals to act as a shape prior.
arXiv Detail & Related papers (2020-07-02T14:39:08Z) - Naive-Student: Leveraging Semi-Supervised Learning in Video Sequences
for Urban Scene Segmentation [57.68890534164427]
In this work, we ask if we may leverage semi-supervised learning in unlabeled video sequences and extra images to improve the performance on urban scene segmentation.
We simply predict pseudo-labels for the unlabeled data and train subsequent models with both human-annotated and pseudo-labeled data.
Our Naive-Student model, trained with such simple yet effective iterative semi-supervised learning, attains state-of-the-art results at all three Cityscapes benchmarks.
arXiv Detail & Related papers (2020-05-20T18:00:05Z) - Realizing Pixel-Level Semantic Learning in Complex Driving Scenes based
on Only One Annotated Pixel per Class [17.481116352112682]
We propose a new semantic segmentation task under complex driving scenes based on weakly supervised condition.
A three step process is built for pseudo labels generation, which progressively implement optimal feature representation for each category.
Experiments on Cityscapes dataset demonstrate that the proposed method provides a feasible way to solve weakly supervised semantic segmentation task.
arXiv Detail & Related papers (2020-03-10T12:57:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.