Dynamic Focus-aware Positional Queries for Semantic Segmentation
- URL: http://arxiv.org/abs/2204.01244v3
- Date: Tue, 28 Mar 2023 02:42:17 GMT
- Title: Dynamic Focus-aware Positional Queries for Semantic Segmentation
- Authors: Haoyu He, Jianfei Cai, Zizheng Pan, Jing Liu, Jing Zhang, Dacheng Tao,
Bohan Zhuang
- Abstract summary: We propose a simple yet effective query design for semantic segmentation termed Dynamic Focus-aware Positional Queries.
Our framework achieves SOTA performance and outperforms Mask2former by clear margins of 1.1%, 1.9%, and 1.1% single-scale mIoU with ResNet-50, Swin-T, and Swin-B backbones, respectively.
- Score: 94.6834904076914
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: DETR-like segmentors, which train a set of queries representing
class prototypes or target segments end-to-end, have underpinned the most
recent breakthroughs in semantic segmentation. Recently, masked attention was
proposed to restrict each query to attend only to the foreground regions
predicted by the preceding decoder block, easing optimization. Although
promising, this design relies on learnable parameterized positional queries,
which tend to encode dataset statistics and thus localize distinct individual
queries inaccurately. In this paper, we propose a simple yet effective query
design for semantic segmentation termed Dynamic Focus-aware Positional Queries
(DFPQ), which dynamically generates positional queries conditioned
simultaneously on the cross-attention scores from the preceding decoder block
and the positional encodings of the corresponding image features. DFPQ thus
preserves rich localization information for the target segments and provides
accurate, fine-grained positional priors. In addition, we handle
high-resolution cross-attention efficiently by aggregating only the contextual
tokens selected by the low-resolution cross-attention scores, performing local
relation aggregation. Extensive experiments on ADE20K and Cityscapes show
that, with these two modifications on Mask2former, our framework achieves SOTA
performance and outperforms Mask2former by clear margins of 1.1%, 1.9%, and
1.1% single-scale mIoU with ResNet-50, Swin-T, and Swin-B backbones on the
ADE20K validation set, respectively. Source code is available at
https://github.com/ziplab/FASeg
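To make the query design concrete, here is a minimal PyTorch sketch of the DFPQ idea as stated in the abstract: each query's positional embedding is regenerated at every decoder block by pooling the positional encodings of the image features with the cross-attention scores from the preceding block. Module and variable names, and the final projection, are illustrative assumptions, not the authors' implementation (see the repository above).

```python
# A minimal sketch of DFPQ as described in the abstract; names and the final
# projection are assumptions, not the authors' code (github.com/ziplab/FASeg).
import torch
import torch.nn as nn


class DynamicFocusAwarePositionalQueries(nn.Module):
    def __init__(self, embed_dim: int):
        super().__init__()
        # A small transform on the pooled positional encodings (assumed).
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, attn_scores: torch.Tensor, pos_enc: torch.Tensor) -> torch.Tensor:
        """attn_scores: (B, Q, N) softmax cross-attention scores from the
        preceding decoder block; pos_enc: (B, N, D) positional encodings of
        the corresponding image features; returns (B, Q, D) positional queries."""
        # Pool positional encodings where each query focused in the previous
        # block, so the positional prior tracks the query's target segment.
        pooled = torch.bmm(attn_scores, pos_enc)
        return self.proj(pooled)


B, Q, N, D = 2, 100, 32 * 32, 256
attn = torch.rand(B, Q, N).softmax(dim=-1)  # stand-in decoder attention scores
pos = torch.randn(B, N, D)                  # stand-in positional encodings
dfpq = DynamicFocusAwarePositionalQueries(D)
print(dfpq(attn, pos).shape)                # torch.Size([2, 100, 256])
```

Under the same reading, the second modification would keep high-resolution cross-attention cheap by letting each query attend only to the contextual tokens whose low-resolution cross-attention scores are highest.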
Related papers
- Bridge the Points: Graph-based Few-shot Segment Anything Semantically [79.1519244940518]
Recent advancements in pre-training techniques have enhanced the capabilities of vision foundation models.
Recent studies extend SAM to few-shot semantic segmentation (FSS).
We propose a simple yet effective approach based on graph analysis.
arXiv Detail & Related papers (2024-10-09T15:02:28Z)
- Enhancing DETRs Variants through Improved Content Query and Similar Query Aggregation [27.07277433645018]
We introduce a novel plug-and-play module, Self-Adaptive Content Query (SACQ).
SACQ generates content queries via self-attention pooling.
It allows candidate queries to adapt to the input image, resulting in a more comprehensive content prior and better focus on target objects.
We propose a query aggregation strategy to cooperate with SACQ. It merges similar predicted candidates from different queries, easing the optimization.
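As a rough sketch (not the authors' code) of generating content queries via self-attention pooling: each candidate query pools the encoder's image tokens with its own learned attention weights, so the content prior adapts to the input image. The per-query scoring head is an illustrative assumption.

```python
# Hypothetical sketch of self-attention pooling for content queries.
import torch
import torch.nn as nn


class SelfAdaptiveContentQuery(nn.Module):
    def __init__(self, num_queries: int, embed_dim: int):
        super().__init__()
        # One learned scoring head per query (assumed design).
        self.score = nn.Linear(embed_dim, num_queries)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) flattened encoder features.
        weights = self.score(tokens).softmax(dim=1)  # (B, N, Q), normalized over tokens
        return weights.transpose(1, 2) @ tokens      # (B, Q, D) pooled content queries


tokens = torch.randn(2, 1024, 256)
queries = SelfAdaptiveContentQuery(num_queries=100, embed_dim=256)(tokens)
print(queries.shape)  # torch.Size([2, 100, 256])
```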
arXiv Detail & Related papers (2024-05-06T09:50:04Z)
- Optimized Information Flow for Transformer Tracking [0.7199733380797579]
One-stream Transformer trackers have shown outstanding performance on challenging benchmark datasets.
We propose a novel OIFTrack framework to enhance the discriminative capability of the tracker.
arXiv Detail & Related papers (2024-02-13T03:39:15Z)
- Temporal-aware Hierarchical Mask Classification for Video Semantic Segmentation [62.275143240798236]
Video semantic segmentation datasets contain only a limited number of categories per video.
Fewer than 10% of queries can be matched to receive meaningful gradient updates during VSS training.
Our method achieves state-of-the-art performance on the latest challenging VSS benchmark VSPW without bells and whistles.
arXiv Detail & Related papers (2023-09-14T20:31:06Z)
- Resolution-Aware Design of Atrous Rates for Semantic Segmentation Networks [7.58745191859815]
DeepLab is a widely used deep neural network for semantic segmentation, whose success is attributed to its parallel architecture called atrous spatial pyramid pooling (ASPP).
However, fixed atrous rates are used for the ASPP module, which restricts the size of its field of view.
This study proposes practical guidelines for obtaining an optimal atrous rate.
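For context, a compact ASPP sketch showing the design knob this paper studies: the dilation (atrous) rates of the parallel branches, which set each branch's field of view. The rates below are common defaults, not the paper's recommendations, and the image-pooling branch and normalization of the full DeepLab module are omitted.

```python
# Simplified ASPP; rates are placeholders, not the paper's guidelines.
import torch
import torch.nn as nn


class ASPP(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)]  # 1x1 branch
            + [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )
        self.fuse = nn.Conv2d(out_ch * (len(rates) + 1), out_ch, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each dilated branch sees a different field of view; concatenate and fuse.
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))


x = torch.randn(1, 256, 33, 33)
print(ASPP(256, 256)(x).shape)  # torch.Size([1, 256, 33, 33])
```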
arXiv Detail & Related papers (2023-07-26T13:11:48Z)
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features visible to the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
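A toy sketch of the pretext task as summarized above: reference patch features (partially masked to control difficulty) and one query patch are encoded together, and a small head predicts the query's relative grid location. The shapes, encoder depth, and regression loss are illustrative assumptions.

```python
# Toy relative-location pretext task; all specifics here are assumptions.
import torch
import torch.nn as nn

D, N = 256, 49                              # feature dim, 7x7 reference grid
ref = torch.randn(1, N, D)                  # reference patch features
mask = (torch.rand(1, N, 1) > 0.5).float()  # hide half the references (difficulty knob)
query = torch.randn(1, 1, D)                # the query patch feature

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True), num_layers=2
)
loc_head = nn.Linear(D, 2)                  # predict the query's (row, col)

tokens = torch.cat([ref * mask, query], dim=1)  # masked references + query
pred_xy = loc_head(encoder(tokens)[:, -1])      # read location off the query token
target_xy = torch.tensor([[3.0, 4.0]])          # ground-truth grid position (dummy)
loss = nn.functional.mse_loss(pred_xy, target_xy)
loss.backward()
print(pred_xy.shape, float(loss))
```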
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
- Progressively Dual Prior Guided Few-shot Semantic Segmentation [57.37506990980975]
The few-shot semantic segmentation task aims to segment query images given only a few annotated support samples.
We propose a progressively dual prior guided few-shot semantic segmentation network.
arXiv Detail & Related papers (2022-11-20T16:19:47Z)
- IoU-Enhanced Attention for End-to-End Task Specific Object Detection [17.617133414432836]
Sparse R-CNN achieves promising results without densely tiled anchor boxes or grid points in the image.
Due to its sparse nature and the one-to-one relation between each query and its attending region, it depends heavily on self-attention.
This paper proposes to use the IoU between different boxes as a prior for the value routing in self-attention.
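A minimal sketch of that idea: the pairwise IoU matrix biases the self-attention logits so each query routes values mainly among queries whose boxes overlap with its own. Injecting the prior as an additive log-bias is an assumption for illustration.

```python
# IoU-as-attention-prior sketch; the additive log-bias form is assumed.
import torch


def box_iou(boxes: torch.Tensor) -> torch.Tensor:
    # boxes: (Q, 4) as (x1, y1, x2, y2); returns the (Q, Q) pairwise IoU.
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    lt = torch.max(boxes[:, None, :2], boxes[None, :, :2])  # intersection top-left
    rb = torch.min(boxes[:, None, 2:], boxes[None, :, 2:])  # intersection bottom-right
    inter = (rb - lt).clamp(min=0).prod(dim=-1)
    return inter / (area[:, None] + area[None, :] - inter + 1e-6)


def iou_biased_attention(q, k, v, boxes):
    # q, k, v: (Q, D) per-query features; boxes: (Q, 4) current box predictions.
    logits = q @ k.T / q.shape[-1] ** 0.5
    logits = logits + torch.log(box_iou(boxes) + 1e-6)  # IoU routing prior
    return logits.softmax(dim=-1) @ v


Q, D = 8, 64
q = k = v = torch.randn(Q, D)
xy = torch.rand(Q, 2) * 50
boxes = torch.cat([xy, xy + 10.0], dim=1)  # valid (x1, y1, x2, y2) boxes
print(iou_biased_attention(q, k, v, boxes).shape)  # torch.Size([8, 64])
```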
arXiv Detail & Related papers (2022-09-21T14:36:18Z)
- Dynamic Prototype Convolution Network for Few-Shot Semantic Segmentation [33.93192093090601]
A key challenge for few-shot semantic segmentation (FSS) is how to tailor a desirable interaction between support and query features.
We propose a dynamic prototype convolution network (DPCN) to fully capture the intrinsic details for accurate FSS.
Our DPCN is also flexible and efficient under the k-shot FSS setting.
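A loose sketch of what "dynamic prototype convolution" suggests: kernels generated from a support prototype are convolved over the query feature map, tailoring the support-query interaction per episode. The prototype pooling, kernel shape, and omission of the support mask are all guesses for illustration.

```python
# Hypothetical dynamic-prototype-convolution sketch; details are assumed.
import torch
import torch.nn.functional as F

B, C, H, W = 1, 64, 32, 32
support = torch.randn(B, C, H, W)   # support features (foreground mask omitted)
query = torch.randn(B, C, H, W)     # query features

# Prototype: global average of the support features.
proto = support.mean(dim=(2, 3))    # (B, C)

# Expand the prototype into one depthwise 3x3 kernel per channel.
kernel = proto.view(B * C, 1, 1, 1).expand(B * C, 1, 3, 3).contiguous()
out = F.conv2d(query.view(1, B * C, H, W), kernel, padding=1, groups=B * C)
print(out.view(B, C, H, W).shape)   # torch.Size([1, 64, 32, 32])
```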
arXiv Detail & Related papers (2022-04-22T11:12:37Z)
- Target-Aware Object Discovery and Association for Unsupervised Video Multi-Object Segmentation [79.6596425920849]
This paper addresses the task of unsupervised video multi-object segmentation.
We introduce a novel approach for more accurate and efficient spatio-temporal segmentation.
We evaluate the proposed approach on DAVIS-17 and YouTube-VIS, and the results demonstrate that it outperforms state-of-the-art methods in both segmentation accuracy and inference speed.
arXiv Detail & Related papers (2021-04-10T14:39:44Z)