Cost Aggregation with 4D Convolutional Swin Transformer for Few-Shot Segmentation
- URL: http://arxiv.org/abs/2207.10866v1
- Date: Fri, 22 Jul 2022 04:10:30 GMT
- Title: Cost Aggregation with 4D Convolutional Swin Transformer for Few-Shot Segmentation
- Authors: Sunghwan Hong, Seokju Cho, Jisu Nam, Stephen Lin, Seungryong Kim
- Abstract summary: Volumetric Aggregation with Transformers (VAT) is a cost aggregation network for few-shot segmentation.
VAT attains state-of-the-art performance for semantic correspondence as well, where cost aggregation also plays a central role.
- Score: 58.4650849317274
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a novel cost aggregation network, called Volumetric
Aggregation with Transformers (VAT), for few-shot segmentation. The use of
transformers can benefit correlation map aggregation through self-attention
over a global receptive field. However, the tokenization of a correlation map
for transformer processing can be detrimental, because the discontinuity at
token boundaries reduces the local context available near the token edges and
decreases inductive bias. To address this problem, we propose a 4D
Convolutional Swin Transformer, where a high-dimensional Swin Transformer is
preceded by a series of small-kernel convolutions that impart local context to
all pixels and introduce convolutional inductive bias. We additionally boost
aggregation performance by applying transformers within a pyramidal structure,
where aggregation at a coarser level guides aggregation at a finer level. Noise
in the transformer output is then filtered in the subsequent decoder with the
help of the query's appearance embedding. With this model, a new
state-of-the-art is set for all the standard benchmarks in few-shot
segmentation. It is shown that VAT attains state-of-the-art performance for
semantic correspondence as well, where cost aggregation also plays a central
role.
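To make the described pipeline more concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' released code) of two ingredients from the abstract: building a 4D correlation (cost) volume between query and support features, and applying small-kernel convolutions to that volume to impart local context before attention-based aggregation. The 4D convolution is approximated here by a separable pair of 2D convolutions over the query and support spatial dimensions; the names `correlation_volume` and `Separable4DConv` are illustrative assumptions, and the windowed Swin-style attention stage and pyramidal decoder are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def correlation_volume(query_feat, support_feat):
    """Cosine-similarity 4D cost volume between query and support feature maps.

    query_feat, support_feat: (B, C, H, W) -> returns (B, 1, H, W, H, W).
    """
    B, C, H, W = query_feat.shape
    q = F.normalize(query_feat.flatten(2), dim=1)    # (B, C, H*W)
    s = F.normalize(support_feat.flatten(2), dim=1)  # (B, C, H*W)
    corr = torch.einsum('bcq,bcs->bqs', q, s)        # (B, Hq*Wq, Hs*Ws)
    return corr.view(B, 1, H, W, H, W)               # single-channel 4D volume

class Separable4DConv(nn.Module):
    """Illustrative stand-in for a small-kernel 4D convolution on the cost volume.

    Factorized as one 2D conv over the query spatial dims followed by one over
    the support spatial dims, giving every 4D position local context cheaply.
    """
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv_q = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        self.conv_s = nn.Conv2d(out_ch, out_ch, k, padding=k // 2)

    def forward(self, cost):                          # cost: (B, C, Hq, Wq, Hs, Ws)
        B, C, Hq, Wq, Hs, Ws = cost.shape
        x = cost.permute(0, 4, 5, 1, 2, 3).reshape(B * Hs * Ws, C, Hq, Wq)
        x = self.conv_q(x)                            # convolve over query spatial dims
        Cq = x.shape[1]
        x = x.view(B, Hs, Ws, Cq, Hq, Wq).permute(0, 4, 5, 3, 1, 2)
        x = x.reshape(B * Hq * Wq, Cq, Hs, Ws)
        x = self.conv_s(x)                            # convolve over support spatial dims
        x = x.view(B, Hq, Wq, Cq, Hs, Ws).permute(0, 3, 1, 2, 4, 5)
        return x                                      # (B, out_ch, Hq, Wq, Hs, Ws)

# Toy usage: aggregate a small cost volume before a (omitted) windowed-attention stage.
q_feat = torch.randn(2, 64, 16, 16)
s_feat = torch.randn(2, 64, 16, 16)
cost = correlation_volume(q_feat, s_feat)             # (2, 1, 16, 16, 16, 16)
cost = Separable4DConv(1, 8)(cost)                    # (2, 8, 16, 16, 16, 16)
print(cost.shape)
```

In VAT itself, such convolutions precede a high-dimensional Swin Transformer and the aggregation is repeated coarse-to-fine across a feature pyramid; the sketch above only illustrates the cost-volume construction and the local-context convolution step.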
Related papers
- CT-MVSNet: Efficient Multi-View Stereo with Cross-scale Transformer [8.962657021133925]
Cross-scale transformer (CT) processes feature representations at different stages without additional computation.
We introduce an adaptive matching-aware transformer (AMT) that employs different interactive attention combinations at multiple scales.
We also present a dual-feature guided aggregation (DFGA) that embeds the coarse global semantic information into the finer cost volume construction.
arXiv Detail & Related papers (2023-12-14T01:33:18Z)
- Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation [59.91357714415056]
We propose two Transformer variants: Context-Sharing Transformer (CST) and Semantic Gathering-Scattering Transformer (SGST).
CST learns the global-shared contextual information within image frames with a lightweight computation; SGST models the semantic correlation separately for the foreground and background.
Compared with the baseline that uses vanilla Transformers for multi-stage fusion, ours increases the speed by 13 times and achieves new state-of-the-art ZVOS performance.
arXiv Detail & Related papers (2023-08-13T06:12:00Z) - White-Box Transformers via Sparse Rate Reduction [25.51855431031564]
We show a family of white-box transformer-like deep network architectures which are mathematically fully interpretable.
Experiments show that these networks indeed learn to optimize the designed objective.
arXiv Detail & Related papers (2023-06-01T20:28:44Z) - Segmented Recurrent Transformer: An Efficient Sequence-to-Sequence Model [10.473819332984005]
We propose a segmented recurrent transformer (SRformer) that combines segmented (local) attention with recurrent attention.
The proposed model achieves 6-22% higher ROUGE-1 scores than a segmented transformer and outperforms other recurrent transformer approaches.
arXiv Detail & Related papers (2023-05-24T03:47:22Z)
- Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation [105.22961467028234]
Skip connections and normalisation layers are ubiquitous in the training of Deep Neural Networks (DNNs).
Recent approaches such as Deep Kernel Shaping have made progress towards reducing our reliance on them.
But these approaches are incompatible with the self-attention layers present in transformers.
arXiv Detail & Related papers (2023-02-20T21:26:25Z)
- SSformer: A Lightweight Transformer for Semantic Segmentation [7.787950060560868]
Swin Transformer set a new record in various vision tasks by using hierarchical architecture and shifted windows.
We design a lightweight yet effective transformer model, called SSformer.
Experimental results show the proposed SSformer yields comparable mIoU performance with state-of-the-art models.
arXiv Detail & Related papers (2022-08-03T12:57:00Z)
- Cost Aggregation Is All You Need for Few-Shot Segmentation [28.23753949369226]
We introduce Volumetric Aggregation with Transformers (VAT) to tackle the few-shot segmentation task.
VAT uses both convolutions and transformers to efficiently handle high dimensional correlation maps between query and support.
We find that the proposed method attains state-of-the-art performance even for the standard benchmarks in semantic correspondence task.
arXiv Detail & Related papers (2021-12-22T06:18:51Z)
- nnFormer: Interleaved Transformer for Volumetric Segmentation [50.10441845967601]
We introduce nnFormer, a powerful segmentation model with an interleaved architecture based on empirical combination of self-attention and convolution.
nnFormer achieves substantial improvements over previous transformer-based methods on two commonly used datasets, Synapse and ACDC.
arXiv Detail & Related papers (2021-09-07T17:08:24Z)
- Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT performs 3.8% higher than the vanilla ViT in terms of top-1 accuracy.
arXiv Detail & Related papers (2021-08-03T18:04:31Z)
- Rethinking Global Context in Crowd Counting [70.54184500538338]
A pure transformer is used to extract features with global information from overlapping image patches.
Inspired by classification, we add a context token to the input sequence, to facilitate information exchange with tokens corresponding to image patches.
arXiv Detail & Related papers (2021-05-23T12:44:27Z)