Guided Patch-Grouping Wavelet Transformer with Spatial Congruence for
Ultra-High Resolution Segmentation
- URL: http://arxiv.org/abs/2307.00711v2
- Date: Thu, 6 Jul 2023 02:54:16 GMT
- Title: Guided Patch-Grouping Wavelet Transformer with Spatial Congruence for
Ultra-High Resolution Segmentation
- Authors: Deyi Ji, Feng Zhao, Hongtao Lu
- Abstract summary: Proposed Guided Patch-Grouping Wavelet Transformer (GPWFormer)
$\mathcal{T}$ takes the whole UHR image as input and harvests both local details and fine-grained long-range contextual dependencies.
$\mathcal{C}$ takes the downsampled image as input for learning the category-wise deep context.
- Score: 18.50799240622156
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most existing ultra-high resolution (UHR) segmentation methods
struggle to balance memory cost against local characterization accuracy; both
are taken into account in our proposed Guided Patch-Grouping Wavelet
Transformer (GPWFormer), which achieves impressive performance. In this work,
GPWFormer is a Transformer ($\mathcal{T}$)-CNN ($\mathcal{C}$) mutual learning
framework, where $\mathcal{T}$ takes the whole
UHR image as input and harvests both local details and fine-grained long-range
contextual dependencies, while $\mathcal{C}$ takes the downsampled image as input
for learning the category-wise deep context. For the sake of high inference
speed and low computational complexity, $\mathcal{T}$ partitions the original UHR
image into patches and groups them dynamically, then learns the low-level local
details with the lightweight multi-head Wavelet Transformer (WFormer) network.
Meanwhile, the fine-grained long-range contextual dependencies are also
captured during this process, since patches that are far away in the spatial
domain can also be assigned to the same group. In addition, masks produced by
$\mathcal{C}$ are utilized to guide the patch grouping process, providing a
heuristic decision. Moreover, the congruence constraints between the two
branches are also exploited to maintain the spatial consistency among the
patches. Overall, we stack the multi-stage process in a pyramidal manner.
Experiments show that GPWFormer outperforms the existing methods with
significant improvements on five benchmark datasets.
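The mask-guided patch grouping step can be illustrated with a minimal NumPy sketch (all function names here are hypothetical illustrations, not the authors' implementation): coarse masks from $\mathcal{C}$ assign each patch a dominant category, and patches sharing a category fall into one group even when they are spatially distant, which is how a group can carry long-range context.

```python
import numpy as np

def partition_patches(image, patch=4):
    """Split an (H, W) map into non-overlapping patch x patch tiles."""
    H, W = image.shape
    tiles = image.reshape(H // patch, patch, W // patch, patch).swapaxes(1, 2)
    return tiles.reshape(-1, patch, patch)  # (num_patches, patch, patch)

def group_by_mask(mask, patch=4):
    """Group patch indices by the dominant class in a coarse mask.

    Patches far apart in the spatial domain but sharing a dominant class
    are assigned to the same group.
    """
    groups = {}
    for idx, tile in enumerate(partition_patches(mask, patch)):
        dominant = int(np.bincount(tile.ravel()).argmax())
        groups.setdefault(dominant, []).append(idx)
    return groups

# Toy coarse mask: left half class 0, right half class 1.
mask = np.zeros((8, 8), dtype=int)
mask[:, 4:] = 1
groups = group_by_mask(mask, patch=4)
# Top-left and bottom-left patches (indices 0, 2) share class 0;
# the two right-hand patches (indices 1, 3) share class 1.
```

In the paper, each such group would then be processed by the lightweight WFormer network; the sketch above only shows the grouping heuristic.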
Related papers
- RefineStyle: Dynamic Convolution Refinement for StyleGAN [15.230430037135017]
In StyleGAN, convolution kernels are shaped by both static parameters shared across images and dynamic modulation factors specific to each image.
$\mathcal{W}+$ space is often used for image inversion and editing.
This paper proposes an efficient refining strategy for dynamic kernels.
arXiv Detail & Related papers (2024-10-08T15:01:30Z) - Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - The Need for Speed: Pruning Transformers with One Recipe [18.26707877972931]
OPTIN is a tool to increase the efficiency of pre-trained transformer architectures without re-training.
It produces state-of-the-art results on natural language, image classification, transfer learning, and semantic segmentation tasks.
We show a $\leq 2\%$ accuracy degradation from NLP baselines and a $0.5\%$ improvement over state-of-the-art methods on image classification at competitive FLOPs reductions.
arXiv Detail & Related papers (2024-03-26T17:55:58Z) - Chain of Thought Empowers Transformers to Solve Inherently Serial Problems [57.58801785642868]
Chain of thought (CoT) is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetics and symbolic reasoning tasks.
This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness.
arXiv Detail & Related papers (2024-02-20T10:11:03Z) - p-Laplacian Transformer [7.2541371193810384]
$p$-Laplacian regularization, rooted in graph and image signal processing, introduces a parameter $p$ to control the regularization effect on these data.
We first show that the self-attention mechanism obtains the minimal Laplacian regularization.
We then propose a novel class of transformers, namely the $p$-Laplacian Transformer (p-LaT).
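The $p$-Dirichlet energy underlying $p$-Laplacian regularization can be sketched with a toy gradient step on a small graph (a generic illustration, not the p-LaT architecture; `p_laplacian_step` is a hypothetical helper). Setting $p=2$ recovers ordinary Laplacian smoothing, the setting the paper links to standard self-attention.

```python
import numpy as np

def p_laplacian_step(x, W, p=2.0, lr=0.1, eps=1e-8):
    """One gradient step on the p-Dirichlet energy
    E_p(x) = (1/2) * sum_ij W_ij |x_i - x_j|^p over a weighted graph.
    eps guards the |diff|^(p-2) factor when p < 2 and diff == 0."""
    diff = x[:, None] - x[None, :]                       # pairwise differences
    grad = (W * (np.abs(diff) + eps) ** (p - 2) * diff).sum(axis=1)
    return x - lr * grad

# Three nodes on a path graph carrying a "noisy" signal.
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
x = np.array([0.0, 1.0, 0.0])
x_smooth = p_laplacian_step(x, W, p=2.0)  # pulls the middle node down
```

One step strictly decreases the energy, i.e. it smooths the signal along graph edges; the parameter $p$ controls how strongly large differences are penalized.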
arXiv Detail & Related papers (2023-11-06T16:25:56Z) - Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z) - Recasting Self-Attention with Holographic Reduced Representations [31.89878931813593]
Motivated by problems in malware detection, we recast self-attention using the neuro-symbolic approach of Holographic Reduced Representations (HRR).
We obtain several benefits, including $\mathcal{O}(T H \log H)$ time complexity, $\mathcal{O}(T H)$ space complexity, and convergence in $10\times$ fewer epochs.
Our Hrrformer achieves near state-of-the-art accuracy on LRA benchmarks and we are able to learn with just a single layer.
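The HRR binding/unbinding primitive behind this approach can be sketched with FFT-based circular convolution, which is where the $\mathcal{O}(T H \log H)$ cost comes from (a generic HRR demo under the usual Gaussian-vector assumption, not the Hrrformer code itself):

```python
import numpy as np

rng = np.random.default_rng(0)

def bind(a, b):
    """HRR binding: circular convolution via FFT, O(H log H)."""
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=len(a))

def unbind(c, a):
    """Approximate inverse of bind: circular correlation with a."""
    return np.fft.irfft(np.fft.rfft(c) * np.fft.rfft(a).conj(), n=len(c))

H = 1024
# HRR vectors are drawn i.i.d. N(0, 1/H) so their norms concentrate near 1.
key, value = rng.normal(0, 1 / np.sqrt(H), (2, H))
trace = bind(key, value)          # superposable key-value binding
recovered = unbind(trace, key)    # noisy reconstruction of value

similarity = recovered @ value / (
    np.linalg.norm(recovered) * np.linalg.norm(value))
```

Retrieval is approximate: `recovered` is `value` filtered by the key's power spectrum, so the cosine similarity is high but below 1, and cleanup against a codebook is typically needed in practice.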
arXiv Detail & Related papers (2023-05-31T03:42:38Z) - Generalization Bounds for Stochastic Gradient Descent via Localized
$\varepsilon$-Covers [16.618918548497223]
We propose a new covering technique localized for the trajectories of SGD.
This localization provides an algorithm-specific complexity measured by the covering number.
We derive these results in various contexts and improve the known state-of-the-art label rates.
arXiv Detail & Related papers (2022-09-19T12:11:07Z) - Cost Aggregation with 4D Convolutional Swin Transformer for Few-Shot
Segmentation [58.4650849317274]
Volumetric Aggregation with Transformers (VAT) is a cost aggregation network for few-shot segmentation.
VAT attains state-of-the-art performance for semantic correspondence as well, where cost aggregation also plays a central role.
arXiv Detail & Related papers (2022-07-22T04:10:30Z) - Dense Gaussian Processes for Few-Shot Segmentation [66.08463078545306]
We propose a few-shot segmentation method based on dense Gaussian process (GP) regression.
We exploit the end-to-end learning capabilities of our approach to learn a high-dimensional output space for the GP.
Our approach sets a new state-of-the-art for both 1-shot and 5-shot FSS on the PASCAL-$5^i$ and COCO-$20^i$ benchmarks.
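The core of the dense GP formulation, regressing from support features to query features, follows the standard GP posterior-mean formula, sketched below with a toy RBF kernel (this is generic GP regression, not the paper's learned high-dimensional output space; names are illustrative):

```python
import numpy as np

def rbf(A, B, ls=1.0):
    """Squared-exponential kernel between two feature sets."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def gp_posterior_mean(X_s, y_s, X_q, noise=1e-2):
    """GP regression: mean = K_qs (K_ss + noise * I)^{-1} y_s."""
    K_ss = rbf(X_s, X_s) + noise * np.eye(len(X_s))
    K_qs = rbf(X_q, X_s)
    return K_qs @ np.linalg.solve(K_ss, y_s)

rng = np.random.default_rng(0)
X_s = rng.normal(size=(50, 3))           # support "pixel" features
y_s = np.sin(X_s[:, 0])                  # toy scalar target per pixel
mean = gp_posterior_mean(X_s, y_s, X_s)  # predict back on support points
```

In the few-shot setting the support set plays the role of the training data and each query image's features are the test inputs; the learned encoder decides what `X_s` and `y_s` actually contain.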
arXiv Detail & Related papers (2021-10-07T17:57:54Z) - Region adaptive graph fourier transform for 3d point clouds [51.193111325231165]
We introduce the Region Adaptive Graph Fourier Transform (RA-GFT) for compression of 3D point cloud attributes.
The RA-GFT achieves better complexity-performance trade-offs than previous approaches.
arXiv Detail & Related papers (2020-03-04T02:47:44Z)
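The plain, region-agnostic graph Fourier transform that the RA-GFT builds on can be sketched via the eigenbasis of the graph Laplacian (the RA-GFT adds region-adaptive block structure on top; this sketch is only the base transform):

```python
import numpy as np

def graph_fourier_basis(W):
    """Eigenvectors of the combinatorial Laplacian L = D - W form the
    graph Fourier basis; eigenvalues act as graph frequencies."""
    L = np.diag(W.sum(axis=1)) - W
    eigvals, eigvecs = np.linalg.eigh(L)   # ascending eigenvalue order
    return eigvals, eigvecs

def gft(x, U):
    return U.T @ x    # forward graph Fourier transform

def igft(xh, U):
    return U @ xh     # inverse transform

# A 4-node path graph and a constant signal on it.
W = np.zeros((4, 4))
for i in range(3):
    W[i, i + 1] = W[i + 1, i] = 1.0
x = np.ones(4)

eigvals, U = graph_fourier_basis(W)
xh = gft(x, U)  # energy concentrates in the zero-frequency coefficient
```

For smooth attributes on a point cloud, most energy lands in the low-frequency coefficients, which is what makes Laplacian-based transforms useful for compression.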
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.