Brain-Inspired Stepwise Patch Merging for Vision Transformers
- URL: http://arxiv.org/abs/2409.06963v2
- Date: Sat, 14 Jun 2025 12:11:04 GMT
- Title: Brain-Inspired Stepwise Patch Merging for Vision Transformers
- Authors: Yonghao Yu, Dongcheng Zhao, Guobin Shen, Yiting Dong, Yi Zeng,
- Abstract summary: We propose Stepwise Patch Merging (SPM), which enhances the subsequent attention mechanism's ability to'see' better.<n>The code has been released at https://github.com/Yonghao-Yu/StepwisePatchMerging.
- Score: 6.108377966393714
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The hierarchical architecture has become a mainstream design paradigm for Vision Transformers (ViTs), with Patch Merging serving as the pivotal component that transforms a columnar architecture into a hierarchical one. Drawing inspiration from the brain's ability to integrate global and local information for comprehensive visual understanding, we propose Stepwise Patch Merging (SPM), which enhances the subsequent attention mechanism's ability to 'see' better. SPM consists of Multi-Scale Aggregation (MSA) and Guided Local Enhancement (GLE) striking a proper balance between long-range dependency modeling and local feature enhancement. Extensive experiments conducted on benchmark datasets, including ImageNet-1K, COCO, and ADE20K, demonstrate that SPM significantly improves the performance of various models, particularly in dense prediction tasks such as object detection and semantic segmentation. Meanwhile, experiments show that combining SPM with different backbones can further improve performance. The code has been released at https://github.com/Yonghao-Yu/StepwisePatchMerging.
Related papers
- Structural Similarity-Inspired Unfolding for Lightweight Image Super-Resolution [88.20464308588889]
We propose a Structural Similarity-Inspired Unfolding (SSIU) method for efficient image SR.<n>This method is designed through unfolding an SR optimization function constrained by structural similarity.<n>Our model outperforms current state-of-the-art models, boasting lower parameter counts and reduced memory consumption.
arXiv Detail & Related papers (2025-06-13T14:29:40Z) - MSSFC-Net:Enhancing Building Interpretation with Multi-Scale Spatial-Spectral Feature Collaboration [4.480146005071275]
Building interpretation from remote sensing imagery primarily involves two fundamental tasks: building extraction and change detection.
We propose a Multi-Scale Spatial-Spectral Feature Cooperative Dual-Task Network (MSSFC-Net) for joint building extraction and change detection in remote sensing images.
arXiv Detail & Related papers (2025-04-01T13:10:23Z) - Semi-supervised Semantic Segmentation with Multi-Constraint Consistency Learning [81.02648336552421]
We propose a Multi-Constraint Consistency Learning approach to facilitate the staged enhancement of the encoder and decoder.<n>Self-adaptive feature masking and noise injection are designed in an instance-specific manner to perturb the features for robust learning of the decoder.<n> Experimental results on Pascal VOC2012 and Cityscapes datasets demonstrate that our proposed MCCL achieves new state-of-the-art performance.
arXiv Detail & Related papers (2025-03-23T03:21:33Z) - Optimized Unet with Attention Mechanism for Multi-Scale Semantic Segmentation [8.443350618722564]
This paper proposes an improved Unet model combined with an attention mechanism.
It introduces channel attention and spatial attention modules, enhances the model's ability to focus on important features.
The improved model performs well in terms of mIoU and pixel accuracy (PA), reaching 76.5% and 95.3% respectively.
arXiv Detail & Related papers (2025-02-06T06:51:23Z) - ContextFormer: Redefining Efficiency in Semantic Segmentation [48.81126061219231]
Convolutional methods, although capturing local dependencies well, struggle with long-range relationships.
Vision Transformers (ViTs) excel in global context capture but are hindered by high computational demands.
We propose ContextFormer, a hybrid framework leveraging the strengths of CNNs and ViTs in the bottleneck to balance efficiency, accuracy, and robustness for real-time semantic segmentation.
arXiv Detail & Related papers (2025-01-31T16:11:04Z) - MAT: Multi-Range Attention Transformer for Efficient Image Super-Resolution [14.265237560766268]
A flexible integration of attention across diverse spatial extents can yield significant performance enhancements.
We introduce Multi-Range Attention Transformer (MAT) tailored for Super Resolution (SR) tasks.
MAT adeptly capture dependencies across various spatial ranges, improving the diversity and efficacy of its feature representations.
arXiv Detail & Related papers (2024-11-26T08:30:31Z) - EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment [39.870809905905325]
We propose Empowering Multi-modal Mamba with Structural and Hierarchical Alignment (EMMA) to extract fine-grained visual information.
Our model shows lower latency than other Mamba-based MLLMs and is nearly four times faster than transformer-based MLLMs of similar scale during inference.
arXiv Detail & Related papers (2024-10-08T11:41:55Z) - EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings.
EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z) - CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications [73.80247057590519]
Vision Transformers (ViTs) mark a revolutionary advance in neural networks with their token mixer's powerful global context capability.<n>We introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers, to achieve a balance between efficiency and performance in mobile applications.<n>Our model achieves 83.0%/84.1% top-1 with only 12M/21M parameters on ImageNet-1K.
arXiv Detail & Related papers (2024-08-07T11:33:46Z) - P-MSDiff: Parallel Multi-Scale Diffusion for Remote Sensing Image Segmentation [8.46409964236009]
Diffusion models and multi-scale features are essential components in semantic segmentation tasks.
We propose a new model for semantic segmentation known as the diffusion model with parallel multi-scale branches.
Our model demonstrates superior performance based on the J1 metric on both the UAVid and Vaihingen Building datasets.
arXiv Detail & Related papers (2024-05-30T19:40:08Z) - AMMUNet: Multi-Scale Attention Map Merging for Remote Sensing Image Segmentation [4.618389486337933]
We propose AMMUNet, a UNet-based framework that employs multi-scale attention map merging.
The proposed AMMM effectively combines multi-scale attention maps into a unified representation using a fixed mask template.
We show that our approach achieves remarkable mean intersection over union (mIoU) scores of 75.48% on the Vaihingen dataset and an exceptional 77.90% on the Potsdam dataset.
arXiv Detail & Related papers (2024-04-20T15:23:15Z) - USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text
Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z) - Demystify Transformers & Convolutions in Modern Image Deep Networks [80.16624587948368]
This paper aims to identify the real gains of popular convolution and attention operators through a detailed study.<n>We find that the key difference among these feature transformation modules, such as attention or convolution, lies in their spatial feature aggregation approach.<n>Various STMs are integrated into this unified framework for comprehensive comparative analysis.
arXiv Detail & Related papers (2022-11-10T18:59:43Z) - MAFormer: A Transformer Network with Multi-scale Attention Fusion for
Visual Recognition [45.68567088645708]
We introduce Multi-scale Attention Fusion into transformer (MAFormer)
MAFormer explores local aggregation and global feature extraction in a dual-stream framework for visual recognition.
Our MAFormer achieves state-of-the-art performance on common vision tasks.
arXiv Detail & Related papers (2022-08-31T06:29:27Z) - SIM-Trans: Structure Information Modeling Transformer for Fine-grained
Visual Categorization [59.732036564862796]
We propose the Structure Information Modeling Transformer (SIM-Trans) to incorporate object structure information into transformer for enhancing discriminative representation learning.
The proposed two modules are light-weighted and can be plugged into any transformer network and trained end-to-end easily.
Experiments and analyses demonstrate that the proposed SIM-Trans achieves state-of-the-art performance on fine-grained visual categorization benchmarks.
arXiv Detail & Related papers (2022-08-31T03:00:07Z) - Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method-Vision Transformer with Convolutions Architecture Search (VTCAS)
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in the low illumination indoor scene.
arXiv Detail & Related papers (2022-03-20T02:59:51Z) - Aggregating Global Features into Local Vision Transformer [20.174762373916415]
Local Transformer-based classification models have recently achieved promising results with relatively low computational costs.
This work investigates the outcome of applying a global attention-based module named multi-resolution overlapped attention (MOA) in the local window-based transformer after each stage.
The proposed MOA employs slightly larger and overlapped patches in the key to enable neighborhood pixel information transmission, which leads to significant performance gain.
arXiv Detail & Related papers (2022-01-30T19:57:35Z) - Image-specific Convolutional Kernel Modulation for Single Image
Super-resolution [85.09413241502209]
In this issue, we propose a novel image-specific convolutional modulation kernel (IKM)
We exploit the global contextual information of image or feature to generate an attention weight for adaptively modulating the convolutional kernels.
Experiments on single image super-resolution show that the proposed methods achieve superior performances over state-of-the-art methods.
arXiv Detail & Related papers (2021-11-16T11:05:10Z) - Boosting Few-shot Semantic Segmentation with Transformers [81.43459055197435]
TRansformer-based Few-shot Semantic segmentation method (TRFS)
Our model consists of two modules: Global Enhancement Module (GEM) and Local Enhancement Module (LEM)
arXiv Detail & Related papers (2021-08-04T20:09:21Z) - Sequential Hierarchical Learning with Distribution Transformation for
Image Super-Resolution [83.70890515772456]
We build a sequential hierarchical learning super-resolution network (SHSR) for effective image SR.
We consider the inter-scale correlations of features, and devise a sequential multi-scale block (SMB) to progressively explore the hierarchical information.
Experiment results show SHSR achieves superior quantitative performance and visual quality to state-of-the-art methods.
arXiv Detail & Related papers (2020-07-19T01:35:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.