Vision Backbone Enhancement via Multi-Stage Cross-Scale Attention
- URL: http://arxiv.org/abs/2308.05872v2
- Date: Mon, 14 Aug 2023 18:27:12 GMT
- Title: Vision Backbone Enhancement via Multi-Stage Cross-Scale Attention
- Authors: Liang Shang, Yanli Liu, Zhengyang Lou, Shuxue Quan, Nagesh Adluru,
Bochen Guan, William A. Sethares
- Abstract summary: Multi-Stage Cross-Scale Attention (MSCSA) module takes feature maps from different stages to enable multi-stage interactions.
MSCSA provides a significant performance boost with modest additional FLOPs and runtime.
- Score: 5.045944819606334
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolutional neural networks (CNNs) and vision transformers (ViTs) have
achieved remarkable success in various vision tasks. However, many
architectures do not consider interactions between feature maps from different
stages and scales, which may limit their performance. In this work, we propose
a simple add-on attention module to overcome these limitations via multi-stage
and cross-scale interactions. Specifically, the proposed Multi-Stage
Cross-Scale Attention (MSCSA) module takes feature maps from different stages
to enable multi-stage interactions and achieves cross-scale interactions by
computing self-attention at different scales based on the multi-stage feature
maps. Our experiments on several downstream tasks show that MSCSA provides a
significant performance boost with modest additional FLOPs and runtime.
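To make the mechanism concrete, the minimal PyTorch sketch below pools projected feature maps from several backbone stages into one shared token sequence and runs self-attention against progressively subsampled copies of that sequence. This is an illustration of the idea as described in the abstract, not the authors' implementation; the module name, the shared embedding width, and the stride-based coarsening are all assumptions.
```python
# Minimal sketch of multi-stage cross-scale attention (illustrative only;
# not the authors' MSCSA code -- names, shapes, and scales are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSCSASketch(nn.Module):
    def __init__(self, stage_dims, embed_dim=256, num_heads=8, scales=(1, 2, 4)):
        super().__init__()
        # Project each stage's channel width to a shared embedding width.
        self.proj = nn.ModuleList([nn.Conv2d(d, embed_dim, 1) for d in stage_dims])
        # One attention layer per scale realizes the cross-scale interaction.
        self.attn = nn.ModuleList([
            nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
            for _ in scales
        ])
        self.scales = scales

    def forward(self, feats):
        # feats: list of stage maps [B, C_i, H_i, W_i] from the backbone.
        # Pool every stage to the coarsest resolution so tokens from all
        # stages live in one sequence (the multi-stage interaction).
        h, w = feats[-1].shape[-2:]
        tokens = torch.cat(
            [F.adaptive_avg_pool2d(p(f), (h, w)).flatten(2).transpose(1, 2)
             for p, f in zip(self.proj, feats)],
            dim=1)                          # [B, N, D]
        out = tokens
        for s, attn in zip(self.scales, self.attn):
            # Subsampling the token set by stride s gives a coarser scale;
            # full-resolution tokens then attend to it.
            coarse = out[:, ::s, :]
            out = out + attn(out, coarse, coarse, need_weights=False)[0]
        return out                          # fused multi-stage tokens
```
In this reading, concatenating stage tokens supplies the multi-stage interaction, while attending to subsampled token sets supplies the cross-scale one.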
Related papers
- Semi-supervised Semantic Segmentation for Remote Sensing Images via Multi-scale Uncertainty Consistency and Cross-Teacher-Student Attention [59.19580789952102]
This paper proposes a novel semi-supervised Multi-Scale Uncertainty and Cross-Teacher-Student Attention (MUCA) model for RS image semantic segmentation tasks.
MUCA constrains the consistency among feature maps at different layers of the network by introducing a multi-scale uncertainty consistency regularization.
MUCA utilizes a Cross-Teacher-Student attention mechanism that guides the student network to construct more discriminative feature representations.
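The uncertainty-weighted consistency idea can be sketched as a loss term: each scale's student prediction is pulled toward the teacher's, with pixels down-weighted where the teacher is uncertain. This is a generic formulation assumed for illustration, not the paper's exact regularizer; the entropy-based weighting and the function name are invented here.
```python
# Hedged sketch of a multi-scale, uncertainty-weighted consistency loss in
# the spirit of MUCA (illustrative; the paper's exact regularizer differs).
import torch
import torch.nn.functional as F

def uncertainty_consistency(student_logits, teacher_logits):
    # student_logits / teacher_logits: lists of per-scale logits [B, C, H, W].
    loss = 0.0
    for s, t in zip(student_logits, teacher_logits):
        p_t = F.softmax(t.detach(), dim=1)
        # Down-weight pixels where the teacher is uncertain (high entropy).
        entropy = -(p_t * torch.log(p_t + 1e-8)).sum(dim=1, keepdim=True)
        weight = torch.exp(-entropy)
        loss = loss + (weight * (F.softmax(s, dim=1) - p_t).pow(2)).mean()
    return loss / len(student_logits)
```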
arXiv Detail & Related papers (2025-01-18T11:57:20Z)
- Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models [50.98559225639266]
We investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories.
Our findings reveal that multi-layer features provide complementary strengths with varying task dependencies, and that uniform fusion leads to suboptimal performance.
We propose the instruction-guided vision aggregator, a module that dynamically integrates multi-layer visual features based on textual instructions.
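One plausible form of such a module is a soft gate over encoder layers conditioned on the instruction embedding, as in the sketch below; the gating design and all names are assumptions rather than the paper's architecture.
```python
# Illustrative sketch of instruction-conditioned fusion of multi-layer visual
# features (the layer-gating scheme is an assumption, not the paper's design).
import torch
import torch.nn as nn

class InstructionGuidedAggregator(nn.Module):
    def __init__(self, num_layers, dim):
        super().__init__()
        # Map the instruction embedding to one mixing weight per encoder layer.
        self.gate = nn.Linear(dim, num_layers)

    def forward(self, layer_feats, instr_emb):
        # layer_feats: [L, B, N, D] visual features from L encoder layers
        # instr_emb:   [B, D] pooled embedding of the textual instruction
        w = torch.softmax(self.gate(instr_emb), dim=-1)   # [B, L]
        w = w.transpose(0, 1)[..., None, None]            # [L, B, 1, 1]
        return (w * layer_feats).sum(dim=0)               # [B, N, D]
```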
arXiv Detail & Related papers (2024-12-26T05:41:31Z)
- Multi-Level Embedding and Alignment Network with Consistency and Invariance Learning for Cross-View Geo-Localization [2.733505168507872]
Cross-View Geo-Localization (CVGL) involves determining the localization of drone images by retrieving the most similar GPS-tagged satellite images.
Existing methods often overlook the problem of increased computational and storage requirements when improving model performance.
We propose a lightweight enhanced alignment network, called the Multi-Level Embedding and Alignment Network (MEAN).
arXiv Detail & Related papers (2024-12-19T13:10:38Z)
- HSVLT: Hierarchical Scale-Aware Vision-Language Transformer for Multi-Label Image Classification [15.129037250680582]
Tight visual-linguistic interactions play a vital role in improving classification performance.
Recent Transformer-based methods have achieved great success in multi-label image classification.
We propose a Hierarchical Scale-Aware Vision-Language Transformer (HSVLT) with two appealing designs.
arXiv Detail & Related papers (2024-07-23T07:31:42Z)
- An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z)
- Demystify Transformers & Convolutions in Modern Image Deep Networks [80.16624587948368]
This paper aims to identify the real gains of popular convolution and attention operators through a detailed study.
We find that the key difference among these feature transformation modules, or spatial token mixers (STMs), such as attention and convolution, lies in their spatial feature aggregation approach.
Various STMs are integrated into a unified framework for comprehensive comparative analysis.
arXiv Detail & Related papers (2022-11-10T18:59:43Z)
- Sequential Cross Attention Based Multi-task Learning [22.430705836627148]
We propose a novel architecture that effectively transfers informative features by applying the attention mechanism to the multi-scale features of the tasks.
Our method achieves state-of-the-art performance on the NYUD-v2 and PASCAL-Context datasets.
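A generic building block consistent with this description is a cross-attention layer in which one task's features query another's, applied per scale. The sketch below is illustrative only; the residual form and the names are assumptions, not the paper's architecture.
```python
# Rough sketch of cross-task attention over one scale's features (the
# sequential ordering across scales in the paper is not reproduced here).
import torch.nn as nn

class CrossTaskAttention(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target_feats, source_feats):
        # target_feats / source_feats: [B, N, D] token features of two tasks;
        # the target task queries the source task's features for transfer.
        out, _ = self.attn(target_feats, source_feats, source_feats)
        return self.norm(target_feats + out)
```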
arXiv Detail & Related papers (2022-09-06T14:17:33Z)
- Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training [88.80694147730883]
We investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks.
Under the conditions studied, we observe that a mostly unified encoder for vision and language signals outperforms all other variants that separate more parameters.
Our approach outperforms vanilla CLIP by 1.6 points in linear probing on a collection of 24 downstream vision tasks.
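The parameter-sharing idea can be illustrated as modality-specific input embeddings feeding a single shared transformer trunk, as below. The split between shared and private components is a guess for illustration; the paper ablates several sharing variants.
```python
# Sketch of a mostly modality-shared encoder in the spirit of MS-CLIP
# (the sharing split here is an assumption; vocab size follows CLIP's BPE).
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, dim=512, depth=12, num_heads=8, vocab=49408):
        super().__init__()
        # Modality-specific input projections...
        self.patch_embed = nn.LazyLinear(dim)     # image patches -> tokens
        self.text_embed = nn.Embedding(vocab, dim)  # text ids -> embeddings
        # ...feeding one shared transformer trunk for both modalities.
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, depth)

    def encode_image(self, patches):    # patches: [B, N, P]
        return self.trunk(self.patch_embed(patches))

    def encode_text(self, token_ids):   # token_ids: [B, T]
        return self.trunk(self.text_embed(token_ids))
```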
arXiv Detail & Related papers (2022-07-26T05:19:16Z)
- Progressive Multi-stage Interactive Training in Mobile Network for Fine-grained Recognition [8.727216421226814]
We propose a Progressive Multi-Stage Interactive training method with a Recursive Mosaic Generator (RMG-PMSI).
First, we propose a Recursive Mosaic Generator (RMG) that generates images with different granularities in different phases.
Then, the features of different stages pass through a Multi-Stage Interaction (MSI) module, which strengthens and complements the corresponding features of different stages.
Experiments on three prestigious fine-grained benchmarks show that RMG-PMSI can significantly improve the performance with good robustness and transferability.
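As a rough illustration of the mosaic idea, the function below re-tiles an image into an n x n grid and shuffles the tiles, so larger n yields a finer granularity; the recursion across training phases and the exact generator used in the paper are not reproduced here.
```python
# Hedged sketch of a mosaic generator that re-tiles an image at a chosen
# granularity (illustrative; not the exact RMG of the paper).
import torch

def mosaic(images, n):
    # images: [B, C, H, W]; split into an n x n grid and shuffle the tiles.
    b, c, h, w = images.shape
    th, tw = h // n, w // n
    tiles = images.unfold(2, th, th).unfold(3, tw, tw)  # [B, C, n, n, th, tw]
    tiles = tiles.reshape(b, c, n * n, th, tw)
    perm = torch.randperm(n * n)
    tiles = tiles[:, :, perm]                           # shuffle the tiles
    tiles = tiles.reshape(b, c, n, n, th, tw)
    # Reassemble the shuffled grid back into a full image.
    return tiles.permute(0, 1, 2, 4, 3, 5).reshape(b, c, h, w)
```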
arXiv Detail & Related papers (2021-12-08T10:50:03Z)
- Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features.
The experiment results on four benchmark datasets demonstrate that the proposed approach achieves the state-of-the-art performance without any post-processing.
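A common co-attention formulation matching this description updates the visual and linguistic features in parallel, each attending to the other; the sketch below uses that generic form and should not be read as EFN's exact design.
```python
# Minimal sketch of a co-attention step with a parallel update of visual and
# linguistic features (a generic formulation, not the exact EFN module).
import torch.nn as nn

class CoAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.v2l = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.l2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vis, lang):
        # vis: [B, N, D] visual tokens; lang: [B, T, D] word features.
        # Each modality attends to the other; both are updated in parallel
        # from the same (pre-update) inputs.
        new_vis = vis + self.l2v(vis, lang, lang, need_weights=False)[0]
        new_lang = lang + self.v2l(lang, vis, vis, need_weights=False)[0]
        return new_vis, new_lang
```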
arXiv Detail & Related papers (2021-05-05T02:27:25Z)
- Cross-Level Cross-Scale Cross-Attention Network for Point Cloud Representation [8.76786786874107]
The self-attention mechanism has recently achieved impressive advances in the Natural Language Processing (NLP) and image processing domains.
We propose an end-to-end architecture, dubbed Cross-Level Cross-Scale Cross-Attention Network (CLCSCANet) for point cloud representation learning.
arXiv Detail & Related papers (2021-04-27T09:01:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.