Related papers: iiANET: Inception Inspired Attention Hybrid Network for efficient Long-Range Dependency

iiANET: Inception Inspired Attention Hybrid Network for efficient Long-Range Dependency

URL: http://arxiv.org/abs/2407.07603v2
Date: Sat, 12 Apr 2025 11:32:38 GMT
Title: iiANET: Inception Inspired Attention Hybrid Network for efficient Long-Range Dependency
Authors: Haruna Yunusa, Qin Shiyin, Abdulrahman Hamman Adama Chukkol, Adamu Lawan, Abdulganiyu Abdu Yusuf, Isah Bello,
Abstract summary: We present iiANET, an efficient hybrid visual backbone designed to improve the modeling of long-range dependen-cies.<n>The core innovation of iiANET is the iiABlock, a unified building block that in-tegrates global r-MHSA (Multi-Head Self-Attention) and convolutional layers in paral-lel.
Score: 0.5497663232622965
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The recent emergence of hybrid models has introduced a transformative approach to computer vision, gradually moving beyond conventional convolutional neural net-works and vision transformers. However, efficiently combining these two paradigms to better capture long-range dependencies in complex images remains a challenge. In this paper, we present iiANET (Inception Inspired Attention Network), an efficient hybrid visual backbone designed to improve the modeling of long-range dependen-cies. The core innovation of iiANET is the iiABlock, a unified building block that in-tegrates global r-MHSA (Multi-Head Self-Attention) and convolutional layers in paral-lel. This design enables iiABlock to simultaneously capture global context and local details, making it highly effective for extracting rich and diverse features. By effi-ciently fusing these complementary representations, iiABlock allows iiANET to achieve strong feature interaction while maintaining computational efficiency. Exten-sive qualitative and quantitative evaluations across various benchmarks show im-proved performance over several state-of-the-art models.

Related papers

Structural Similarity-Inspired Unfolding for Lightweight Image Super-Resolution [88.20464308588889]
We propose a Structural Similarity-Inspired Unfolding (SSIU) method for efficient image SR.<n>This method is designed through unfolding an SR optimization function constrained by structural similarity.<n>Our model outperforms current state-of-the-art models, boasting lower parameter counts and reduced memory consumption.
arXiv Detail & Related papers (2025-06-13T14:29:40Z)
Visual Evolutionary Optimization on Combinatorial Problems with Multimodal Large Language Models: A Case Study of Influence Maximization [7.890526174400841]
Graph-structured problems in complex networks are prevalent in many domains, and are computationally demanding.<n>Traditional evolutionary algorithms (EAs) often face obstacles due to content-shallow encoding limitations and lack of structural awareness.<n>We introduce an original framework, Visual Evolutionary Optimization (VEO), leveraging multimodal large language models (MLLMs) as the evolutionary backbone.
arXiv Detail & Related papers (2025-05-11T05:23:02Z)
An Efficient and Mixed Heterogeneous Model for Image Restoration [71.85124734060665]
Current mainstream approaches are based on three architectural paradigms: CNNs, Transformers, and Mambas. We propose RestorMixer, an efficient and general-purpose IR model based on mixed-architecture fusion.
arXiv Detail & Related papers (2025-04-15T08:19:12Z)
vGamba: Attentive State Space Bottleneck for efficient Long-range Dependencies in Visual Recognition [0.0]
State-space models (SSMs) offer an alternative, but their application in vision remains underexplored. This work introduces vGamba, a hybrid vision backbone that integrates SSMs with attention mechanisms to enhance efficiency and expressiveness. Tests on classification, detection, and segmentation tasks demonstrate that vGamba achieves a superior trade-off between accuracy and computational efficiency, outperforming several existing models.
arXiv Detail & Related papers (2025-03-27T08:39:58Z)
Towards Efficient Model-Heterogeneity Federated Learning for Large Models [18.008063521900702]
We introduce HeteroTune, an innovative fine-tuning framework tailored for model-heterogeneity federated learning (MHFL) In particular, we propose a novel parameter-efficient fine-tuning structure, called FedAdapter, which employs a multi-branch cross-model aggregator. Benefiting from the lightweight FedAdapter, our approach significantly reduces both the computational and communication overhead.
arXiv Detail & Related papers (2024-11-25T09:58:51Z)
Dual-Hybrid Attention Network for Specular Highlight Removal [34.99543751199565]
Specular highlight removal plays a pivotal role in multimedia applications, as it enhances the quality and interpretability of images and videos. Current state-of-the-art approaches often rely on additional priors or supervision, limiting their practicality and generalization capability. We propose the Dual-Hybrid Attention Network for Specular Highlight Removal (DHAN-SHR), an end-to-end network that introduces novel hybrid attention mechanisms.
arXiv Detail & Related papers (2024-07-17T01:52:41Z)
Hybrid Convolutional and Attention Network for Hyperspectral Image Denoising [54.110544509099526]
Hyperspectral image (HSI) denoising is critical for the effective analysis and interpretation of hyperspectral data. We propose a hybrid convolution and attention network (HCANet) to enhance HSI denoising. Experimental results on mainstream HSI datasets demonstrate the rationality and effectiveness of the proposed HCANet.
arXiv Detail & Related papers (2024-03-15T07:18:43Z)
NiNformer: A Network in Network Transformer with Token Mixing as a Gating Function Generator [1.3812010983144802]
The attention mechanism was utilized in computer vision as the Vision Transformer ViT. It comes with the drawback of being expensive and requiring datasets of considerable size for effective optimization. This paper introduces a new computational block as an alternative to the standard ViT block that reduces the compute burdens.
arXiv Detail & Related papers (2024-03-04T19:08:20Z)
RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies. Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks. Our model synthesizes high-fidelity video clips at a resolution of $256times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
arXiv Detail & Related papers (2024-01-11T16:48:44Z)
ViR: Towards Efficient Vision Retention Backbones [97.93707844681893]
We propose a new class of computer vision models, dubbed Vision Retention Networks (ViR) ViR has dual parallel and recurrent formulations, which strike an optimal balance between fast inference and parallel training with competitive performance. We have validated the effectiveness of ViR through extensive experiments with different dataset sizes and various image resolutions.
arXiv Detail & Related papers (2023-10-30T16:55:50Z)
Distance Weighted Trans Network for Image Completion [52.318730994423106]
We propose a new architecture that relies on Distance-based Weighted Transformer (DWT) to better understand the relationships between an image's components. CNNs are used to augment the local texture information of coarse priors. DWT blocks are used to recover certain coarse textures and coherent visual structures.
arXiv Detail & Related papers (2023-10-11T12:46:11Z)
Enhancing Once-For-All: A Study on Parallel Blocks, Skip Connections and Early Exits [7.0895962209555465]
Once-For-All (OFA) is an eco-friendly algorithm characterised by the ability to generate easily adaptable models. OFA is improved from an architectural point of view by including early exits, parallel blocks and dense skip connections. OFAAv2 improves its accuracy performance on the Tiny ImageNet dataset by up to 12.07% compared to the original version of OFA.
arXiv Detail & Related papers (2023-02-03T17:53:40Z)
A Generic Shared Attention Mechanism for Various Backbone Neural Networks [53.36677373145012]
Self-attention modules (SAMs) produce strongly correlated attention maps across different layers. Dense-and-Implicit Attention (DIA) shares SAMs across layers and employs a long short-term memory module. Our simple yet effective DIA can consistently enhance various network backbones.
arXiv Detail & Related papers (2022-10-27T13:24:08Z)
EMC2A-Net: An Efficient Multibranch Cross-channel Attention Network for SAR Target Classification [10.479559839534033]
This paper proposed two residual blocks, namely EMC2A blocks with multiscale receptive fields(RFs), based on a multibranch structure and then designed an efficient isotopic architecture deep CNN (DCNN), EMC2A-Net. EMC2A blocks utilize parallel dilated convolution with different dilation rates, which can effectively capture multiscale context features without significantly increasing the computational burden. This paper proposed a multiscale feature cross-channel attention module, namely the EMC2A module, adopting a local multiscale feature interaction strategy without dimensionality reduction.
arXiv Detail & Related papers (2022-08-03T04:31:52Z)
Activating More Pixels in Image Super-Resolution Transformer [53.87533738125943]
Transformer-based methods have shown impressive performance in low-level vision tasks, such as image super-resolution. We propose a novel Hybrid Attention Transformer (HAT) to activate more input pixels for better reconstruction. Our overall method significantly outperforms the state-of-the-art methods by more than 1dB.
arXiv Detail & Related papers (2022-05-09T17:36:58Z)
AA-RMVSNet: Adaptive Aggregation Recurrent Multi-view Stereo Network [8.127449025802436]
We present a novel recurrent multi-view stereo network based on long short-term memory (LSTM) with adaptive aggregation, namely AA-RMVSNet. We firstly introduce an intra-view aggregation module to adaptively extract image features by using context-aware convolution and multi-scale aggregation. We propose an inter-view cost volume aggregation module for adaptive pixel-wise view aggregation, which is able to preserve better-matched pairs among all views.
arXiv Detail & Related papers (2021-08-09T06:10:48Z)
MVFNet: Multi-View Fusion Network for Efficient Video Recognition [79.92736306354576]
We introduce a multi-view fusion (MVF) module to exploit video complexity using separable convolution for efficiency. MVFNet can be thought of as a generalized video modeling framework.
arXiv Detail & Related papers (2020-12-13T06:34:18Z)
Recursive Multi-model Complementary Deep Fusion forRobust Salient Object Detection via Parallel Sub Networks [62.26677215668959]
Fully convolutional networks have shown outstanding performance in the salient object detection (SOD) field. This paper proposes a wider'' network architecture which consists of parallel sub networks with totally different network architectures. Experiments on several famous benchmarks clearly demonstrate the superior performance, good generalization, and powerful learning ability of the proposed wider framework.
arXiv Detail & Related papers (2020-08-07T10:39:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.