Related papers: iiANET: Inception Inspired Attention Hybrid Network for efficient Long-Range Dependency

Related papers

Integrating ConvNeXt and Vision Transformers for Enhancing Facial Age Estimation [10.995974662579124]
We present a novel hybrid architecture that combines convolutional neural networks (CNNs) with Vision Transformers (ViT)<n>Our proposed ConvNeXt-ViT hybrid solution was thoroughly evaluated on benchmark age estimation datasets.<n>The results show that the ConvNeXt-ViT hybrid not only outperforms traditional methods, but also provides a robust foundation for future advances in age estimation.
arXiv Detail & Related papers (2025-10-31T09:36:28Z)
EchoGen: Generating Visual Echoes in Any Scene via Feed-Forward Subject-Driven Auto-Regressive Model [56.53617289548353]
EchoGen is a pioneering framework that empowers Visual Auto-Regressive ( VAR) models with subject-driven generation capabilities.<n>We employ a semantic encoder to extract the subject's abstract identity, which is injected through decoupled cross-attention to guide the overall composition.<n>To the best of our knowledge, EchoGen is the first feed-forward subject-driven framework built upon VAR models.
arXiv Detail & Related papers (2025-09-30T11:45:48Z)
A Lightweight Convolution and Vision Transformer integrated model with Multi-scale Self-attention Mechanism [41.02402160100821]
Vision Transformer (ViT) has prevailed in computer vision tasks due to its strong long-range dependency modelling ability.<n>We propose an efficient ViT model with sparse attention (dubbed SAEViT) and convolution blocks.<n>Experiments on mainstream datasets show that SAEViT achieves Top-1 accuracies of 76.3% and 79.6% on the ImageNet-1K classification task.
arXiv Detail & Related papers (2025-08-23T03:05:34Z)
GCRPNet: Graph-Enhanced Contextual and Regional Perception Network for Salient Object Detection in Optical Remote Sensing Images [68.33481681452675]
We propose a graph-enhanced contextual and regional perception network (GCRPNet)<n>It builds upon the Mamba architecture to simultaneously capture long-range dependencies and enhance regional feature representation.<n>It performs adaptive patch scanning on feature maps processed via multi-scale convolutions, thereby capturing rich local region information.
arXiv Detail & Related papers (2025-08-14T11:31:43Z)
Structural Similarity-Inspired Unfolding for Lightweight Image Super-Resolution [88.20464308588889]
We propose a Structural Similarity-Inspired Unfolding (SSIU) method for efficient image SR.<n>This method is designed through unfolding an SR optimization function constrained by structural similarity.<n>Our model outperforms current state-of-the-art models, boasting lower parameter counts and reduced memory consumption.
arXiv Detail & Related papers (2025-06-13T14:29:40Z)
Visual Evolutionary Optimization on Combinatorial Problems with Multimodal Large Language Models: A Case Study of Influence Maximization [7.890526174400841]
Graph-structured problems in complex networks are prevalent in many domains, and are computationally demanding.<n>Traditional evolutionary algorithms (EAs) often face obstacles due to content-shallow encoding limitations and lack of structural awareness.<n>We introduce an original framework, Visual Evolutionary Optimization (VEO), leveraging multimodal large language models (MLLMs) as the evolutionary backbone.
arXiv Detail & Related papers (2025-05-11T05:23:02Z)
VRS-UIE: Value-Driven Reordering Scanning for Underwater Image Enhancement [104.78586859995333]
State Space Models (SSMs) have emerged as a promising backbone for vision tasks due to their linear complexity and global receptive field.<n>The predominance of large-portion, homogeneous but useless oceanic backgrounds can dilute the feature representation responses of sparse yet valuable targets.<n>We propose a novel Value-Driven Reordering Scanning framework for Underwater Image Enhancement (UIE)<n>Our framework sets a new state-of-the-art, delivering superior enhancement performance (surpassing WMamba by 0.89 dB on average) by effectively suppressing water bias and preserving structural and color fidelity.
arXiv Detail & Related papers (2025-05-02T12:21:44Z)
An Efficient and Mixed Heterogeneous Model for Image Restoration [71.85124734060665]
Current mainstream approaches are based on three architectural paradigms: CNNs, Transformers, and Mambas. We propose RestorMixer, an efficient and general-purpose IR model based on mixed-architecture fusion.
arXiv Detail & Related papers (2025-04-15T08:19:12Z)
vGamba: Attentive State Space Bottleneck for efficient Long-range Dependencies in Visual Recognition [0.0]
State-space models (SSMs) offer an alternative, but their application in vision remains underexplored. This work introduces vGamba, a hybrid vision backbone that integrates SSMs with attention mechanisms to enhance efficiency and expressiveness. Tests on classification, detection, and segmentation tasks demonstrate that vGamba achieves a superior trade-off between accuracy and computational efficiency, outperforming several existing models.
arXiv Detail & Related papers (2025-03-27T08:39:58Z)
Towards Efficient Model-Heterogeneity Federated Learning for Large Models [18.008063521900702]
We introduce HeteroTune, an innovative fine-tuning framework tailored for model-heterogeneity federated learning (MHFL) In particular, we propose a novel parameter-efficient fine-tuning structure, called FedAdapter, which employs a multi-branch cross-model aggregator. Benefiting from the lightweight FedAdapter, our approach significantly reduces both the computational and communication overhead.
arXiv Detail & Related papers (2024-11-25T09:58:51Z)
Dual-Hybrid Attention Network for Specular Highlight Removal [34.99543751199565]
Specular highlight removal plays a pivotal role in multimedia applications, as it enhances the quality and interpretability of images and videos. Current state-of-the-art approaches often rely on additional priors or supervision, limiting their practicality and generalization capability. We propose the Dual-Hybrid Attention Network for Specular Highlight Removal (DHAN-SHR), an end-to-end network that introduces novel hybrid attention mechanisms.
arXiv Detail & Related papers (2024-07-17T01:52:41Z)
Hybrid Convolutional and Attention Network for Hyperspectral Image Denoising [54.110544509099526]
Hyperspectral image (HSI) denoising is critical for the effective analysis and interpretation of hyperspectral data. We propose a hybrid convolution and attention network (HCANet) to enhance HSI denoising. Experimental results on mainstream HSI datasets demonstrate the rationality and effectiveness of the proposed HCANet.
arXiv Detail & Related papers (2024-03-15T07:18:43Z)
NiNformer: A Network in Network Transformer with Token Mixing as a Gating Function Generator [1.3812010983144802]
The attention mechanism was utilized in computer vision as the Vision Transformer ViT. It comes with the drawback of being expensive and requiring datasets of considerable size for effective optimization. This paper introduces a new computational block as an alternative to the standard ViT block that reduces the compute burdens.
arXiv Detail & Related papers (2024-03-04T19:08:20Z)
RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies. Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks. Our model synthesizes high-fidelity video clips at a resolution of $256times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
arXiv Detail & Related papers (2024-01-11T16:48:44Z)
ViR: Towards Efficient Vision Retention Backbones [97.93707844681893]
We propose a new class of computer vision models, dubbed Vision Retention Networks (ViR) ViR has dual parallel and recurrent formulations, which strike an optimal balance between fast inference and parallel training with competitive performance. We have validated the effectiveness of ViR through extensive experiments with different dataset sizes and various image resolutions.
arXiv Detail & Related papers (2023-10-30T16:55:50Z)
Distance Weighted Trans Network for Image Completion [52.318730994423106]
We propose a new architecture that relies on Distance-based Weighted Transformer (DWT) to better understand the relationships between an image's components. CNNs are used to augment the local texture information of coarse priors. DWT blocks are used to recover certain coarse textures and coherent visual structures.
arXiv Detail & Related papers (2023-10-11T12:46:11Z)
Enhancing Once-For-All: A Study on Parallel Blocks, Skip Connections and Early Exits [7.0895962209555465]
Once-For-All (OFA) is an eco-friendly algorithm characterised by the ability to generate easily adaptable models. OFA is improved from an architectural point of view by including early exits, parallel blocks and dense skip connections. OFAAv2 improves its accuracy performance on the Tiny ImageNet dataset by up to 12.07% compared to the original version of OFA.
arXiv Detail & Related papers (2023-02-03T17:53:40Z)
A Generic Shared Attention Mechanism for Various Backbone Neural Networks [53.36677373145012]
Self-attention modules (SAMs) produce strongly correlated attention maps across different layers. Dense-and-Implicit Attention (DIA) shares SAMs across layers and employs a long short-term memory module. Our simple yet effective DIA can consistently enhance various network backbones.
arXiv Detail & Related papers (2022-10-27T13:24:08Z)
EMC2A-Net: An Efficient Multibranch Cross-channel Attention Network for SAR Target Classification [10.479559839534033]
This paper proposed two residual blocks, namely EMC2A blocks with multiscale receptive fields(RFs), based on a multibranch structure and then designed an efficient isotopic architecture deep CNN (DCNN), EMC2A-Net. EMC2A blocks utilize parallel dilated convolution with different dilation rates, which can effectively capture multiscale context features without significantly increasing the computational burden. This paper proposed a multiscale feature cross-channel attention module, namely the EMC2A module, adopting a local multiscale feature interaction strategy without dimensionality reduction.
arXiv Detail & Related papers (2022-08-03T04:31:52Z)
Activating More Pixels in Image Super-Resolution Transformer [53.87533738125943]
Transformer-based methods have shown impressive performance in low-level vision tasks, such as image super-resolution. We propose a novel Hybrid Attention Transformer (HAT) to activate more input pixels for better reconstruction. Our overall method significantly outperforms the state-of-the-art methods by more than 1dB.
arXiv Detail & Related papers (2022-05-09T17:36:58Z)
AA-RMVSNet: Adaptive Aggregation Recurrent Multi-view Stereo Network [8.127449025802436]
We present a novel recurrent multi-view stereo network based on long short-term memory (LSTM) with adaptive aggregation, namely AA-RMVSNet. We firstly introduce an intra-view aggregation module to adaptively extract image features by using context-aware convolution and multi-scale aggregation. We propose an inter-view cost volume aggregation module for adaptive pixel-wise view aggregation, which is able to preserve better-matched pairs among all views.
arXiv Detail & Related papers (2021-08-09T06:10:48Z)
MVFNet: Multi-View Fusion Network for Efficient Video Recognition [79.92736306354576]
We introduce a multi-view fusion (MVF) module to exploit video complexity using separable convolution for efficiency. MVFNet can be thought of as a generalized video modeling framework.
arXiv Detail & Related papers (2020-12-13T06:34:18Z)
Recursive Multi-model Complementary Deep Fusion forRobust Salient Object Detection via Parallel Sub Networks [62.26677215668959]
Fully convolutional networks have shown outstanding performance in the salient object detection (SOD) field. This paper proposes a wider'' network architecture which consists of parallel sub networks with totally different network architectures. Experiments on several famous benchmarks clearly demonstrate the superior performance, good generalization, and powerful learning ability of the proposed wider framework.
arXiv Detail & Related papers (2020-08-07T10:39:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.