Hyb-KAN ViT: Hybrid Kolmogorov-Arnold Networks Augmented Vision Transformer
- URL: http://arxiv.org/abs/2505.04740v1
- Date: Wed, 07 May 2025 19:13:17 GMT
- Title: Hyb-KAN ViT: Hybrid Kolmogorov-Arnold Networks Augmented Vision Transformer
- Authors: Sainath Dey, Mitul Goswami, Jashika Sethi, Prasant Kumar Pattnaik,
- Abstract summary: This study introduces Hybrid Kolmogorov-Arnold Network (KAN)-T (Hyb-KAN ViT) to address the inherent limitations of Multi-Arnold Perceptrons (MLPs) in Vision Transformers (ViTs)<n>Hyb-KAN ViT is a novel framework that integrates wavelet-based spectral decomposition and spline-optimized activation functions.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study addresses the inherent limitations of Multi-Layer Perceptrons (MLPs) in Vision Transformers (ViTs) by introducing Hybrid Kolmogorov-Arnold Network (KAN)-ViT (Hyb-KAN ViT), a novel framework that integrates wavelet-based spectral decomposition and spline-optimized activation functions, prior work has failed to focus on the prebuilt modularity of the ViT architecture and integration of edge detection capabilities of Wavelet functions. We propose two key modules: Efficient-KAN (Eff-KAN), which replaces MLP layers with spline functions and Wavelet-KAN (Wav-KAN), leveraging orthogonal wavelet transforms for multi-resolution feature extraction. These modules are systematically integrated in ViT encoder layers and classification heads to enhance spatial-frequency modeling while mitigating computational bottlenecks. Experiments on ImageNet-1K (Image Recognition), COCO (Object Detection and Instance Segmentation), and ADE20K (Semantic Segmentation) demonstrate state-of-the-art performance with Hyb-KAN ViT. Ablation studies validate the efficacy of wavelet-driven spectral priors in segmentation and spline-based efficiency in detection tasks. The framework establishes a new paradigm for balancing parameter efficiency and multi-scale representation in vision architectures.
Related papers
- SIEFormer: Spectral-Interpretable and -Enhanced Transformer for Generalized Category Discovery [14.288193104482986]
SIEFormer is composed of two main branches, each corresponding to an implicit and explicit spectral perspective of the ViT.<n>The implicit branch realizes the use of different types of graph Laplacians to model the local structure correlations of tokens.<n>The explicit branch, on the other hand, introduces a Maneuverable Filtering Layer (MFL) that learns global dependencies among tokens.
arXiv Detail & Related papers (2026-02-13T16:22:31Z) - SFFR: Spatial-Frequency Feature Reconstruction for Multispectral Aerial Object Detection [12.521255528136278]
We propose a novel Spatial and Frequency Feature Reconstruction method (SFFR) method.<n>It reconstructs complementary representations in both spatial and frequency domains prior to feature fusion.<n>It is experimentally validated that our proposed FCEKAN and MSGKAN modules are complementary and can effectively capture the frequency and spatial semantic features respectively.
arXiv Detail & Related papers (2025-11-09T09:34:10Z) - A Cross-Hierarchical Multi-Feature Fusion Network Based on Multiscale Encoder-Decoder for Hyperspectral Change Detection [3.5421087596321352]
This paper proposes a cross-hierarchical multi-feature fusion network (CHMFFN) based on a multiscale encoder-decoder architecture.<n> Experiments on four public hyperspectral datasets show CHMFFN outperforms state-of-the-art methods, verifying its effectiveness.
arXiv Detail & Related papers (2025-09-21T09:04:28Z) - GCRPNet: Graph-Enhanced Contextual and Regional Perception Network for Salient Object Detection in Optical Remote Sensing Images [68.33481681452675]
We propose a graph-enhanced contextual and regional perception network (GCRPNet)<n>It builds upon the Mamba architecture to simultaneously capture long-range dependencies and enhance regional feature representation.<n>It performs adaptive patch scanning on feature maps processed via multi-scale convolutions, thereby capturing rich local region information.
arXiv Detail & Related papers (2025-08-14T11:31:43Z) - Frequency-Dynamic Attention Modulation for Dense Prediction [14.066404173580864]
We propose a circuit-theory-inspired strategy called Frequency-Dynamic Attention Modulation (FDAM)<n>FDAM directly modulates the overall frequency response of Vision Transformers (ViTs)
arXiv Detail & Related papers (2025-07-16T07:59:54Z) - VRS-UIE: Value-Driven Reordering Scanning for Underwater Image Enhancement [104.78586859995333]
State Space Models (SSMs) have emerged as a promising backbone for vision tasks due to their linear complexity and global receptive field.<n>The predominance of large-portion, homogeneous but useless oceanic backgrounds can dilute the feature representation responses of sparse yet valuable targets.<n>We propose a novel Value-Driven Reordering Scanning framework for Underwater Image Enhancement (UIE)<n>Our framework sets a new state-of-the-art, delivering superior enhancement performance (surpassing WMamba by 0.89 dB on average) by effectively suppressing water bias and preserving structural and color fidelity.
arXiv Detail & Related papers (2025-05-02T12:21:44Z) - FreSca: Scaling in Frequency Space Enhances Diffusion Models [55.75504192166779]
This paper explores frequency-based control within latent diffusion models.<n>We introduce FreSca, a novel framework that decomposes noise difference into low- and high-frequency components.<n>FreSca operates without any model retraining or architectural change, offering model- and task-agnostic control.
arXiv Detail & Related papers (2025-04-02T22:03:11Z) - BHViT: Binarized Hybrid Vision Transformer [53.38894971164072]
Model binarization has made significant progress in enabling real-time and energy-efficient computation for convolutional neural networks (CNN)<n>We propose BHViT, a binarization-friendly hybrid ViT architecture and its full binarization model with the guidance of three important observations.<n>Our proposed algorithm achieves SOTA performance among binary ViT methods.
arXiv Detail & Related papers (2025-03-04T08:35:01Z) - Spatial-Spectral Diffusion Contrastive Representation Network for Hyperspectral Image Classification [8.600534616819333]
This paper presents a Spatial-Spectral Diffusion Contrastive Representation Network (DiffCRN)<n>DiffCRN is based on denoising diffusion probabilistic model (DDPM) combined with contrastive learning (CL) for hyperspectral images classification.<n> Experiments conducted on widely used four HSI datasets demonstrate the improved performance of the proposed DiffCRN.
arXiv Detail & Related papers (2025-02-27T02:34:23Z) - FE-UNet: Frequency Domain Enhanced U-Net with Segment Anything Capability for Versatile Image Segmentation [50.9040167152168]
We experimentally quantify the contrast sensitivity function of CNNs and compare it with that of the human visual system.<n>We propose the Wavelet-Guided Spectral Pooling Module (WSPM) to enhance and balance image features across the frequency domain.<n>To further emulate the human visual system, we introduce the Frequency Domain Enhanced Receptive Field Block (FE-RFB)<n>We develop FE-UNet, a model that utilizes SAM2 as its backbone and incorporates Hiera-Large as a pre-trained block.
arXiv Detail & Related papers (2025-02-06T07:24:34Z) - A Hybrid Transformer-Mamba Network for Single Image Deraining [70.64069487982916]
Existing deraining Transformers employ self-attention mechanisms with fixed-range windows or along channel dimensions.
We introduce a novel dual-branch hybrid Transformer-Mamba network, denoted as TransMamba, aimed at effectively capturing long-range rain-related dependencies.
arXiv Detail & Related papers (2024-08-31T10:03:19Z) - PVAFN: Point-Voxel Attention Fusion Network with Multi-Pooling Enhancing for 3D Object Detection [59.355022416218624]
integration of point and voxel representations is becoming more common in LiDAR-based 3D object detection.
We propose a novel two-stage 3D object detector, called Point-Voxel Attention Fusion Network (PVAFN)
PVAFN uses a multi-pooling strategy to integrate both multi-scale and region-specific information effectively.
arXiv Detail & Related papers (2024-08-26T19:43:01Z) - Unveiling the Power of Wavelets: A Wavelet-based Kolmogorov-Arnold Network for Hyperspectral Image Classification [3.009351592961681]
This paper employs Wavelet-based Kolmogorov-Arnold Network (wav-kan) architecture tailored for efficient modeling of complex dependencies.
Wavelet-based activation allows Wav-KAN to effectively capture multi-scale spatial and spectral patterns.
arXiv Detail & Related papers (2024-06-12T04:52:40Z) - Hybrid Convolutional and Attention Network for Hyperspectral Image Denoising [54.110544509099526]
Hyperspectral image (HSI) denoising is critical for the effective analysis and interpretation of hyperspectral data.
We propose a hybrid convolution and attention network (HCANet) to enhance HSI denoising.
Experimental results on mainstream HSI datasets demonstrate the rationality and effectiveness of the proposed HCANet.
arXiv Detail & Related papers (2024-03-15T07:18:43Z) - Transformer variational wave functions for frustrated quantum spin
systems [0.0]
We propose an adaptation of the ViT architecture with complex parameters to define a new class of variational neural-network states.
The success of the ViT wave function relies on mixing both local and global operations.
arXiv Detail & Related papers (2022-11-10T11:56:44Z) - RetiFluidNet: A Self-Adaptive and Multi-Attention Deep Convolutional
Network for Retinal OCT Fluid Segmentation [3.57686754209902]
Quantification of retinal fluids is necessary for OCT-guided treatment management.
New convolutional neural architecture named RetiFluidNet is proposed for multi-class retinal fluid segmentation.
Model benefits from hierarchical representation learning of textural, contextual, and edge features.
arXiv Detail & Related papers (2022-09-26T07:18:00Z) - PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered the solution vision tasks with transformers, it directly translates the image feature map into the object result.
Recent transformer-based image recognition model andTT show consistent efficiency gain.
arXiv Detail & Related papers (2021-09-15T01:10:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.