Kolmogorov-Arnold Transformer
- URL: http://arxiv.org/abs/2409.10594v1
- Date: Mon, 16 Sep 2024 17:54:51 GMT
- Title: Kolmogorov-Arnold Transformer
- Authors: Xingyi Yang, Xinchao Wang
- Abstract summary: We introduce the Kolmogorov-Arnold Transformer (KAT), a novel architecture that replaces layers with Kolmogorov-Arnold Network (KAN) layers.
We identify three key challenges: (C1) Base function, (C2) Parameter and computation inefficiency, and (C3) Weight initialization.
With these designs, KAT outperforms traditional MLP-based transformers.
- Score: 72.88137795439407
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Transformers stand as the cornerstone of modern deep learning. Traditionally, these models rely on multi-layer perceptron (MLP) layers to mix the information between channels. In this paper, we introduce the Kolmogorov-Arnold Transformer (KAT), a novel architecture that replaces MLP layers with Kolmogorov-Arnold Network (KAN) layers to enhance the expressiveness and performance of the model. Integrating KANs into transformers, however, is no easy feat, especially when scaled up. Specifically, we identify three key challenges: (C1) Base function. The standard B-spline function used in KANs is not optimized for parallel computing on modern hardware, resulting in slower inference speeds. (C2) Parameter and Computation Inefficiency. KAN requires a unique function for each input-output pair, making the computation extremely large. (C3) Weight initialization. The initialization of weights in KANs is particularly challenging due to their learnable activation functions, which are critical for achieving convergence in deep neural networks. To overcome the aforementioned challenges, we propose three key solutions: (S1) Rational basis. We replace B-spline functions with rational functions to improve compatibility with modern GPUs. By implementing this in CUDA, we achieve faster computations. (S2) Group KAN. We share the activation weights through a group of neurons, to reduce the computational load without sacrificing performance. (S3) Variance-preserving initialization. We carefully initialize the activation weights to make sure that the activation variance is maintained across layers. With these designs, KAT scales effectively and readily outperforms traditional MLP-based transformers.
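To make (S1)-(S3) concrete, below is a minimal sketch of a group-wise rational activation in PyTorch. The class name, polynomial orders, group count, and near-identity initialization are illustrative assumptions rather than the paper's released CUDA implementation; the sketch only shows how one rational function P(x)/Q(x) can be shared by all channels in a group and started close to the identity so that activation variance is roughly preserved at initialization.

```python
# Hypothetical sketch of a group-wise rational activation in the spirit of
# KAT's rational basis (S1), Group KAN (S2), and variance-preserving
# initialization (S3). Names and polynomial orders are assumptions.
import torch
import torch.nn as nn


class GroupRationalActivation(nn.Module):
    """Applies y = P(x) / Q(x) channel-wise, with one (P, Q) per group.

    Channels within a group share the same learnable coefficients, which
    reduces parameters compared with one function per input-output pair.
    """

    def __init__(self, num_channels: int, num_groups: int = 8,
                 p_order: int = 5, q_order: int = 4):
        super().__init__()
        assert num_channels % num_groups == 0
        self.num_groups = num_groups
        # One coefficient set per group (assumption: shared within a group).
        self.p_coeff = nn.Parameter(torch.zeros(num_groups, p_order + 1))
        self.q_coeff = nn.Parameter(torch.zeros(num_groups, q_order))
        with torch.no_grad():
            # Start close to the identity: P(x) ~ x and Q(x) ~ 1, so the
            # layer roughly preserves activation variance at initialization
            # (a stand-in for the paper's variance-preserving scheme).
            self.p_coeff[:, 1] = 1.0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., num_channels); split the channel dim into groups.
        *lead, c = x.shape
        xg = x.reshape(*lead, self.num_groups, c // self.num_groups)

        # Evaluate P and Q with Horner's scheme, broadcasting per group.
        p = torch.zeros_like(xg)
        for j in range(self.p_coeff.shape[1] - 1, -1, -1):
            p = p * xg + self.p_coeff[:, j].view(-1, 1)
        q = torch.zeros_like(xg)
        for j in range(self.q_coeff.shape[1] - 1, -1, -1):
            q = q * xg + self.q_coeff[:, j].view(-1, 1)
        q = 1.0 + q.abs()  # keep the denominator positive and away from zero

        return (p / q).reshape(*lead, c)


# Usage sketch: drop-in nonlinearity on a (batch, tokens, channels) tensor.
act = GroupRationalActivation(num_channels=768, num_groups=8)
out = act(torch.randn(2, 196, 768))
```

A rational function of this form is evaluated with a handful of fused multiply-adds per element, which is one reason it maps to GPUs more naturally than the recursive B-spline evaluation of the original KAN formulation.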
Related papers
- Dynamic layer selection in decoder-only transformers [21.18795712840146]
We empirically examine two common dynamic inference methods for natural language generation.
We find that a pre-trained decoder-only model is significantly more robust to layer removal via layer skipping.
We also show that dynamic computation allocation on a per-sequence basis holds promise for significant efficiency gains.
arXiv Detail & Related papers (2024-10-26T00:44:11Z) - Efficient Point Transformer with Dynamic Token Aggregating for Point Cloud Processing [19.73918716354272]
We propose an efficient point TransFormer with Dynamic Token Aggregating (DTA-Former) for point cloud representation and processing.
It achieves SOTA performance while running up to 30$\times$ faster than prior point Transformers on the ModelNet40, ShapeNet, and airborne MultiSpectral LiDAR (MS-LiDAR) datasets.
arXiv Detail & Related papers (2024-05-23T20:50:50Z) - Kronecker-Factored Approximate Curvature for Modern Neural Network Architectures [85.76673783330334]
Two different settings of linear weight-sharing layers motivate two flavours of Kronecker-Factored Approximate Curvature (K-FAC)
We show they are exact for deep linear networks with weight-sharing in their respective setting.
We observe little difference between these two K-FAC variations when using them to train both a graph neural network and a vision transformer.
arXiv Detail & Related papers (2023-11-01T16:37:00Z) - TransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic Token Mixer for Visual Recognition [71.6546914957701]
We propose a lightweight Dual Dynamic Token Mixer (D-Mixer) that aggregates global information and local details in an input-dependent way.
We use D-Mixer as the basic building block to design TransXNet, a novel hybrid CNN-Transformer vision backbone network.
In the ImageNet-1K image classification task, TransXNet-T surpasses Swin-T by 0.3% in top-1 accuracy while requiring less than half of the computational cost.
arXiv Detail & Related papers (2023-10-30T09:35:56Z) - Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation [59.91357714415056]
We propose two Transformer variants: Context-Sharing Transformer (CST) and Semantic Gathering-Scattering Transformer (SGST).
CST learns the globally shared contextual information within image frames with lightweight computation; SGST models the semantic correlation separately for the foreground and background.
Compared with the baseline that uses vanilla Transformers for multi-stage fusion, ours increases the speed by 13 times and achieves new state-of-the-art ZVOS performance.
arXiv Detail & Related papers (2023-08-13T06:12:00Z) - RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z) - nnFormer: Interleaved Transformer for Volumetric Segmentation [50.10441845967601]
We introduce nnFormer, a powerful segmentation model with an interleaved architecture based on empirical combination of self-attention and convolution.
nnFormer achieves substantial improvements over previous transformer-based methods on two commonly used datasets, Synapse and ACDC.
arXiv Detail & Related papers (2021-09-07T17:08:24Z) - TCCT: Tightly-Coupled Convolutional Transformer on Time Series Forecasting [6.393659160890665]
We propose the concept of the tightly-coupled convolutional Transformer (TCCT) and three TCCT architectures.
Our experiments on real-world datasets show that our TCCT architectures greatly improve the performance of existing state-of-the-art Transformer models.
arXiv Detail & Related papers (2021-08-29T08:49:31Z) - Improving Convolutional Neural Networks Via Conservative Field
Regularisation and Integration [3.5665681694253903]
Green's function (GF) regularization is the first operation that regularizes the 2D or 3D feature space by forcing it to be conservative and physically interpretable.
Our results show that such regularization allows the network to learn faster, to have smoother training curves and to better generalize, without any additional parameter.
arXiv Detail & Related papers (2020-03-11T09:29:48Z) - A Spike in Performance: Training Hybrid-Spiking Neural Networks with
Quantized Activation Functions [6.574517227976925]
Spiking Neural Networks (SNNs) are a promising approach to energy-efficient computing.
We show how to maintain state-of-the-art accuracy when converting a non-spiking network into an SNN.
arXiv Detail & Related papers (2020-02-10T05:24:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.