PolyKAN: Efficient Fused GPU Operators for Polynomial Kolmogorov-Arnold Network Variants
- URL: http://arxiv.org/abs/2511.14852v1
- Date: Tue, 18 Nov 2025 19:05:16 GMT
- Title: PolyKAN: Efficient Fused GPU Operators for Polynomial Kolmogorov-Arnold Network Variants
- Authors: Mingkun Yu, Heming Zhong, Dan Huang, Yutong Lu, Jiazhi Jiang
- Abstract summary: Kolmogorov-Arnold Networks (KANs) promise higher expressive capability and stronger interpretability than Multi-Layer Perceptrons. We present a GPU-accelerated operator library, named PolyKAN, which is the first general open-source implementation of KAN and its variants.
- Score: 10.239332579225522
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Kolmogorov-Arnold Networks (KANs) promise higher expressive capability and stronger interpretability than Multi-Layer Perceptrons, particularly in the domain of AI for Science. However, practical adoption has been hindered by the low GPU utilization of existing parallel implementations. To address this challenge, we present a GPU-accelerated operator library, named PolyKAN, which is the first general open-source implementation of KAN and its variants. PolyKAN fuses the forward and backward passes of polynomial KAN layers into a concise set of optimized CUDA kernels. Four orthogonal techniques underpin the design: (i) a \emph{lookup table} with linear interpolation that replaces expensive runtime math-library functions; (ii) \emph{2D tiling} that exposes thread-level parallelism while preserving memory locality; (iii) a \emph{two-stage reduction} scheme that converts scattered atomic updates into a single controllable merge step; and (iv) \emph{coefficient-layout reordering} that yields unit-stride reads under the tiled schedule. Using a KAN variant, Chebyshev KAN, as a case study, PolyKAN delivers $1.2$--$10\times$ faster inference and $1.4$--$12\times$ faster training than a Triton + cuBLAS baseline, with identical accuracy on speech, audio-enhancement, and tabular-regression workloads on both high-end and consumer-grade GPUs.
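The core of technique (i) is straightforward to sketch. Below is a minimal, hypothetical illustration of a lookup table with linear interpolation standing in for `acosf` in a naive Chebyshev KAN forward kernel; the names, table size, and one-thread-per-output schedule are assumptions for exposition, not PolyKAN's actual kernels (which add the 2D tiling, two-stage reduction, and coefficient reordering described above).

```cuda
#include <cuda_runtime.h>

#define LUT_SIZE 1024
// acosf sampled at LUT_SIZE + 1 evenly spaced points on [-1, 1];
// filled once on the host via cudaMemcpyToSymbol before any launch.
__constant__ float d_acos_lut[LUT_SIZE + 1];

// Linear interpolation between the two nearest table entries.
__device__ float acos_lut(float x) {
    float xc = fminf(fmaxf(x, -1.0f), 1.0f);   // clamp to the table's domain
    float t  = (xc + 1.0f) * 0.5f * LUT_SIZE;  // map [-1, 1] -> [0, LUT_SIZE]
    int   i  = min((int)t, LUT_SIZE - 1);
    float f  = t - (float)i;
    return d_acos_lut[i] + f * (d_acos_lut[i + 1] - d_acos_lut[i]);
}

// Naive fused forward for one Chebyshev KAN layer:
//   y[o] = sum_i sum_k coeff[o][i][k] * T_k(x[i]),  T_k(x) = cos(k * acos(x)).
__global__ void cheb_kan_forward(const float* __restrict__ x,      // [n_in]
                                 const float* __restrict__ coeff,  // [n_out][n_in][K]
                                 float* __restrict__ y,            // [n_out]
                                 int n_in, int n_out, int K) {
    int o = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per output
    if (o >= n_out) return;
    float acc = 0.0f;
    for (int i = 0; i < n_in; ++i) {
        float theta = acos_lut(x[i]);  // inputs assumed pre-squashed into [-1, 1]
        for (int k = 0; k < K; ++k)
            acc += coeff[(o * n_in + i) * K + k] * __cosf(k * theta);
    }
    y[o] = acc;
}
```

The interpolation error is bounded by the table step, so the table size is a direct accuracy/speed knob.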
Related papers
- GSPN-2: Efficient Parallel Sequence Modeling [101.33780567131716]
The Generalized Spatial Propagation Network (GSPN) replaces quadratic self-attention with a line-scan propagation scheme. GSPN-2 establishes a new efficiency frontier for modeling global spatial context in vision applications.
arXiv Detail & Related papers (2025-11-28T07:26:45Z) - Evolution Strategies at the Hyperscale [57.75314521465674]
We introduce EGGROLL, an evolution strategies (ES) algorithm designed to scale backprop-free optimization to large population sizes. ES is a family of powerful black-box optimization methods that can handle non-differentiable or noisy objectives, but scaling it to large models is bottlenecked by the cost of full-rank perturbations. EGGROLL overcomes this by generating random matrices $A \in \mathbb{R}^{m \times r}$, $B \in \mathbb{R}^{n \times r}$ with $r \ll \min(m,n)$ to form a low-rank matrix perturbation $AB^\top$.
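The memory argument is easy to see in code. The following is a hypothetical sketch (illustrative names and row-major layouts, not EGGROLL's implementation) of applying a perturbed weight $(W + \sigma A B^\top)x$ without ever materializing the $m \times n$ perturbation:

```cuda
// Storage for the perturbation is O((m + n) * r) instead of O(m * n).
__global__ void gemv_rowmajor(const float* M, const float* x, float* y,
                              int rows, int cols, float alpha, float beta) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= rows) return;
    float acc = 0.0f;
    for (int j = 0; j < cols; ++j)
        acc += M[i * cols + j] * x[j];  // naive dot product per output row
    y[i] = alpha * acc + beta * y[i];
}

// Host-side call sequence (launch configs elided; Bt holds B transposed,
// i.e. an r-by-n row-major matrix; t is a scratch vector of length r):
//   gemv_rowmajor<<<g1, b>>>(W,  x, y, m, n, 1.0f,  0.0f);  // y  = W x
//   gemv_rowmajor<<<g2, b>>>(Bt, x, t, r, n, 1.0f,  0.0f);  // t  = B^T x
//   gemv_rowmajor<<<g3, b>>>(A,  t, y, m, r, sigma, 1.0f);  // y += sigma * A t
```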
arXiv Detail & Related papers (2025-11-20T18:56:05Z) - Learning Sparse Approximate Inverse Preconditioners for Conjugate Gradient Solvers on GPUs [25.22023084590467]
The conjugate gradient (CG) method is a prevalent solver for symmetric positive definite linear systems $Ax=b$. Existing learning-based methods often utilize Graph Neural Networks (GNNs) to improve preconditioner quality and speed up construction. We propose a learning-based method to generate GPU-friendly preconditioners, specifically using GNNs to construct Sparse Approximate Inverse (SPAI) preconditioners.
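The GPU appeal of SPAI is that applying the preconditioner $M \approx A^{-1}$ inside each CG iteration reduces to a sparse matrix-vector product, with no sequential triangular solves. A minimal sketch of that apply step, assuming a CSR layout (names are illustrative, not the paper's code):

```cuda
// z = M r with M ~= A^{-1} stored in CSR; one thread per row. Unlike
// ILU-type preconditioners, there is no forward/backward substitution,
// so all rows proceed in parallel.
__global__ void spai_apply_csr(const int* __restrict__ rowptr,
                               const int* __restrict__ colidx,
                               const float* __restrict__ val,
                               const float* __restrict__ r,
                               float* __restrict__ z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
        acc += val[k] * r[colidx[k]];
    z[i] = acc;
}
```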
arXiv Detail & Related papers (2025-10-31T14:42:48Z) - Tensor Decomposition Networks for Fast Machine Learning Interatomic Potential Computations [48.46721044282335]
Tensor decomposition networks (TDNs) are evaluated on PubChemQCR, a newly curated molecular relaxation dataset containing 105 million DFT-calculated snapshots. Results show that TDNs achieve competitive performance with a dramatic speedup in computation.
arXiv Detail & Related papers (2025-07-01T18:46:27Z) - Libra: Synergizing CUDA and Tensor Cores for High-Performance Sparse Matrix Multiplication [6.557224606759151]
Modern accelerators are commonly equipped with both CUDA cores and Tensor cores to accelerate sparse operators. We show that utilizing one resource alone leads to inferior performance for sparse matrix multiplication, due to their respective limitations. We propose a 2D-aware workload distribution strategy to find the sweet spot of task mapping for sparse operators, leveraging both the high performance of Tensor cores and the low computational redundancy of CUDA cores.
arXiv Detail & Related papers (2025-06-28T01:50:13Z) - Second-order Optimization of Gaussian Splats with Importance Sampling [51.95046424364725]
3D Gaussian Splatting (3DGS) is widely used for novel view rendering due to its high quality and fast inference time. We propose a novel second-order optimization strategy based on Levenberg-Marquardt (LM) and Conjugate Gradient (CG). Our method achieves a $3\times$ speedup over standard LM and outperforms Adam by $6\times$ when the Gaussian count is low.
arXiv Detail & Related papers (2025-04-17T12:52:08Z) - An Efficient Sparse Kernel Generator for O(3)-Equivariant Deep Networks [0.5737287537823071]
Rotation-equivariant graph neural networks yield state-of-the-art performance on spatial deep learning tasks. Key to these models is the Clebsch-Gordan (CG) tensor product, a kernel that contracts two dense feature vectors with a highly structured sparse tensor to produce a dense output vector. We introduce a GPU sparse kernel generator for the CG tensor product that provides significant speedups over the best existing open- and closed-source implementations.
arXiv Detail & Related papers (2025-01-23T08:20:47Z) - Kolmogorov-Arnold Transformer [72.88137795439407]
We introduce the Kolmogorov-Arnold Transformer (KAT), a novel architecture that replaces MLP layers with Kolmogorov-Arnold Network (KAN) layers.
We identify three key challenges: (C1) base function, (C2) parameter and computation inefficiency, and (C3) weight initialization.
With these designs, KAT outperforms traditional MLP-based transformers.
arXiv Detail & Related papers (2024-09-16T17:54:51Z) - Distributed Extra-gradient with Optimal Complexity and Communication Guarantees [60.571030754252824]
We consider monotone variational inequality (VI) problems in multi-GPU settings where multiple processors/workers/clients have access to local dual vectors.
Extra-gradient, which is a de facto algorithm for monotone VI problems, has not been designed to be communication-efficient.
We propose a quantized generalized extra-gradient (Q-GenX), which is an unbiased and adaptive compression method tailored to solve VIs.
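For background (standard material, not specific to Q-GenX), the extra-gradient update for a monotone operator $F$ with step size $\gamma$ takes an extrapolation step before the actual update:

```latex
x_{k+1/2} = x_k - \gamma F(x_k), \qquad
x_{k+1}   = x_k - \gamma F(x_{k+1/2})
```

In the distributed setting each worker contributes its local operator evaluation, and Q-GenX's contribution is to compress those exchanged vectors with an unbiased, adaptive quantizer.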
arXiv Detail & Related papers (2023-08-17T21:15:04Z) - Tensor Slicing and Optimization for Multicore NPUs [2.670309629218727]
This paper proposes a compiler optimization pass for Multicore NPUs, called Tensor Slicing Optimization (TSO).
Results show that TSO identifies the tensor slicing that minimizes execution time for a set of CNN models.
arXiv Detail & Related papers (2023-04-06T12:03:03Z) - Improved techniques for deterministic l2 robustness [63.34032156196848]
Training convolutional neural networks (CNNs) with a strict 1-Lipschitz constraint under the $l_2$ norm is useful for adversarial robustness, interpretable gradients and stable training.
We introduce a procedure to certify robustness of 1-Lipschitz CNNs by replacing the last linear layer with a 1-hidden-layer MLP.
We significantly advance the state-of-the-art for standard and provable robust accuracies on CIFAR-10 and CIFAR-100.
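For context on why the 1-Lipschitz constraint buys certification (stated as standard background from this literature, not quoted from the paper): if $f$ is 1-Lipschitz in the $l_2$ norm and $M_f(x) = f_t(x) - \max_{j \neq t} f_j(x)$ is the logit margin of the predicted class $t$, then

```latex
\|f(x) - f(x')\|_2 \le \|x - x'\|_2
\;\Longrightarrow\;
\arg\max_j f_j(x') = t
\quad \text{whenever} \quad
\|x - x'\|_2 < \frac{M_f(x)}{\sqrt{2}}
```

so the margin directly yields a certified $l_2$ radius.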
arXiv Detail & Related papers (2022-11-15T19:10:12Z) - Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications.
We propose a QR-based ED method dedicated to the application scenarios of computer vision.
arXiv Detail & Related papers (2022-07-09T09:14:12Z)