Kolmogorov–Arnold Network

A Kolmogorov–Arnold Network (KAN) is a neural network whose learnable parameters are the activation functions placed on each edge, rather than fixed activation functions on nodes combined with learnable weights on edges, as in a traditional multilayer perceptron. It is a practical application of the Kolmogorov–Arnold representation theorem.

Description

By the Kolmogorov–Arnold representation theorem, any multivariate continuous function f: [0, 1]^n → ℝ can be written as

f(\mathbf{x}) = f(x_1, \ldots, x_n) = \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)

for suitable functions φ_{q,p}: [0, 1] → ℝ and Φ_q: ℝ → ℝ. To create a deep network, the KAN extends this to a graph of such representations:

f(\mathbf{x}) = \sum_{i_{L-1}=1}^{n_{L-1}} \phi_{L-1,\, i_L,\, i_{L-1}}\!\left( \sum_{i_{L-2}=1}^{n_{L-2}} \cdots \left( \sum_{i_1=1}^{n_1} \phi_{1,\, i_2,\, i_1}\!\left( \sum_{i_0=1}^{n_0} \phi_{0,\, i_1,\, i_0}(x_{i_0}) \right) \right) \cdots \right)

where each φ_{l,i,j} is a learnable B-spline. Since this representation is differentiable, the spline coefficients can be learned by any standard backpropagation technique.
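The edge functions and the summation structure above can be made concrete in a short sketch. The following is a minimal, illustrative implementation in Python using PyTorch, not the authors' reference code: for brevity each learnable edge function is parameterized by coefficients over fixed Gaussian radial basis functions on a grid rather than true B-splines, and all names (KANLayer, grid_size, and so on) are invented for this example.

    import torch
    import torch.nn as nn

    class KANLayer(nn.Module):
        """One KAN layer: a learnable 1-D function phi_{i,j} on every edge (i, j).

        Sketch only: Gaussian radial basis functions stand in for the B-spline
        bases used by Liu et al. (2024); the structure (sum over inputs of
        per-edge learned functions) is the same.
        """
        def __init__(self, n_in, n_out, grid_size=8, x_min=-2.0, x_max=2.0):
            super().__init__()
            grid = torch.linspace(x_min, x_max, grid_size)   # basis centres
            self.register_buffer("grid", grid)
            self.width = (x_max - x_min) / (grid_size - 1)   # basis width
            # one coefficient vector per edge: shape (n_out, n_in, grid_size)
            self.coef = nn.Parameter(0.1 * torch.randn(n_out, n_in, grid_size))

        def forward(self, x):                                # x: (batch, n_in)
            # evaluate every basis function at every input coordinate
            basis = torch.exp(-((x[..., None] - self.grid) / self.width) ** 2)
            # phi_{j,i}(x_i) = sum_g coef[j,i,g] * basis_g(x_i); then sum over i
            return torch.einsum("big,oig->bo", basis, self.coef)

    class KAN(nn.Module):
        """Stack of KAN layers with widths [n_0, n_1, ..., n_L]."""
        def __init__(self, widths):
            super().__init__()
            self.layers = nn.ModuleList(
                KANLayer(a, b) for a, b in zip(widths[:-1], widths[1:])
            )

        def forward(self, x):
            for layer in self.layers:
                x = layer(x)
            return x

    # Because the model is differentiable, ordinary backpropagation applies:
    model = KAN([2, 5, 1])
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    x = torch.rand(256, 2) * 2 - 1                           # toy inputs in [-1, 1]
    y = torch.sin(x[:, :1]) + x[:, 1:] ** 2                  # toy target
    for _ in range(500):
        opt.zero_grad()
        loss = ((model(x) - y) ** 2).mean()
        loss.backward()
        opt.step()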

For regularization, the L1 norm of the activation functions is not sufficient by itself.[1] To keep the network sparse and prevent overfitting, an additional entropy term S is introduced for each layer Φ:

S(\Phi) = -\sum_{i=1}^{n_{\mathrm{in}}} \sum_{j=1}^{n_{\mathrm{out}}} \frac{|\phi_{i,j}|_1}{|\Phi|_1} \log\!\left( \frac{|\phi_{i,j}|_1}{|\Phi|_1} \right)

A linear combination of these terms and the L1 norm over all layers produces an effective regularization penalty. This sparse representation helps the deep network to overcome the curse of dimensionality.[2]
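Written out, such a combined objective adds the penalties, summed over all L layers, to the prediction loss. Following the structure described in Liu et al.,[1] it takes the form below, where ℓ_pred denotes the prediction loss and λ, μ₁, μ₂ are scalar hyperparameters weighting the two penalty terms (these symbols are introduced here for illustration):

\ell_{\text{total}} = \ell_{\text{pred}} + \lambda \left( \mu_1 \sum_{l=0}^{L-1} |\Phi_l|_1 + \mu_2 \sum_{l=0}^{L-1} S(\Phi_l) \right)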

Properties

The number of parameters in such a representation is O(N^2 L G), where N is the width of each layer, L is the depth of the network, and G is the number of intervals over which each spline is defined. This might appear to be greater than the O(N^2 L) parameters needed to train a multilayer perceptron of depth L and width N; however, Liu et al. argue that in scientific domains a KAN can achieve equivalent performance with fewer parameters, since many natural functions can be decomposed efficiently into splines.[1]
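As a rough illustration with hypothetical values N = 10, L = 3 and G = 5 (chosen here only to make the arithmetic concrete, not taken from the paper):

N^2 L G = 10^2 \times 3 \times 5 = 1500 \qquad \text{versus} \qquad N^2 L = 10^2 \times 3 = 300

so at equal width and depth a KAN carries roughly G times as many parameters as the corresponding multilayer perceptron; the argument of Liu et al.[1] is that a KAN can often match the accuracy of a much wider or deeper perceptron, more than compensating for the factor of G.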

KANs have been shown to perform well on problems from knot theory and physics (such as Anderson localization), although they have not yet been scaled to language models.[1]

History

Computing the optimal Kolmogorov–Arnold representation for a given function has been a research challenge since at least 1993.[3] Early work applied the representation in networks of fixed depth two, which did not appear to be a promising alternative to perceptrons for image processing.[4][5]

More recently, a deep learning algorithm was proposed for constructing these representations using Urysohn operators.[6] In 2021, the representation was successfully applied to a deeper network.[7] In 2022, ExSpliNet combined the Kolmogorov–Arnold representation with B-splines and probabilistic trees, showed good performance on the Iris flower, MNIST, and FMNIST datasets, and argued that this representation is more interpretable than perceptrons.[8]

The term "Kolmogorov–Arnold Network" was introduced by Liu et al. in 2024. Their work generalized these networks to arbitrary width and depth and demonstrated strong performance on a realistic class of multivariate functions, although training remains inefficient.[1]

References

  1. ^ a b c d Liu, Ziming; Wang, Yixuan; Vaidya, Sachin; Ruehle, Fabian; Halverson, James; Soljačić, Marin; Hou, Thomas Y.; Tegmark, Max (2024). "KAN: Kolmogorov-Arnold Networks". arXiv:2404.19756 [cs.LG].
  2. ^ Poggio, Tomaso (2022). "How deep sparse networks avoid the curse of dimensionality: Efficiently computable functions are compositionally sparse". Center for Brains, Minds and Machines Memo. 10.
  3. ^ Lin, Ji-Nan; Unbehauen, Rolf (January 1993). "On the Realization of a Kolmogorov Network". Neural Computation. 5 (1): 18–20. doi:10.1162/neco.1993.5.1.18.
  4. ^ Köppen, Mario (2002). "On the Training of a Kolmogorov Network". Artificial Neural Networks — ICANN 2002. Lecture Notes in Computer Science. 2415: 474–479. doi:10.1007/3-540-46084-5_77. ISBN 978-3-540-44074-1.
  5. ^ Leni, Pierre-Emmanuel; Fougerolle, Yohan D.; Truchetet, Frédéric (2013). "The Kolmogorov Spline Network for Image Processing". Image Processing: Concepts, Methodologies, Tools, and Applications. IGI Global. ISBN 9781466639942.
  6. ^ Polar, Andrew; Poluektov, M. (30 December 2020). "A deep machine learning algorithm for construction of the Kolmogorov–Arnold representation". Engineering Applications of Artificial Intelligence. 99: 104137. arXiv:2001.04652. doi:10.1016/j.engappai.2020.104137.
  7. ^ Schmidt-Hieber, Johannes (May 2021). "The Kolmogorov–Arnold representation theorem revisited". Neural Networks. 137: 119–126. doi:10.1016/j.neunet.2021.01.020.
  8. ^ Fakhoury, Daniele; Fakhoury, Emanuele; Speleers, Hendrik (August 2022). "ExSpliNet: An interpretable and expressive spline-based neural network". Neural Networks. 152: 332–346. doi:10.1016/j.neunet.2022.04.029.