Related papers: DyKAF: Dynamical Kronecker Approximation of the Fisher Information Matrix for Gradient Preconditioning

DyKAF: Dynamical Kronecker Approximation of the Fisher Information Matrix for Gradient Preconditioning

URL: http://arxiv.org/abs/2511.06477v1
Date: Sun, 09 Nov 2025 17:48:26 GMT
Title: DyKAF: Dynamical Kronecker Approximation of the Fisher Information Matrix for Gradient Preconditioning
Authors: Nikolay Yudin, Ekaterina Grishina, Andrey Veprikov, Alexandr Beznosikov, Maxim Rakhuba,
Abstract summary: We introduce DyKAF (Dynamic Kronalecker Approximation of the Fisher Matrix), which consistently improves the Fisher matrix quality pre-tuning model.<n>Our experiments demonstrate that DyKAF outperforms existing approximations across a range of evaluation metrics.
Score: 47.17050585542348
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, optimizers that explicitly treat weights as matrices, rather than flattened vectors, have demonstrated their effectiveness. This perspective naturally leads to structured approximations of the Fisher matrix as preconditioners, where the matrix view induces a Kronecker-factorized form that enables memory-efficient representation. However, constructing such approximations both efficiently and accurately remains an open challenge, since obtaining the optimal factorization is resource-intensive and practical methods therefore rely on heuristic design choices. In this work, we introduce a novel approach that leverages projector-splitting integrators to construct effective preconditioners. Our optimizer, DyKAF (Dynamical Kronecker Approximation of the Fisher Matrix), consistently improves the Fisher matrix approximation quality. Experiments on large language model pre-training and fine-tuning demonstrate that DyKAF outperforms existing optimizers across a range of evaluation metrics.

Related papers

FISMO: Fisher-Structured Momentum-Orthogonalized Optimizer [30.184978506988767]
We introduce FISMO, which incorporates anisotropic neuralotropic geometry information through Fisher information geometry.<n> FISMO achieves superior efficiency and final performance compared to established baselines.
arXiv Detail & Related papers (2026-01-29T14:05:04Z)
Self-Boost via Optimal Retraining: An Analysis via Approximate Message Passing [58.52119063742121]
Retraining a model using its own predictions together with the original, potentially noisy labels is a well-known strategy for improving the model performance.<n>This paper addresses the question of how to optimally combine the model's predictions and the provided labels.<n>Our main contribution is the derivation of the Bayes optimal aggregator function to combine the current model's predictions and the given labels.
arXiv Detail & Related papers (2025-05-21T07:16:44Z)
Spectrum-Aware Parameter Efficient Fine-Tuning for Diffusion Models [73.88009808326387]
We propose a novel spectrum-aware adaptation framework for generative models. Our method adjusts both singular values and their basis vectors of pretrained weights. We introduce Spectral Ortho Decomposition Adaptation (SODA), which balances computational efficiency and representation capacity.
arXiv Detail & Related papers (2024-05-31T17:43:35Z)
Regularized Projection Matrix Approximation with Applications to Community Detection [1.3761665705201904]
This paper introduces a regularized projection matrix approximation framework designed to recover cluster information from the affinity matrix. We investigate three distinct penalty functions, each specifically tailored to address bounded, positive, and sparse scenarios. Numerical experiments conducted on both synthetic and real-world datasets reveal that our regularized projection matrix approximation approach significantly outperforms state-of-the-art methods in clustering performance.
arXiv Detail & Related papers (2024-05-26T15:18:22Z)
Sparse high-dimensional linear regression with a partitioned empirical Bayes ECM algorithm [62.997667081978825]
We propose a computationally efficient and powerful Bayesian approach for sparse high-dimensional linear regression. Minimal prior assumptions on the parameters are used through the use of plug-in empirical Bayes estimates. The proposed approach is implemented in the R package probe.
arXiv Detail & Related papers (2022-09-16T19:15:50Z)
An Adaptive Alternating-direction-method-based Nonnegative Latent Factor Model [2.857044909410376]
An alternating-direction-method-based nonnegative latent factor model can perform efficient representation learning to a high-dimensional and incomplete (HDI) matrix. This paper proposes an Adaptive Alternating-direction-method-based Nonnegative Latent Factor model, whose hyper- parameter adaptation is implemented following the principle of particle swarm optimization. Empirical studies on nonnegative HDI matrices generated by industrial applications indicate that A2NLF outperforms several state-of-the-art models in terms of computational and storage efficiency, as well as maintains highly competitive estimation accuracy for an HDI matrix's missing data
arXiv Detail & Related papers (2022-04-11T03:04:26Z)
Two-Level K-FAC Preconditioning for Deep Learning [7.699428789159717]
In the context of deep learning, many optimization methods use gradient covariance information in order to accelerate the convergence of Gradient Descent. In particular, starting with Adagrad, a seemingly endless line of research advocates the use of diagonal approximations of the so-called empirical Fisher matrix. One particularly successful variant of such methods is the so-called K-FAC, which uses a Kronecker-ed block-factored preconditioner.
arXiv Detail & Related papers (2020-11-01T17:54:21Z)
Efficient Model-Based Collaborative Filtering with Fast Adaptive PCA [4.878057307346225]
A model-based collaborative filtering (CF) approach utilizing fast adaptive randomized singular value decomposition (SVD) is proposed. A novel termination mechanism for the adaptive PCA is proposed to automatically determine a number of latent factors for achieving the near optimal prediction accuracy. The proposed model-based CF approach is able to efficiently process Matlab MovieLen data with 20M ratings and exhibits more than 10X speedup over the regularized factorization based approach.
arXiv Detail & Related papers (2020-09-04T15:32:14Z)
Robust, Accurate Stochastic Optimization for Variational Inference [68.83746081733464]
We show that common optimization methods lead to poor variational approximations if the problem is moderately large. Motivated by these findings, we develop a more robust and accurate optimization framework by viewing the underlying algorithm as producing a Markov chain.
arXiv Detail & Related papers (2020-09-01T19:12:11Z)
Augmentation of the Reconstruction Performance of Fuzzy C-Means with an Optimized Fuzzification Factor Vector [99.19847674810079]
Fuzzy C-Means (FCM) is one of the most frequently used methods to construct information granules. In this paper, we augment the FCM-based degranulation mechanism by introducing a vector of fuzzification factors. Experiments completed for both synthetic and publicly available datasets show that the proposed approach outperforms the generic data reconstruction approach.
arXiv Detail & Related papers (2020-04-13T04:17:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.