Bridging Training and Merging Through Momentum-Aware Optimization
- URL: http://arxiv.org/abs/2512.17109v1
- Date: Thu, 18 Dec 2025 22:37:33 GMT
- Title: Bridging Training and Merging Through Momentum-Aware Optimization
- Authors: Alireza Moayedikia, Alicia Troncoso
- Abstract summary: Training large neural networks and merging task-specific models both require parameter importance estimation. Current workflows compute curvature information during training, discard it, then recompute similar information for merging. We introduce a unified framework that maintains factorized momentum and curvature statistics during training, then reuses this information for geometry-aware model composition.
- Score: 8.035521056416242
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training large neural networks and merging task-specific models both exploit low-rank structure and require parameter importance estimation, yet these challenges have been pursued in isolation. Current workflows compute curvature information during training, discard it, then recompute similar information for merging -- wasting computation and discarding valuable trajectory data. We introduce a unified framework that maintains factorized momentum and curvature statistics during training, then reuses this information for geometry-aware model composition. The proposed method achieves memory efficiency comparable to state-of-the-art approaches while accumulating task saliency scores that enable curvature-aware merging without post-hoc Fisher computation. We establish convergence guarantees for non-convex objectives with approximation error bounded by gradient singular value decay. On natural language understanding benchmarks, curvature-aware parameter selection outperforms magnitude-only baselines across all sparsity levels, with multi-task merging improving over strong baselines. The proposed framework exhibits rank-invariant convergence and superior hyperparameter robustness compared to existing low-rank optimizers. By treating the optimization trajectory as a reusable asset rather than discarding it, our approach eliminates redundant computation while enabling more principled model composition.
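To make the described workflow concrete, below is a minimal sketch of the idea in the abstract: keep the optimizer's second-moment statistics as a diagonal curvature proxy (the paper uses factorized, memory-efficient statistics instead), treat them as per-parameter saliency scores, and reuse them to weight task deltas at merge time. The Adam-style update and all function names are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def train_with_saliency(theta, grad_fn, steps=1000, lr=1e-2,
                        beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-style loop that keeps its second-moment (curvature proxy)
    statistics instead of discarding them after training."""
    m = np.zeros_like(theta)   # momentum (first moment)
    v = np.zeros_like(theta)   # diagonal curvature proxy (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, v_hat        # v_hat doubles as a per-parameter saliency score

def saliency_merge(theta_base, task_thetas, task_saliencies, eps=1e-12):
    """Curvature-aware merge: weight each task's parameter delta by its
    accumulated saliency, normalised per parameter across tasks."""
    total = sum(task_saliencies) + eps
    delta = sum((s / total) * (th - theta_base)
                for th, s in zip(task_thetas, task_saliencies))
    return theta_base + delta
```

With two toy tasks, `saliency_merge(theta0, [theta_a, theta_b], [v_a, v_b])` shifts each merged coordinate toward whichever task found it more important, without any post-hoc Fisher pass.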
Related papers
- Scalable Gaussian process modeling of parametrized spatio-temporal fields [2.005299372367689]
We develop a scalable framework for learning parametrized spatio-temporal fields over fixed or parameter-dependent domains. A key feature of our approach is the efficient computation of the posterior variance at essentially the same computational cost as the posterior mean. Results establish the proposed framework as an effective tool for data-driven surrogate modeling, particularly when uncertainty estimates are required for downstream tasks.
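The claim that the posterior variance costs essentially the same as the posterior mean is the standard behaviour of Gaussian process regression once the Cholesky factor of the kernel matrix is cached and reused. A generic numpy sketch follows (squared-exponential kernel; all names are illustrative, not the paper's scalable parametrized formulation).

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    d2 = (np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X, y, Xstar, noise=1e-3):
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)              # dominant O(n^3) cost, computed once
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    Ks = rbf_kernel(X, Xstar)
    mean = Ks.T @ alpha                    # posterior mean
    V = np.linalg.solve(L, Ks)             # reuses the same factor
    var = np.diag(rbf_kernel(Xstar, Xstar)) - np.sum(V**2, axis=0)
    return mean, var                       # variance for roughly the same extra cost
```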
arXiv Detail & Related papers (2026-02-27T20:16:21Z) - Closing the Generalization Gap in Parameter-efficient Federated Edge Learning [43.00634399799955]
Federated edge learning (FEEL) provides a promising foundation for artificial intelligence (AI) at the network edge, but limited and heterogeneous local datasets, as well as resource-constrained deployment, severely degrade both model generalization and resource utilization. We propose a framework that jointly leverages model minimization and generalization selection to tackle these challenges.
arXiv Detail & Related papers (2025-11-28T15:34:09Z) - OFMU: Optimization-Driven Framework for Machine Unlearning [5.100622189286672]
Large language models increasingly require the ability to unlearn specific knowledge, whether due to user requests, copyrighted material, or outdated information. We propose OFMU, a penalty-based bi-level optimization framework that explicitly prioritizes forgetting while preserving retention. We show that OFMU consistently outperforms existing unlearning methods in both efficacy and retained utility.
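The summary gives only the high-level objective; as a loose illustration (not OFMU's actual bi-level scheme), a single-level penalty surrogate of "prioritize forgetting while preserving retention" is gradient ascent on the forget-set loss with a penalty on the retain-set loss, where the penalty weight is a hypothetical hyperparameter.

```python
def unlearning_step(theta, grad_forget, grad_retain, lr=1e-3, penalty=10.0):
    """One step on the surrogate objective  -L_forget(theta) + penalty * L_retain(theta):
    push the forget-set loss up while keeping the retain-set loss low."""
    g = -grad_forget(theta) + penalty * grad_retain(theta)
    return theta - lr * g
```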
arXiv Detail & Related papers (2025-09-26T15:31:32Z) - On Information Geometry and Iterative Optimization in Model Compression: Operator Factorization [5.952537659103525]
We argue that many successful model compression approaches can be understood as implicitly approximating information divergences for this projection. We prove convergence of iterative singular value thresholding for training neural networks subject to a soft rank constraint.
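Iterative singular value thresholding under a soft rank constraint has a compact generic form: a proximal-gradient loop whose proximal step soft-thresholds the singular values (the nuclear-norm prox). The toy loop below is a sketch of that standard formulation, not the paper's exact training procedure.

```python
import numpy as np

def svt(W, tau):
    """Proximal operator of tau * nuclear norm: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def train_soft_rank(W, grad_fn, steps=200, lr=1e-2, tau=1e-2):
    """Proximal gradient descent: task-loss gradient step, then SVT, which
    softly biases W toward low rank instead of hard-truncating it."""
    for _ in range(steps):
        W = svt(W - lr * grad_fn(W), lr * tau)
    return W
```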
arXiv Detail & Related papers (2025-07-12T23:39:14Z) - Train with Perturbation, Infer after Merging: A Two-Stage Framework for Continual Learning [57.514786046966265]
We propose Perturb-and-Merge (P&M), a novel continual learning framework that integrates model merging into the CL paradigm to mitigate forgetting. Our proposed approach achieves state-of-the-art performance on several continual learning benchmark datasets.
arXiv Detail & Related papers (2025-05-28T14:14:19Z) - NAN: A Training-Free Solution to Coefficient Estimation in Model Merging [61.36020737229637]
We show that the optimal merging weights should scale with the amount of task-specific information encoded in each model. We propose NAN, a simple yet effective method that estimates model merging coefficients via the inverse of parameter norm. NAN is training-free, plug-and-play, and applicable to a wide range of merging strategies.
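The stated rule (merging coefficients from the inverse of the parameter norm) is simple enough to sketch directly; applying it to task vectors and normalising the coefficients is an assumption of this illustration, not necessarily the paper's exact recipe.

```python
import numpy as np

def inverse_norm_coefficients(task_vectors, eps=1e-12):
    """Merging coefficients proportional to 1 / ||task vector||, normalised to sum to 1."""
    inv = np.array([1.0 / (np.linalg.norm(tv) + eps) for tv in task_vectors])
    return inv / inv.sum()

def merge_models(theta_base, task_vectors):
    """Training-free merge of task vectors with inverse-norm coefficients."""
    coeffs = inverse_norm_coefficients(task_vectors)
    return theta_base + sum(c * tv for c, tv in zip(coeffs, task_vectors))
```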
arXiv Detail & Related papers (2025-05-22T02:46:08Z) - Self-Boost via Optimal Retraining: An Analysis via Approximate Message Passing [58.52119063742121]
Retraining a model using its own predictions together with the original, potentially noisy labels is a well-known strategy for improving model performance. This paper addresses the question of how to optimally combine the model's predictions and the provided labels. Our main contribution is the derivation of the Bayes-optimal aggregator function for combining the current model's predictions and the given labels.
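For intuition only, here is the Bayes-optimal combination in the simplest special case: binary labels, a symmetric label-noise model with a known flip rate, and the model's prediction used as the prior. The paper derives the general aggregator via approximate message passing; this sketch is just textbook Bayes' rule under those assumed conditions.

```python
def bayes_aggregate(p_model, y_observed, flip_rate):
    """Posterior P(y = 1 | model prediction, observed label) when the observed
    label is flipped with probability `flip_rate` independently of the input."""
    like_pos = 1.0 - flip_rate if y_observed == 1 else flip_rate  # P(y_obs | y = 1)
    like_neg = flip_rate if y_observed == 1 else 1.0 - flip_rate  # P(y_obs | y = 0)
    num = like_pos * p_model
    return num / (num + like_neg * (1.0 - p_model))

# Example: a confident model (p = 0.9) and a contradicting noisy label (y_obs = 0,
# 20% flip rate) gives bayes_aggregate(0.9, 0, 0.2) ~= 0.69: the label tempers
# the prediction but does not override it.
```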
arXiv Detail & Related papers (2025-05-21T07:16:44Z) - Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization.
A self-regularization strategy is further exploited to maintain stability in terms of the zero-shot generalization of VLMs; the overall method is dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the models in the few-shot image classification scenario.
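Orthogonal fine-tuning is usually parametrized by learning a skew-symmetric matrix, mapping it to an orthogonal one (e.g. via a Cayley transform), and rotating the frozen pretrained weight; whether OrthSR uses exactly this construction is not stated in the summary, so treat the sketch as a generic illustration.

```python
import numpy as np

def cayley_orthogonal(A):
    """Map any square matrix to an orthogonal one via its skew-symmetric part
    and the Cayley transform (I + S)^{-1} (I - S)."""
    S = 0.5 * (A - A.T)
    I = np.eye(A.shape[0])
    return np.linalg.solve(I + S, I - S)

def orthogonal_finetune(W_pretrained, A):
    """Rotate the frozen pretrained weight by a learned orthogonal matrix;
    left-multiplication by an orthogonal matrix preserves W's singular values."""
    return cayley_orthogonal(A) @ W_pretrained
```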
arXiv Detail & Related papers (2024-07-11T10:35:53Z) - Robust Hyperbolic Learning with Curvature-Aware Optimization [7.89323764547292]
Current hyperbolic learning approaches are prone to overfitting and instability, and are computationally expensive. We introduce a novel fine-tunable hyperbolic scaling approach that constrains hyperbolic embeddings and reduces approximation errors. Our approach demonstrates consistent improvements across computer vision, EEG classification, and hierarchical metric learning tasks.
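The "hyperbolic scaling" is described only at a high level; a common way to constrain embeddings in the Poincare ball and avoid boundary-induced numerical error is to rescale features before the exponential map and clip norms strictly inside the ball. The sketch below shows that generic recipe, not necessarily the paper's fine-tunable variant.

```python
import numpy as np

def expmap0(v, c=1.0):
    """Exponential map at the origin of the Poincare ball with curvature -c."""
    norm = np.linalg.norm(v, axis=-1, keepdims=True).clip(min=1e-12)
    return np.tanh(np.sqrt(c) * norm) * v / (np.sqrt(c) * norm)

def constrained_hyperbolic_embed(x, scale=0.5, c=1.0, max_norm=1.0 - 1e-5):
    """Shrink Euclidean features by a tunable scale before mapping to the ball,
    then clip so embeddings stay strictly away from the boundary."""
    z = expmap0(scale * x, c)
    norm = np.linalg.norm(z, axis=-1, keepdims=True).clip(min=1e-12)
    return z * np.minimum(1.0, (max_norm / np.sqrt(c)) / norm)
```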
arXiv Detail & Related papers (2024-05-22T20:30:14Z) - Consensus-Adaptive RANSAC [104.87576373187426]
We propose a new RANSAC framework that learns to explore the parameter space by considering the residuals seen so far via a novel attention layer.
The attention mechanism operates on a batch of point-to-model residuals, and updates a per-point estimation state to take into account the consensus found through a lightweight one-step transformer.
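The point-to-model residuals the attention layer consumes come from the standard RANSAC inner loop; the vanilla 2-D line-fitting sketch below shows where those residuals arise (the attention-based state update itself is not reproduced here).

```python
import numpy as np

def fit_line(p1, p2):
    """Line through two points as unit-normalised (a, b, c) with a*x + b*y + c = 0."""
    (x1, y1), (x2, y2) = p1, p2
    a, b = y2 - y1, x1 - x2
    c = -(a * x1 + b * y1)
    n = np.hypot(a, b) + 1e-12
    return a / n, b / n, c / n

def ransac_line(points, iters=500, thresh=0.05, seed=0):
    """points: (N, 2) array. Returns the best line and its inlier mask."""
    rng = np.random.default_rng(seed)
    best_model, best_inliers = None, np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        i, j = rng.choice(len(points), size=2, replace=False)
        a, b, c = fit_line(points[i], points[j])
        residuals = np.abs(points @ np.array([a, b]) + c)   # per-point residuals
        inliers = residuals < thresh
        if inliers.sum() > best_inliers.sum():
            best_model, best_inliers = (a, b, c), inliers
    return best_model, best_inliers
```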
arXiv Detail & Related papers (2023-07-26T08:25:46Z) - Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
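Distribution matching in its simplest form means moving the synthetic set so that its embedding statistics match the real data's. The sketch below uses a fixed random linear embedding and matches only the mean; real methods use nonlinear, per-class embeddings, so this illustrates the principle rather than the paper's method.

```python
import numpy as np

def condense_by_distribution_matching(real, syn, steps=200, lr=0.5, dim=64, seed=0):
    """Gradient descent on || mean-embedding(real) - mean-embedding(syn) ||^2
    with a fixed random linear feature map."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(real.shape[1], dim)) / np.sqrt(real.shape[1])
    mu_real = (real @ W).mean(axis=0)
    for _ in range(steps):
        mu_syn = (syn @ W).mean(axis=0)
        grad = -2.0 / len(syn) * (mu_real - mu_syn) @ W.T   # same for every point
        syn = syn - lr * grad
    return syn
```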
arXiv Detail & Related papers (2023-07-19T04:07:33Z) - Relative gradient optimization of the Jacobian term in unsupervised deep learning [9.385902422987677]
Learning expressive probabilistic models correctly describing the data is a ubiquitous problem in machine learning.
Deep density models have been widely used for this task, but their maximum likelihood based training requires estimating the log-determinant of the Jacobian.
We propose a new approach for exact training of such neural networks.
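The log-determinant term referred to above is easy to write down for a single invertible layer; the sketch below evaluates the exact log-likelihood of a one-layer tanh flow with a standard-normal base (the paper's contribution, computing the gradient of this term cheaply via relative gradients, is not reproduced).

```python
import numpy as np

def flow_log_likelihood(x, W, b):
    """Exact log p(x) for z = tanh(x @ W + b) pushed onto a standard normal base.
    Requires W square and invertible; the Jacobian is diag(1 - z^2) @ W^T per sample."""
    z = np.tanh(x @ W + b)
    d = z.shape[1]
    log_base = -0.5 * np.sum(z**2, axis=1) - 0.5 * d * np.log(2.0 * np.pi)
    _, logabsdet_W = np.linalg.slogdet(W)
    log_det_jac = np.sum(np.log(1.0 - z**2 + 1e-12), axis=1) + logabsdet_W
    return log_base + log_det_jac
```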
arXiv Detail & Related papers (2020-06-26T16:41:08Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)