Study of Training Dynamics for Memory-Constrained Fine-Tuning
- URL: http://arxiv.org/abs/2510.19675v1
- Date: Wed, 22 Oct 2025 15:21:05 GMT
- Title: Study of Training Dynamics for Memory-Constrained Fine-Tuning
- Authors: Aël Quélennec, Nour Hezbri, Pavlo Mozharovskyi, Van-Tam Nguyen, Enzo Tartaglione
- Abstract summary: TraDy is a novel transfer learning scheme for deep neural networks. It achieves state-of-the-art performance across various downstream tasks and architectures while maintaining strict memory constraints, achieving up to 99% activation sparsity, 95% weight derivative sparsity, and a 97% reduction in FLOPs for weight derivative computation.
- Score: 19.283663659539588
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Memory-efficient training of deep neural networks has become increasingly important as models grow larger while deployment environments impose strict resource constraints. We propose TraDy, a novel transfer learning scheme leveraging two key insights: layer importance for updates is architecture-dependent and determinable a priori, while dynamic stochastic channel selection provides superior gradient approximation compared to static approaches. We introduce a dynamic channel selection approach that stochastically resamples channels between epochs within preselected layers. Extensive experiments demonstrate TraDy achieves state-of-the-art performance across various downstream tasks and architectures while maintaining strict memory constraints, achieving up to 99% activation sparsity, 95% weight derivative sparsity, and 97% reduction in FLOPs for weight derivative computation.
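Since the abstract only describes the mechanism at a high level, a minimal PyTorch sketch of the dynamic channel-resampling idea may help. It is not the authors' implementation: the layer-selection heuristic, the keep ratio, and the gradient-hook masking below are illustrative assumptions, and unlike TraDy the sketch does not skip the masked computation or sparsify stored activations, so it shows only the per-epoch selection logic, not the memory or FLOP savings.

```python
import torch
import torch.nn as nn

def sample_channel_mask(layer: nn.Conv2d, keep_ratio: float) -> torch.Tensor:
    """Randomly pick a fraction of output channels whose weight gradients are kept."""
    n_out = layer.weight.shape[0]
    n_keep = max(1, int(round(keep_ratio * n_out)))
    kept = torch.randperm(n_out)[:n_keep]
    mask = torch.zeros(n_out, dtype=layer.weight.dtype, device=layer.weight.device)
    mask[kept] = 1.0
    return mask.view(-1, 1, 1, 1)  # broadcast over (out, in, kh, kw)

def register_epoch_masks(selected_layers, keep_ratio):
    """Resample the channel subset for each preselected layer; call once per epoch."""
    handles = []
    for layer in selected_layers:
        mask = sample_channel_mask(layer, keep_ratio)
        # The hook multiplies the weight gradient by the mask, so only the
        # channels sampled for this epoch receive updates.
        handles.append(layer.weight.register_hook(lambda g, m=mask: g * m))
    return handles

# Illustrative usage: treat the last two conv layers as the "important" ones.
# (The paper determines layer importance a priori; this heuristic is a placeholder.)
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1),
)
selected = [m for m in model.modules() if isinstance(m, nn.Conv2d)][-2:]

for epoch in range(3):
    handles = register_epoch_masks(selected, keep_ratio=0.25)
    loss = model(torch.randn(4, 3, 8, 8)).mean()
    loss.backward()
    # ... optimizer step / zero_grad would go here ...
    for h in handles:
        h.remove()  # drop this epoch's masks before resampling
```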
Related papers
- Memory Constrained Dynamic Subnetwork Update for Transfer Learning [20.05842386680307]
MeDyate is a theoretically-grounded framework for memory-constrained dynamic subnetwork adaptation. It achieves state-of-the-art performance under extreme memory constraints.
arXiv Detail & Related papers (2025-10-23T20:16:43Z) - Towards the Training of Deeper Predictive Coding Neural Networks [44.14001498773255]
Predictive coding networks are neural models that perform inference through an iterative energy minimization process. While effective in shallow architectures, they suffer significant performance degradation beyond five to seven layers. We show that this degradation is caused by exponentially imbalanced errors between layers during weight updates, and by predictions from the previous layers not being effective in guiding updates in deeper layers.
arXiv Detail & Related papers (2025-06-30T12:44:47Z) - Stochastic Engrams for Efficient Continual Learning with Binarized Neural Networks [4.014396794141682]
We propose a novel approach that integrates stochastically-activated engrams as a gating mechanism for metaplastic binarized neural networks (mBNNs). Our findings demonstrate (A) an improved stability vs. plasticity trade-off, (B) reduced memory intensiveness, and (C) enhanced performance in binarized architectures.
arXiv Detail & Related papers (2025-03-27T12:21:00Z) - SURGEON: Memory-Adaptive Fully Test-Time Adaptation via Dynamic Activation Sparsity [30.260783715373382]
Test-time adaptation (TTA) has emerged to improve the performance of deep models by adapting them to unlabeled target data online. Yet, the significant memory cost, particularly in resource-constrained terminals, impedes the effective deployment of most backward-propagation-based TTA methods. To tackle memory constraints, we introduce SURGEON, a method that substantially reduces memory cost while preserving comparable accuracy improvements.
arXiv Detail & Related papers (2025-03-26T09:27:09Z) - Deep Linear Network Training Dynamics from Random Initialization: Data, Width, Depth, and Hyperparameter Transfer [40.40780546513363]
We provide descriptions of both non-residual and residual neural networks, the latter of which enables an infinite depth limit when branches are scaled as $1/\sqrt{\text{depth}}$ (a minimal sketch of this scaling appears after this list). We show that this model recovers the accelerated power law training dynamics for power law structured data in the rich regime observed in recent works.
arXiv Detail & Related papers (2025-02-04T17:50:55Z) - Diffusion-Based Neural Network Weights Generation [80.89706112736353]
D2NWG is a diffusion-based neural network weights generation technique that efficiently produces high-performing weights for transfer learning.
Our method extends generative hyper-representation learning to recast the latent diffusion paradigm for neural network weights generation.
Our approach is scalable to large architectures such as large language models (LLMs), overcoming the limitations of current parameter generation techniques.
arXiv Detail & Related papers (2024-02-28T08:34:23Z) - Robust Neural Pruning with Gradient Sampling Optimization for Residual Neural Networks [0.0]
This research pioneers the integration of gradient sampling optimization techniques, particularly StochGradAdam, into the pruning process of neural networks.
Our main objective is to address the significant challenge of maintaining accuracy in pruned neural models, critical in resource-constrained scenarios.
arXiv Detail & Related papers (2023-12-26T12:19:22Z) - ConCerNet: A Contrastive Learning Based Framework for Automated Conservation Law Discovery and Trustworthy Dynamical System Prediction [82.81767856234956]
This paper proposes a new learning framework named ConCerNet to improve the trustworthiness of the DNN based dynamics modeling.
We show that our method consistently outperforms the baseline neural networks in both coordinate error and conservation metrics.
arXiv Detail & Related papers (2023-02-11T21:07:30Z) - FOSTER: Feature Boosting and Compression for Class-Incremental Learning [52.603520403933985]
Deep neural networks suffer from catastrophic forgetting when learning new categories.
We propose a novel two-stage learning paradigm FOSTER, empowering the model to learn new categories adaptively.
arXiv Detail & Related papers (2022-04-10T11:38:33Z) - Learning to Continuously Optimize Wireless Resource in a Dynamic Environment: A Bilevel Optimization Perspective [52.497514255040514]
This work develops a new approach that enables data-driven methods to continuously learn and optimize resource allocation strategies in a dynamic environment.
We propose to build the notion of continual learning into wireless system design, so that the learning model can incrementally adapt to the new episodes.
Our design is based on a novel bilevel optimization formulation which ensures certain "fairness" across different data samples.
arXiv Detail & Related papers (2021-05-03T07:23:39Z) - An Ode to an ODE [78.97367880223254]
We present a new paradigm for Neural ODE algorithms, called ODEtoODE, where time-dependent parameters of the main flow evolve according to a matrix flow on the group O(d).
This nested system of two flows provides stability and effectiveness of training and provably solves the gradient vanishing-explosion problem.
arXiv Detail & Related papers (2020-06-19T22:05:19Z) - Dynamic Hierarchical Mimicking Towards Consistent Optimization Objectives [73.15276998621582]
We propose a generic feature learning mechanism to advance CNN training with enhanced generalization ability.
Partially inspired by DSN, we fork delicately designed side branches from the intermediate layers of a given neural network.
Experiments on both category and instance recognition tasks demonstrate the substantial improvements of our proposed method.
arXiv Detail & Related papers (2020-03-24T09:56:13Z)
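For the deep linear network entry above, the $1/\sqrt{\text{depth}}$ residual branch scaling that enables its infinite-depth limit can be written out in a few lines. The width, depth, and bias-free linear blocks below are illustrative assumptions, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class ScaledResidualLinearNet(nn.Module):
    """Deep linear residual network whose branches are scaled by 1/sqrt(depth),
    the scaling under which an infinite-depth limit can be taken."""
    def __init__(self, width: int, depth: int):
        super().__init__()
        self.scale = depth ** -0.5
        self.blocks = nn.ModuleList(
            nn.Linear(width, width, bias=False) for _ in range(depth)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            x = x + self.scale * block(x)  # residual branch damped by 1/sqrt(depth)
        return x

# Example: doubling the depth leaves the per-layer contribution balanced.
y = ScaledResidualLinearNet(width=64, depth=128)(torch.randn(8, 64))
```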