Kalman Bayesian Transformer
- URL: http://arxiv.org/abs/2509.10695v1
- Date: Fri, 12 Sep 2025 21:15:23 GMT
- Title: Kalman Bayesian Transformer
- Authors: Haoming Jing, Oren Wright, José M. F. Moura, Yorie Nakahira,
- Abstract summary: We propose a novel method that frames sequential fine-tuning as a posterior inference problem within a Bayesian framework.<n>By explicitly accounting for pre-trained models as priors and adaptively balancing them against new information based on quantified uncertainty, our method achieves robust and data-efficient sequential learning.
- Score: 11.613736588909246
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sequential fine-tuning of transformers is useful when new data arrive sequentially, especially with shifting distributions. Unlike batch learning, sequential learning demands that training be stabilized despite a small amount of data by balancing new information and previously learned knowledge in the pre-trained models. This challenge is further complicated when training is to be completed in latency-critical environments and learning must additionally quantify and be mediated by uncertainty. Motivated by these challenges, we propose a novel method that frames sequential fine-tuning as a posterior inference problem within a Bayesian framework. Our approach integrates closed-form moment propagation of random variables, Kalman Bayesian Neural Networks, and Taylor approximations of the moments of softmax functions. By explicitly accounting for pre-trained models as priors and adaptively balancing them against new information based on quantified uncertainty, our method achieves robust and data-efficient sequential learning. The effectiveness of our method is demonstrated through numerical simulations involving sequential adaptation of a decision transformer to tasks characterized by distribution shifts and limited memory resources.
Related papers
- An Efficient Continual Learning Framework for Multivariate Time Series Prediction Tasks with Application to Vehicle State Estimation [11.197937792525684]
We present EM-ReSeleCT, an enhanced approach to handle continual learning in time series tasks.<n>Our approach strategically selects representative subsets from old and historical data.<n>We also develop a sequence-to-sequence transformer model (autoregressive model) specifically designed for vehicle state estimation.
arXiv Detail & Related papers (2025-03-03T15:42:06Z) - Non-Stationary Learning of Neural Networks with Automatic Soft Parameter Reset [98.52916361979503]
We introduce a novel learning approach that automatically models and adapts to non-stationarity.
We show empirically that our approach performs well in non-stationary supervised and off-policy reinforcement learning settings.
arXiv Detail & Related papers (2024-11-06T16:32:40Z) - Dreaming Learning [41.94295877935867]
Introducing new information to a machine learning system can interfere with previously stored data.<n>We propose a training algorithm inspired by Stuart Kauffman's notion of the Adjacent Possible.<n>It predisposes the neural network to smoothly accept and integrate data sequences with different statistical characteristics than expected.
arXiv Detail & Related papers (2024-10-23T09:17:31Z) - In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z) - Towards Robust Continual Learning with Bayesian Adaptive Moment Regularization [51.34904967046097]
Continual learning seeks to overcome the challenge of catastrophic forgetting, where a model forgets previously learnt information.
We introduce a novel prior-based method that better constrains parameter growth, reducing catastrophic forgetting.
Results show that BAdam achieves state-of-the-art performance for prior-based methods on challenging single-headed class-incremental experiments.
arXiv Detail & Related papers (2023-09-15T17:10:51Z) - Complementary Learning Subnetworks for Parameter-Efficient
Class-Incremental Learning [40.13416912075668]
We propose a rehearsal-free CIL approach that learns continually via the synergy between two Complementary Learning Subnetworks.
Our method achieves competitive results against state-of-the-art methods, especially in accuracy gain, memory cost, training efficiency, and task-order.
arXiv Detail & Related papers (2023-06-21T01:43:25Z) - Continual Learning with Pretrained Backbones by Tuning in the Input
Space [44.97953547553997]
The intrinsic difficulty in adapting deep learning models to non-stationary environments limits the applicability of neural networks to real-world tasks.
We propose a novel strategy to make the fine-tuning procedure more effective, by avoiding to update the pre-trained part of the network and learning not only the usual classification head, but also a set of newly-introduced learnable parameters.
arXiv Detail & Related papers (2023-06-05T15:11:59Z) - Adapting to Continuous Covariate Shift via Online Density Ratio Estimation [64.8027122329609]
Dealing with distribution shifts is one of the central challenges for modern machine learning.
We propose an online method that can appropriately reuse historical information.
Our density ratio estimation method is proven to perform well by enjoying a dynamic regret bound.
arXiv Detail & Related papers (2023-02-06T04:03:33Z) - Stabilizing Machine Learning Prediction of Dynamics: Noise and
Noise-inspired Regularization [58.720142291102135]
Recent has shown that machine learning (ML) models can be trained to accurately forecast the dynamics of chaotic dynamical systems.
In the absence of mitigating techniques, this technique can result in artificially rapid error growth, leading to inaccurate predictions and/or climate instability.
We introduce Linearized Multi-Noise Training (LMNT), a regularization technique that deterministically approximates the effect of many small, independent noise realizations added to the model input during training.
arXiv Detail & Related papers (2022-11-09T23:40:52Z) - AdaS: Adaptive Scheduling of Stochastic Gradients [50.80697760166045]
We introduce the notions of textit"knowledge gain" and textit"mapping condition" and propose a new algorithm called Adaptive Scheduling (AdaS)
Experimentation reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) lack of dependence on a validation set to determine when to stop training.
arXiv Detail & Related papers (2020-06-11T16:36:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.