Revisiting Dynamic Evaluation: Online Adaptation for Large Language
Models
- URL: http://arxiv.org/abs/2403.01518v1
- Date: Sun, 3 Mar 2024 14:03:48 GMT
- Title: Revisiting Dynamic Evaluation: Online Adaptation for Large Language
Models
- Authors: Amal Rannen-Triki, Jorg Bornschein, Razvan Pascanu, Marcus Hutter,
Andras György, Alexandre Galashov, Yee Whye Teh, Michalis K. Titsias
- Abstract summary: We consider the problem of online fine tuning the parameters of a language model at test time, also known as dynamic evaluation.
Online adaptation turns parameters into temporally changing states and provides a form of context-length extension with memory in weights.
- Score: 88.47454470043552
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the problem of online fine tuning the parameters of a language
model at test time, also known as dynamic evaluation. While it is generally
known that this approach improves the overall predictive performance,
especially when considering distributional shift between training and
evaluation data, we here emphasize the perspective that online adaptation turns
parameters into temporally changing states and provides a form of
context-length extension with memory in weights, more in line with the concept
of memory in neuroscience. We pay particular attention to the speed of
adaptation (in terms of sample efficiency), sensitivity to the overall
distributional drift, and the computational overhead for performing gradient
computations and parameter updates. Our empirical study provides insights on
when online adaptation is particularly interesting. We highlight that with
online adaptation the conceptual distinction between in-context learning and
fine tuning blurs: both are methods to condition the model on previously
observed tokens.
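As a rough illustration of the dynamic-evaluation loop described in the abstract, the sketch below scores each token before taking one gradient step on it, so the parameters carry information about earlier tokens. It is not the paper's implementation: the unigram softmax model, the plain SGD update, the learning rate, and the names `evaluate_stream`, `pretrained`, and `stream` are all assumptions chosen to keep the example small.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def evaluate_stream(logits, tokens, lr=0.0):
    """Return the average log-loss of a toy unigram model on a token stream.

    With lr == 0 the parameters stay fixed (static evaluation).
    With lr > 0 each token is scored first and then used for one SGD step,
    which is the score-then-update order of dynamic evaluation.
    """
    theta = list(logits)  # copy so the "pretrained" parameters are untouched
    total = 0.0
    for t in tokens:
        probs = softmax(theta)
        total += -math.log(probs[t])  # loss is recorded before any update
        if lr > 0.0:
            # Gradient of cross-entropy w.r.t. logits: probs - one_hot(t)
            for i, p in enumerate(probs):
                grad = p - (1.0 if i == t else 0.0)
                theta[i] -= lr * grad
    return total / len(tokens)


# Hypothetical setup: uniform "pretrained" logits over 4 tokens, evaluated on
# a shifted stream that uses only tokens 2 and 3.
pretrained = [0.0, 0.0, 0.0, 0.0]
stream = [2, 2, 3, 2, 2, 2, 3, 2] * 25

static_loss = evaluate_stream(pretrained, stream, lr=0.0)
dynamic_loss = evaluate_stream(pretrained, stream, lr=0.5)
print(f"static: {static_loss:.3f}  dynamic: {dynamic_loss:.3f}")
```

In this toy setting the adapted model reaches a lower average loss than the static one because the stream's distribution differs from the one the initial parameters encode, which is the distribution-shift case the abstract singles out.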
Related papers
- Information Guided Regularization for Fine-tuning Language Models [11.831883526217942]
We argue that a more surgical approach to regularization needs to exist for smoother transfer learning.
We devise a novel approach to dropout for improved model regularization and better downstream generalization.
arXiv Detail & Related papers (2024-06-20T05:18:37Z)
- Online Variational Sequential Monte Carlo [49.97673761305336]
We build upon the variational sequential Monte Carlo (VSMC) method, which provides computationally efficient and accurate model parameter estimation and Bayesian latent-state inference.
Online VSMC is capable of performing efficiently, entirely on-the-fly, both parameter estimation and particle proposal adaptation.
arXiv Detail & Related papers (2023-12-19T21:45:38Z)
- Fast-Slow Test-Time Adaptation for Online Vision-and-Language Navigation [67.18144414660681]
We propose a Fast-Slow Test-Time Adaptation (FSTTA) approach for online Vision-and-Language Navigation (VLN).
Our method obtains impressive performance gains on four popular benchmarks.
arXiv Detail & Related papers (2023-11-22T07:47:39Z)
- Learning Neural Models for Natural Language Processing in the Face of Distributional Shift [10.990447273771592]
The dominating NLP paradigm of training a strong neural predictor to perform one task on a specific dataset has led to state-of-the-art performance in a variety of applications.
It builds upon the assumption that the data distribution is stationary, i.e., that the data is sampled from a fixed distribution both at training and test time.
This way of training is inconsistent with how we as humans are able to learn from and operate within a constantly changing stream of information.
It is ill-adapted to real-world use cases where the data distribution is expected to shift over the course of a model's lifetime.
arXiv Detail & Related papers (2021-09-03T14:29:20Z)
- Online Learning of a Probabilistic and Adaptive Scene Representation [31.02016059126335]
We build a consistent scene model on-the-fly for online spatial perception, interpretation, and action.
We experimentally show that the proposed representation achieves state-of-the-art accuracy with promising efficiency.
arXiv Detail & Related papers (2021-03-31T06:22:05Z)
- POLA: Online Time Series Prediction by Adaptive Learning Rates [4.105553918089042]
We propose POLA to automatically regulate the learning rate of recurrent neural network models to adapt to changing time series patterns across time.
POLA demonstrates overall comparable or better predictive performance over other online prediction methods.
arXiv Detail & Related papers (2021-02-17T17:56:12Z)
- Recurrent Point Review Models [1.412197703754359]
We build on deep neural network models to incorporate temporal information and model how review data changes with time.
We use the dynamic representations of recurrent point process models, which encode the history of how business or service reviews are received in time.
We deploy our methodologies in the context of recommender systems, effectively characterizing the change in preference and taste of users as time evolves.
arXiv Detail & Related papers (2020-12-10T14:11:42Z)
- Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem).
AdaRem adjusts the parameter-wise learning rate according to whether the direction in which a parameter changed in the past is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning rate-based algorithms in terms of the training speed and the test error.
arXiv Detail & Related papers (2020-10-21T14:49:00Z)
- AdaS: Adaptive Scheduling of Stochastic Gradients [50.80697760166045]
We introduce the notions of "knowledge gain" and "mapping condition" and propose a new algorithm called Adaptive Scheduling (AdaS).
Experimentation reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) lack of dependence on a validation set to determine when to stop training.
arXiv Detail & Related papers (2020-06-11T16:36:31Z)
- Tracking Performance of Online Stochastic Learners [57.14673504239551]
Online algorithms are popular in large-scale learning settings due to their ability to compute updates on the fly, without the need to store and process data in large batches.
When a constant step-size is used, these algorithms also have the ability to adapt to drifts in problem parameters, such as data or model properties, and track the optimal solution with reasonable accuracy.
We establish a link between steady-state performance derived under stationarity assumptions and the tracking performance of online learners under random walk models.
arXiv Detail & Related papers (2020-04-04T14:16:27Z)