Meta-Learning Fast Weight Language Models
- URL: http://arxiv.org/abs/2212.02475v1
- Date: Mon, 5 Dec 2022 18:37:09 GMT
- Title: Meta-Learning Fast Weight Language Models
- Authors: Kevin Clark, Kelvin Guu, Ming-Wei Chang, Panupong Pasupat, Geoffrey
Hinton, Mohammad Norouzi
- Abstract summary: We present Fast Weight Layers (FWLs), a neural component that provides the benefits of dynamic evaluation much more efficiently.
FWLs can be applied at training time so the model learns to make good use of gradient updates.
- Score: 105.66999854213724
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dynamic evaluation of language models (LMs) adapts model parameters at test
time using gradient information from previous tokens and substantially improves
LM performance. However, it requires over 3x more compute than standard
inference. We present Fast Weight Layers (FWLs), a neural component that
provides the benefits of dynamic evaluation much more efficiently by expressing
gradient updates as linear attention. A key improvement over dynamic evaluation
is that FWLs can also be applied at training time so the model learns to make
good use of gradient updates. FWLs can easily be added on top of existing
transformer models, require relatively little extra compute or memory to run,
and significantly improve language modeling perplexity.
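
The core idea above, that a gradient step on a linear output layer is a rank-1 update of the same form linear attention accumulates, can be illustrated with a short PyTorch sketch. This is a toy illustration under assumed shapes, a fixed step size, and a plain softmax head, not the authors' FWL architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FastWeightSketch(nn.Module):
    """Toy fast-weight head: a slow output layer plus an accumulated
    rank-1 correction that mimics online gradient updates to that layer."""

    def __init__(self, d_model: int, vocab_size: int, lr: float = 0.1):
        super().__init__()
        self.slow_out = nn.Linear(d_model, vocab_size)  # ordinary ("slow") softmax layer
        self.lr = lr                                    # assumed fixed step size

    def forward(self, hidden: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        """hidden: (seq_len, d_model) transformer outputs; targets: (seq_len,) next-token ids."""
        seq_len, _ = hidden.shape
        vocab = self.slow_out.out_features
        fast_w = hidden.new_zeros(vocab, hidden.shape[-1])  # fast weights start at zero
        logits = []
        for t in range(seq_len):
            h = hidden[t]
            # Fast weights add a context-dependent correction to the slow logits.
            logit_t = self.slow_out(h) + fast_w @ h
            logits.append(logit_t)
            # One SGD-like step on the output layer after observing targets[t]:
            # the cross-entropy gradient is a rank-1 outer product, and summing
            # these (key = hidden state, value = prediction error) terms is the
            # kind of accumulation unnormalized linear attention performs.
            err = F.one_hot(targets[t], vocab).float() - logit_t.softmax(-1)
            fast_w = fast_w + self.lr * torch.outer(err, h)
        return torch.stack(logits)  # (seq_len, vocab_size), differentiable end to end
```

Because the fast-weight update is written as ordinary differentiable tensor operations, such a head can also be trained end to end, which is the property the abstract highlights over test-time-only dynamic evaluation.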
Related papers
- ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets [106.7760874400261]
This paper presents ML-SUPERB 2.0, a new benchmark for evaluating pre-trained SSL and supervised speech models.
We find performance improvements over the setup of ML-SUPERB, but performance depends on the downstream model design.
Also, we find large performance differences between languages and datasets, suggesting the need for more targeted approaches.
arXiv Detail & Related papers (2024-06-12T21:01:26Z)
- CATfOOD: Counterfactual Augmented Training for Improving Out-of-Domain Performance and Calibration [59.48235003469116]
We show that data augmentation consistently enhances OOD performance.
We also show that CF-augmented models that are easier to calibrate also exhibit much lower entropy when assigning importance.
arXiv Detail & Related papers (2023-09-14T16:16:40Z)
- Meta-Learning Online Adaptation of Language Models [88.8947656843812]
Large language models encode impressively broad world knowledge in their parameters.
However, the knowledge in static language models falls out of date, limiting the model's effective "shelf life."
arXiv Detail & Related papers (2023-05-24T11:56:20Z)
- Fine-Tuning Pre-Trained Language Models Effectively by Optimizing Subnetworks Adaptively [32.001304911395756]
We propose a Dynamic Parameter Selection (DPS) algorithm for large-scale pre-trained models during fine-tuning.
Experiments on the GLUE benchmark show that DPS outperforms previous fine-tuning methods in terms of overall performance and stability.
arXiv Detail & Related papers (2022-11-03T08:32:12Z)
- Reconsidering the Past: Optimizing Hidden States in Language Models [35.7524942657169]
We present Hidden-State Optimization (HSO), a gradient-based method for improving the performance of transformer language models.
HSO computes the gradient of the log-probability the language model assigns to an evaluation text, but uses it to update the cached hidden states rather than the model parameters.
arXiv Detail & Related papers (2021-12-16T06:14:37Z)
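
As a rough illustration of the update described in the entry above, the sketch below takes gradient-ascent steps on a cache of past hidden states rather than on model parameters. The `model.score` interface, step size, and number of steps are assumptions for illustration, not the HSO authors' code.

```python
import torch


def update_cached_states(model, cached_states, input_ids, lr=0.01, steps=1):
    """cached_states: a tensor of past hidden states the model conditions on."""
    cache = cached_states.detach().requires_grad_(True)
    for _ in range(steps):
        # Assumed interface: the model scores input_ids conditioned on the cache
        # and returns the total log-probability of the observed tokens (a scalar).
        log_prob = model.score(input_ids, past_hidden=cache)
        grad, = torch.autograd.grad(log_prob, cache)
        # Gradient *ascent* on the cache: raise the likelihood of the seen text
        # while leaving the model parameters untouched.
        cache = (cache + lr * grad).detach().requires_grad_(True)
    return cache.detach()
```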
- Regularized Training of Nearest Neighbor Language Models [10.994336081018043]
We build upon $k$NN-LM (Khandelwal et al., 2020), which uses a pre-trained language model together with an exhaustive $k$NN search through the training data (memory bank) to achieve state-of-the-art results.
We find that the added L2 regularization seems to improve performance for high-frequency words without deteriorating performance for low-frequency ones.
arXiv Detail & Related papers (2021-09-16T23:20:24Z)
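
For context, the entry above builds on the kNN-LM interpolation sketched below: the parametric LM distribution is mixed with a distribution read off the nearest stored (context vector, next token) pairs. The L2 distance kernel, temperature, and mixing weight here are illustrative assumptions, and the L2 regularization studied in the paper is not shown.

```python
import torch
import torch.nn.functional as F


def knn_lm_probs(lm_logits, query, bank_keys, bank_values, vocab_size, k=8, lam=0.25, temp=1.0):
    """lm_logits: (vocab,); query: (d,); bank_keys: (N, d); bank_values: (N,) next-token ids."""
    # Retrieve the k stored contexts closest to the current hidden state (L2 distance).
    dists = torch.cdist(query[None], bank_keys).squeeze(0)       # (N,)
    knn_dists, knn_idx = dists.topk(k, largest=False)
    # Turn negative distances into a distribution over the retrieved next tokens.
    knn_weights = F.softmax(-knn_dists / temp, dim=-1)           # (k,)
    knn_probs = torch.zeros(vocab_size).scatter_add_(0, bank_values[knn_idx], knn_weights)
    # Interpolate the memory-based distribution with the parametric LM distribution.
    return (1 - lam) * F.softmax(lm_logits, dim=-1) + lam * knn_probs
```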
- Gone Fishing: Neural Active Learning with Fisher Embeddings [55.08537975896764]
There is an increasing need for active learning algorithms that are compatible with deep neural networks.
This article introduces BAIT, a practical, tractable, and high-performing active learning algorithm for neural networks.
arXiv Detail & Related papers (2021-06-17T17:26:31Z)
- Revisiting Simple Neural Probabilistic Language Models [27.957834093475686]
This paper revisits the neural probabilistic language model (NPLM) of Bengio et al. (2003).
When scaled up to modern hardware, this model performs much better than expected on word-level language model benchmarks.
Inspired by this result, we modify the Transformer by replacing its first self-attention layer with the NPLM's local concatenation layer.
arXiv Detail & Related papers (2021-04-08T02:18:47Z)
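
As a rough sketch of the kind of local concatenation layer mentioned in the entry above: each position concatenates the embeddings of a fixed window of preceding tokens and projects the result back to the model width. The window size, padding, and exact placement in the Transformer are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn


class LocalConcatLayer(nn.Module):
    def __init__(self, d_model: int, window: int = 4):
        super().__init__()
        self.window = window
        self.proj = nn.Linear(window * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (batch, seq_len, d_model) token embeddings."""
        batch, seq_len, d_model = x.shape
        # Left-pad so position t sees tokens t-window+1 .. t (causal, fixed window).
        padded = torch.cat([x.new_zeros(batch, self.window - 1, d_model), x], dim=1)
        # Gather each window and flatten it: (batch, seq_len, window * d_model).
        windows = padded.unfold(1, self.window, 1).transpose(-1, -2).reshape(batch, seq_len, -1)
        return self.proj(windows)
```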
- Learning Discrete Energy-based Models via Auxiliary-variable Local Exploration [130.89746032163106]
We propose ALOE, a new algorithm for learning conditional and unconditional EBMs for discrete structured data.
We show that the energy function and sampler can be trained efficiently via a new variational form of power iteration.
We also present an energy-model-guided fuzzer for software testing that achieves performance comparable to well-engineered fuzzing engines like libfuzzer.
arXiv Detail & Related papers (2020-11-10T19:31:29Z)
- On-the-Fly Adaptation of Source Code Models using Meta-Learning [28.98699307030983]
We frame the problem of context adaptation as a meta-learning problem.
We train a base source code model that is best able to learn from information in a file to deliver improved predictions of missing tokens.
We demonstrate improved performance in experiments on a large scale Java GitHub corpus.
arXiv Detail & Related papers (2020-03-26T07:11:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.