Semiparametric Language Models Are Scalable Continual Learners
- URL: http://arxiv.org/abs/2303.01421v1
- Date: Thu, 2 Mar 2023 17:15:02 GMT
- Title: Semiparametric Language Models Are Scalable Continual Learners
- Authors: Guangyue Peng, Tao Ge, Si-Qing Chen, Furu Wei, Houfeng Wang
- Abstract summary: Semiparametric language models (LMs) have shown promise in continuously learning from new text data.
We present a simple and intuitive approach called Selective Memorization (SeMem).
SeMem only memorizes difficult samples that the model is likely to struggle with.
- Score: 83.74414880208334
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Semiparametric language models (LMs) have shown promise in continuously
learning from new text data by combining a parameterized neural LM with a
growable non-parametric memory for memorizing new content. However,
conventional semiparametric LMs eventually become prohibitively expensive to
compute and store when applied to continual learning over streaming data,
because the non-parametric memory grows linearly with the amount of data they
learn from over time. To address this scalability issue, we present a simple
and intuitive approach called Selective Memorization (SeMem), which only
memorizes difficult samples that the model is likely to struggle with. We
demonstrate that SeMem improves the scalability of semiparametric LMs for
continual learning over streaming data in two ways: (1) data-wise scalability:
as the model becomes stronger through continual learning, it encounters fewer
difficult cases that need to be memorized, so the non-parametric memory grows
more slowly over time rather than linearly with the size of the training data;
(2) model-wise scalability: SeMem allows a larger model to memorize fewer
samples than its smaller counterpart, because a larger model is less likely to
encounter incomprehensible cases, so its non-parametric memory does not scale
linearly with model size. We conduct extensive experiments on language modeling
and downstream tasks to evaluate SeMem, showing that it enables a
semiparametric LM to be a scalable continual learner with little forgetting.
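The abstract describes the selection rule only at a high level. The sketch below is a rough illustration rather than the paper's actual method: it shows one way a selective-memorization criterion could be wired into a kNN-LM-style datastore, storing only tokens whose loss under the base LM exceeds a threshold. The HuggingFace-style model interface, the `loss_threshold` value, and per-token loss as the difficulty measure are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def selective_memorize(model, tokenizer, text, datastore, loss_threshold=4.0):
    """Store only the tokens the base LM finds difficult (a sketch, not SeMem's
    exact criterion). `datastore` is a list of (hidden_state, next_token_id)
    pairs, as in a kNN-LM-style non-parametric memory; `loss_threshold` is a
    hypothetical cutoff in nats. Assumes a HuggingFace-style causal LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids            # (1, T)
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    hidden = out.hidden_states[-1][0]                               # (T, d)
    logits = out.logits[0]                                          # (T, V)

    # Loss of predicting token t+1 from position t, one value per position.
    token_loss = F.cross_entropy(logits[:-1], ids[0, 1:], reduction="none")

    # Memorize only the difficult positions; easy ones are skipped, so the
    # memory grows sublinearly as the model gets stronger.
    for t, loss in enumerate(token_loss):
        if loss.item() > loss_threshold:
            datastore.append((hidden[t].detach().cpu(), ids[0, t + 1].item()))
    return datastore
```

At inference time, a kNN-LM-style setup would interpolate retrieval over this datastore with the parametric LM's distribution; per the abstract, SeMem's contribution is keeping that datastore small by writing only the hard cases.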
Related papers
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, using a minimal number of late pre-trained layers can alleviate the peak memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z) - CAMELoT: Towards Large Language Models with Training-Free Consolidated
Associative Memory [38.429707659685974]
Large Language Models (LLMs) struggle to handle long input sequences due to high memory and runtime costs.
We introduce an associative memory module which can be coupled to any pre-trained (frozen) attention-based LLM without re-training.
This architecture, which we call CAMELoT, demonstrates superior performance even with a tiny context window of 128 tokens.
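For orientation only, here is a generic key-value associative memory of the kind such a module builds on; it is not CAMELoT's consolidation mechanism, and the capacity, eviction rule, and scaled dot-product read are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

class AssociativeMemory:
    """A generic key-value associative memory (illustration only; CAMELoT's
    training-free consolidation is more involved). It stores (key, value)
    vectors and answers a query by attention over the stored keys, so it can
    sit next to a frozen attention-based LM without retraining it."""

    def __init__(self, dim, capacity=1024):
        self.keys = torch.empty(0, dim)
        self.values = torch.empty(0, dim)
        self.capacity = capacity

    def write(self, key, value):
        # Append the new pair and evict the oldest entries past capacity.
        self.keys = torch.cat([self.keys, key.unsqueeze(0)])[-self.capacity:]
        self.values = torch.cat([self.values, value.unsqueeze(0)])[-self.capacity:]

    def read(self, query):
        # Soft nearest-neighbour lookup: scaled dot-product attention weights
        # over the stored keys, returning a weighted mix of stored values.
        if self.keys.shape[0] == 0:
            return torch.zeros_like(query)
        scores = self.keys @ query / query.shape[-1] ** 0.5
        return F.softmax(scores, dim=0) @ self.values
```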
arXiv Detail & Related papers (2024-02-21T01:00:17Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for approximating matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
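For context, the classic column-row sampling (CRS) estimator below shows why such sampled matrix products can be unbiased; the winner-take-all refinement that gives WTA-CRS its lower variance is not reproduced here, and the sample size `k` is an arbitrary illustrative choice.

```python
import numpy as np

def crs_matmul(A, B, k, rng=None):
    """Classic column-row sampling (CRS) estimate of A @ B: sample k
    column/row pairs with probability proportional to ||A[:, i]|| * ||B[i, :]||
    and rescale each sampled outer product by 1 / (k * p_i), which makes the
    estimator unbiased. The winner-take-all selection of WTA-CRS is omitted."""
    rng = rng or np.random.default_rng(0)
    weights = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    probs = weights / weights.sum()
    idx = rng.choice(A.shape[1], size=k, p=probs)
    return sum(np.outer(A[:, i], B[i, :]) / (k * probs[i]) for i in idx)

# The estimate approaches the exact product as k grows.
A, B = np.random.randn(64, 256), np.random.randn(256, 32)
err = np.linalg.norm(crs_matmul(A, B, k=2000) - A @ B) / np.linalg.norm(A @ B)
print(f"relative error with k=2000 samples: {err:.3f}")
```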
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - Can recurrent neural networks learn process model structure? [0.2580765958706854]
We introduce an evaluation framework that combines variant-based resampling and custom metrics for fitness, precision and generalization.
We confirm that LSTMs can struggle to learn process model structure, even with simplistic process data.
We also found that decreasing the amount of information seen by the LSTM during training causes a sharp drop in generalization and precision scores.
arXiv Detail & Related papers (2022-12-13T08:40:01Z) - A Memory Transformer Network for Incremental Learning [64.0410375349852]
We study class-incremental learning, a training setup in which new classes of data are observed over time for the model to learn from.
Despite the straightforward problem formulation, the naive application of classification models to class-incremental learning results in the "catastrophic forgetting" of previously seen classes.
One of the most successful existing methods has been the use of a memory of exemplars, which mitigates catastrophic forgetting by saving a subset of past data into a memory bank and replaying it when training on future tasks.
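As a rough illustration of the exemplar-memory idea described above (not the Memory Transformer Network itself), the sketch below keeps a fixed per-class budget of past examples and samples rehearsal batches from it; the budget size and the keep-the-first-N retention rule are arbitrary assumptions.

```python
import random
from collections import defaultdict

class ExemplarMemory:
    """A generic exemplar memory for class-incremental learning (a sketch, not
    the Memory Transformer Network): keep a fixed budget of examples per class
    and replay them alongside new-task data to reduce forgetting."""

    def __init__(self, per_class_budget=20):
        self.per_class_budget = per_class_budget
        self.bank = defaultdict(list)                 # class id -> examples

    def add_task(self, examples):
        """`examples` is an iterable of (x, y) pairs from the task just learned."""
        for x, y in examples:
            if len(self.bank[y]) < self.per_class_budget:
                # Keep the first N per class; real methods use smarter
                # selection (e.g. herding), which is beyond this sketch.
                self.bank[y].append(x)

    def replay_batch(self, size):
        """Sample a rehearsal batch of stored examples to mix into training."""
        pool = [(x, y) for y, xs in self.bank.items() for x in xs]
        return random.sample(pool, min(size, len(pool)))
```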
arXiv Detail & Related papers (2022-10-10T08:27:28Z) - Incremental Online Learning Algorithms Comparison for Gesture and Visual
Smart Sensors [68.8204255655161]
This paper compares four state-of-the-art algorithms in two real applications: gesture recognition based on accelerometer data and image classification.
Our results confirm these systems' reliability and the feasibility of deploying them in tiny-memory MCUs.
arXiv Detail & Related papers (2022-09-01T17:05:20Z) - Quantifying Memorization Across Neural Language Models [61.58529162310382]
Large language models (LMs) have been shown to memorize parts of their training data, and when prompted appropriately, they will emit the memorized data verbatim.
This is undesirable because memorization violates privacy (exposing user data), degrades utility (repeated easy-to-memorize text is often low quality), and hurts fairness (some texts are memorized over others).
We describe three log-linear relationships that quantify the degree to which LMs emit memorized training data.
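A minimal version of the verbatim-emission check implied above: prompt the model with a prefix of a training example and test whether greedy decoding reproduces the true continuation. It assumes a HuggingFace-style causal LM, and the prefix and suffix lengths are illustrative, not the paper's settings.

```python
import torch

def emits_verbatim(model, tokenizer, example_text, prefix_len=50, suffix_len=50):
    """Prompt the model with the first `prefix_len` tokens of a training
    example and check whether greedy decoding reproduces the next
    `suffix_len` tokens exactly. Assumes a HuggingFace-style causal LM;
    the lengths here are arbitrary illustrative choices."""
    ids = tokenizer(example_text, return_tensors="pt").input_ids[0]
    if ids.shape[0] < prefix_len + suffix_len:
        return False                                  # example too short to test
    prefix = ids[:prefix_len].unsqueeze(0)
    true_suffix = ids[prefix_len:prefix_len + suffix_len]
    with torch.no_grad():
        generated = model.generate(
            prefix, max_new_tokens=suffix_len, do_sample=False  # greedy decoding
        )[0][prefix_len:]
    # Different lengths (e.g. early end-of-text) simply count as not memorized.
    return generated.shape == true_suffix.shape and torch.equal(
        generated.cpu(), true_suffix.cpu()
    )
```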
arXiv Detail & Related papers (2022-02-15T18:48:31Z) - Evolving Metric Learning for Incremental and Decremental Features [45.696514400861275]
We develop a new online Evolving Metric Learning model for incremental and decremental features.
Our model can handle instance and feature evolution simultaneously by incorporating a smoothed Wasserstein metric distance.
In addition to tackling the challenges in the one-shot case, we also extend our model to the multi-shot scenario.
arXiv Detail & Related papers (2020-06-27T10:29:38Z)