Semiparametric Language Models Are Scalable Continual Learners
- URL: http://arxiv.org/abs/2303.01421v1
- Date: Thu, 2 Mar 2023 17:15:02 GMT
- Title: Semiparametric Language Models Are Scalable Continual Learners
- Authors: Guangyue Peng, Tao Ge, Si-Qing Chen, Furu Wei, Houfeng Wang
- Abstract summary: Semiparametric language models (LMs) have shown promise in continuously learning from new text data.
We present a simple and intuitive approach called Selective Memorization (SeMem).
SeMem only memorizes difficult samples that the model is likely to struggle with.
- Score: 83.74414880208334
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Semiparametric language models (LMs) have shown promise in continuously
learning from new text data by combining a parameterized neural LM with a
growable non-parametric memory for memorizing new content. However,
conventional semiparametric LMs eventually become prohibitively expensive to
compute and store when applied to continual learning over streaming data,
because the non-parametric memory grows linearly with the amount of data they
learn from over time. To address this scalability issue, we present a simple
and intuitive approach called Selective Memorization (SeMem), which only
memorizes difficult samples that the model is likely to struggle with. We
demonstrate that SeMem improves the scalability of semiparametric LMs for
continual learning over streaming data in two ways: (1) data-wise scalability:
as the model becomes stronger through continual learning, it encounters fewer
difficult cases that need to be memorized, so the non-parametric memory grows
more slowly over time rather than linearly with the size of the training data;
(2) model-wise scalability: SeMem allows a larger model to memorize fewer
samples than its smaller counterpart, because a larger model is less likely to
encounter incomprehensible cases, so its non-parametric memory does not scale
linearly with model size. We conduct extensive experiments on language modeling
and downstream tasks to evaluate SeMem, showing that it enables a
semiparametric LM to be a scalable continual learner with little forgetting.
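The abstract describes the selection rule only at a high level. The sketch below is a rough illustration rather than the paper's actual method: it shows one way a selective-memorization criterion could be wired into a kNN-LM-style datastore, storing only tokens whose loss under the base LM exceeds a threshold. The HuggingFace-style model interface, the `loss_threshold` value, and per-token loss as the difficulty measure are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def selective_memorize(model, tokenizer, text, datastore, loss_threshold=4.0):
    """Store only the tokens the base LM finds difficult (a sketch, not SeMem's
    exact criterion). `datastore` is a list of (hidden_state, next_token_id)
    pairs, as in a kNN-LM-style non-parametric memory; `loss_threshold` is a
    hypothetical cutoff in nats. Assumes a HuggingFace-style causal LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids            # (1, T)
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    hidden = out.hidden_states[-1][0]                               # (T, d)
    logits = out.logits[0]                                          # (T, V)

    # Loss of predicting token t+1 from position t, one value per position.
    token_loss = F.cross_entropy(logits[:-1], ids[0, 1:], reduction="none")

    # Memorize only the difficult positions; easy ones are skipped, so the
    # memory grows sublinearly as the model gets stronger.
    for t, loss in enumerate(token_loss):
        if loss.item() > loss_threshold:
            datastore.append((hidden[t].detach().cpu(), ids[0, t + 1].item()))
    return datastore
```

At inference time, a kNN-LM-style setup would interpolate retrieval over this datastore with the parametric LM's distribution; per the abstract, SeMem's contribution is keeping that datastore small by writing only the hard cases.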
Related papers
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, using a minimal number of late pre-trained layers can alleviate the peak memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z) - CAMELoT: Towards Large Language Models with Training-Free Consolidated
Associative Memory [38.429707659685974]
Large Language Models (LLMs) struggle to handle long input sequences due to high memory and runtime costs.
We introduce an associative memory module which can be coupled to any pre-trained (frozen) attention-based LLM without re-training.
This architecture, which we call CAMELoT, demonstrates superior performance even with a tiny context window of 128 tokens.
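For orientation only, here is a generic key-value associative memory of the kind such a module builds on; it is not CAMELoT's consolidation mechanism, and the capacity, eviction rule, and scaled dot-product read are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

class AssociativeMemory:
    """A generic key-value associative memory (illustration only; CAMELoT's
    training-free consolidation is more involved). It stores (key, value)
    vectors and answers a query by attention over the stored keys, so it can
    sit next to a frozen attention-based LM without retraining it."""

    def __init__(self, dim, capacity=1024):
        self.keys = torch.empty(0, dim)
        self.values = torch.empty(0, dim)
        self.capacity = capacity

    def write(self, key, value):
        # Append the new pair and evict the oldest entries past capacity.
        self.keys = torch.cat([self.keys, key.unsqueeze(0)])[-self.capacity:]
        self.values = torch.cat([self.values, value.unsqueeze(0)])[-self.capacity:]

    def read(self, query):
        # Soft nearest-neighbour lookup: scaled dot-product attention weights
        # over the stored keys, returning a weighted mix of stored values.
        if self.keys.shape[0] == 0:
            return torch.zeros_like(query)
        scores = self.keys @ query / query.shape[-1] ** 0.5
        return F.softmax(scores, dim=0) @ self.values
```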
arXiv Detail & Related papers (2024-02-21T01:00:17Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for approximating matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
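For context, the classic column-row sampling (CRS) estimator below shows why such sampled matrix products can be unbiased; the winner-take-all refinement that gives WTA-CRS its lower variance is not reproduced here, and the sample size `k` is an arbitrary illustrative choice.

```python
import numpy as np

def crs_matmul(A, B, k, rng=None):
    """Classic column-row sampling (CRS) estimate of A @ B: sample k
    column/row pairs with probability proportional to ||A[:, i]|| * ||B[i, :]||
    and rescale each sampled outer product by 1 / (k * p_i), which makes the
    estimator unbiased. The winner-take-all selection of WTA-CRS is omitted."""
    rng = rng or np.random.default_rng(0)
    weights = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    probs = weights / weights.sum()
    idx = rng.choice(A.shape[1], size=k, p=probs)
    return sum(np.outer(A[:, i], B[i, :]) / (k * probs[i]) for i in idx)

# The estimate approaches the exact product as k grows.
A, B = np.random.randn(64, 256), np.random.randn(256, 32)
err = np.linalg.norm(crs_matmul(A, B, k=2000) - A @ B) / np.linalg.norm(A @ B)
print(f"relative error with k=2000 samples: {err:.3f}")
```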
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - Can recurrent neural networks learn process model structure? [0.2580765958706854]
We introduce an evaluation framework that combines variant-based resampling and custom metrics for fitness, precision and generalization.
We confirm that LSTMs can struggle to learn process model structure, even with simplistic process data.
We also found that decreasing the amount of information seen by the LSTM during training causes a sharp drop in generalization and precision scores.
arXiv Detail & Related papers (2022-12-13T08:40:01Z) - A Memory Transformer Network for Incremental Learning [64.0410375349852]
We study class-incremental learning, a training setup in which new classes of data are observed over time for the model to learn from.
Despite the straightforward problem formulation, the naive application of classification models to class-incremental learning results in the "catastrophic forgetting" of previously seen classes.
One of the most successful existing methods has been the use of a memory of exemplars, which mitigates catastrophic forgetting by saving a subset of past data into a memory bank and replaying it when training on future tasks.
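As a rough illustration of the exemplar-memory idea described above (not the Memory Transformer Network itself), the sketch below keeps a fixed per-class budget of past examples and samples rehearsal batches from it; the budget size and the keep-the-first-N retention rule are arbitrary assumptions.

```python
import random
from collections import defaultdict

class ExemplarMemory:
    """A generic exemplar memory for class-incremental learning (a sketch, not
    the Memory Transformer Network): keep a fixed budget of examples per class
    and replay them alongside new-task data to reduce forgetting."""

    def __init__(self, per_class_budget=20):
        self.per_class_budget = per_class_budget
        self.bank = defaultdict(list)                 # class id -> examples

    def add_task(self, examples):
        """`examples` is an iterable of (x, y) pairs from the task just learned."""
        for x, y in examples:
            if len(self.bank[y]) < self.per_class_budget:
                # Keep the first N per class; real methods use smarter
                # selection (e.g. herding), which is beyond this sketch.
                self.bank[y].append(x)

    def replay_batch(self, size):
        """Sample a rehearsal batch of stored examples to mix into training."""
        pool = [(x, y) for y, xs in self.bank.items() for x in xs]
        return random.sample(pool, min(size, len(pool)))
```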
arXiv Detail & Related papers (2022-10-10T08:27:28Z) - Incremental Online Learning Algorithms Comparison for Gesture and Visual
Smart Sensors [68.8204255655161]
This paper compares four state-of-the-art algorithms in two real applications: gesture recognition based on accelerometer data and image classification.
Our results confirm these systems' reliability and the feasibility of deploying them in tiny-memory MCUs.
arXiv Detail & Related papers (2022-09-01T17:05:20Z) - Quantifying Memorization Across Neural Language Models [61.58529162310382]
Large language models (LMs) have been shown to memorize parts of their training data, and when prompted appropriately, they will emit the memorized data verbatim.
This is undesirable because memorization violates privacy (exposing user data), degrades utility (repeated easy-to-memorize text is often low quality), and hurts fairness (some texts are memorized over others).
We describe three log-linear relationships that quantify the degree to which LMs emit memorized training data.
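A minimal version of the verbatim-emission check implied above: prompt the model with a prefix of a training example and test whether greedy decoding reproduces the true continuation. It assumes a HuggingFace-style causal LM, and the prefix and suffix lengths are illustrative, not the paper's settings.

```python
import torch

def emits_verbatim(model, tokenizer, example_text, prefix_len=50, suffix_len=50):
    """Prompt the model with the first `prefix_len` tokens of a training
    example and check whether greedy decoding reproduces the next
    `suffix_len` tokens exactly. Assumes a HuggingFace-style causal LM;
    the lengths here are arbitrary illustrative choices."""
    ids = tokenizer(example_text, return_tensors="pt").input_ids[0]
    if ids.shape[0] < prefix_len + suffix_len:
        return False                                  # example too short to test
    prefix = ids[:prefix_len].unsqueeze(0)
    true_suffix = ids[prefix_len:prefix_len + suffix_len]
    with torch.no_grad():
        generated = model.generate(
            prefix, max_new_tokens=suffix_len, do_sample=False  # greedy decoding
        )[0][prefix_len:]
    # Different lengths (e.g. early end-of-text) simply count as not memorized.
    return generated.shape == true_suffix.shape and torch.equal(
        generated.cpu(), true_suffix.cpu()
    )
```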
arXiv Detail & Related papers (2022-02-15T18:48:31Z) - Evolving Metric Learning for Incremental and Decremental Features [45.696514400861275]
We develop a new online Evolving Metric Learning model for incremental and decremental features.
Our model can handle instance and feature evolution simultaneously by incorporating a smoothed Wasserstein metric distance.
In addition to tackling the challenges in the one-shot case, we also extend our model to the multi-shot scenario.
arXiv Detail & Related papers (2020-06-27T10:29:38Z)