Retentive or Forgetful? Diving into the Knowledge Memorizing Mechanism of Language Models
- URL: http://arxiv.org/abs/2305.09144v2
- Date: Wed, 13 Mar 2024 12:34:17 GMT
- Title: Retentive or Forgetful? Diving into the Knowledge Memorizing Mechanism of Language Models
- Authors: Boxi Cao, Qiaoyu Tang, Hongyu Lin, Shanshan Jiang, Bin Dong, Xianpei
Han, Jiawei Chen, Tianshu Wang, Le Sun
- Abstract summary: Large-scale pre-trained language models have shown remarkable memorizing ability.
Vanilla neural networks without pre-training have long been observed to suffer from the catastrophic forgetting problem.
We find that 1) Vanilla language models are forgetful; 2) Pre-training leads to retentive language models; 3) Knowledge relevance and diversification significantly influence memory formation.
- Score: 49.39276272693035
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Memory is one of the most essential cognitive functions, serving as a
repository of world knowledge and episodes of activity. In recent years,
large-scale pre-trained language models have shown remarkable memorizing
ability. In contrast, vanilla neural networks without pre-training have long
been observed to suffer from the catastrophic forgetting problem. To
investigate this retentive-forgetful contradiction and understand the memory
mechanism of language models, we conduct thorough experiments that control the
target knowledge types, the learning strategies, and the learning schedules.
We find that: 1) Vanilla language models are forgetful; 2) Pre-training leads
to retentive language models; 3) Knowledge relevance and diversification
significantly influence memory formation. These conclusions are useful for
understanding the abilities of pre-trained language models and shed light on
designing and evaluating new learning and inference algorithms for language
models.
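As a rough illustration of the kind of retention measurement the abstract describes (probing previously taught knowledge after each new learning stage), the sketch below scores an average-forgetting statistic from a matrix of probe accuracies. This is a minimal sketch under stated assumptions, not the paper's protocol: the stage count, the synthetic accuracy values, and the `average_forgetting` helper are all hypothetical.

```python
# Minimal sketch (not the paper's code): quantify forgetting when a language
# model is probed on previously learned knowledge batches after each new
# learning stage. acc[i, j] holds probe accuracy on knowledge batch j measured
# after stage i; here it is filled with synthetic numbers purely for
# illustration (batch j is only learned at stage j, hence the lower triangle).
import numpy as np

rng = np.random.default_rng(0)
num_stages = 4  # hypothetical number of sequential learning stages
acc = np.tril(rng.uniform(0.5, 1.0, size=(num_stages, num_stages)))

def average_forgetting(acc: np.ndarray) -> float:
    """Mean drop from the best accuracy ever reached on each earlier batch
    to its accuracy after the final stage (a common continual-learning metric)."""
    final = acc[-1, :-1]              # accuracy on old batches at the end
    best = acc[:-1, :-1].max(axis=0)  # best accuracy ever seen on each old batch
    return float(np.mean(best - final))

print(f"average forgetting: {average_forgetting(acc):.3f}")
```

A retentive model would keep this statistic close to zero across learning schedules, while a forgetful one would show a large gap between the best and final accuracies on earlier batches.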
Related papers
- Enhancing elusive clues in knowledge learning by contrasting attention of language models [19.37767409898751]
The paper proposes a method to enhance knowledge learning during language model pretraining.
We found that larger language models pay more attention to non-obvious but important clues, which are often overlooked by smaller language models.
arXiv Detail & Related papers (2024-09-26T15:30:54Z)
- Opening the black box of language acquisition [0.0]
We propose an alternative, more transparent and cognitively plausible architecture for learning language.
Instead of using deep learning, our approach uses a minimal cognitive architecture based on sequence memory and chunking.
Results show that the model can learn these artificial languages from scratch and extract grammatical information that supports learning.
arXiv Detail & Related papers (2024-02-18T19:11:58Z)
- Causal Graph in Language Model Rediscovers Cortical Hierarchy in Human Narrative Processing [0.0]
Previous studies have demonstrated that the features of language models can be mapped to fMRI brain activity.
This raises the question: is there a commonality between information processing in language models and the human brain?
To estimate information flow patterns in a language model, we examined the causal relationships between different layers.
arXiv Detail & Related papers (2023-11-17T10:09:12Z)
- Measures of Information Reflect Memorization Patterns [53.71420125627608]
We show that the diversity in the activation patterns of different neurons is reflective of model generalization and memorization.
Importantly, we discover that information organization points to the two forms of memorization, even for neural activations computed on unlabelled in-distribution examples.
arXiv Detail & Related papers (2022-10-17T20:15:24Z)
- Anti-Retroactive Interference for Lifelong Learning [65.50683752919089]
We design a paradigm for lifelong learning based on meta-learning and the associative mechanism of the brain.
It tackles the problem from two aspects: extracting knowledge and memorizing knowledge.
Theoretical analysis shows that the proposed learning paradigm can make the models of different tasks converge to the same optimum.
arXiv Detail & Related papers (2022-08-27T09:27:36Z)
- Neural Language Models are not Born Equal to Fit Brain Data, but Training Helps [75.84770193489639]
We examine the impact of test loss, training corpus and model architecture on the prediction of functional Magnetic Resonance Imaging timecourses of participants listening to an audiobook.
We find that untrained versions of each model already explain a significant amount of signal in the brain by capturing similarity in brain responses across identical words.
We suggest good practices for future studies aiming at explaining the human language system using neural language models.
arXiv Detail & Related papers (2022-07-07T15:37:17Z)
- Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models [64.22311189896888]
We study exact memorization in causal and masked language modeling, across model sizes and throughout the training process.
Surprisingly, we show that larger models can memorize a larger portion of the data before over-fitting and tend to forget less throughout the training process.
arXiv Detail & Related papers (2022-05-22T07:43:50Z)
- Counterfactual Memorization in Neural Language Models [91.8747020391287]
Modern neural language models that are widely used in various NLP tasks risk memorizing sensitive information from their training data.
An open question in previous studies of language model memorization is how to filter out "common" memorization.
We formulate a notion of counterfactual memorization that characterizes how a model's predictions change if a particular document is omitted during training (see the sketch after this list).
arXiv Detail & Related papers (2021-12-24T04:20:57Z)
- Brain-inspired feature exaggeration in generative replay for continual learning [4.682734815593623]
When learning new classes, the internal representation of previously learnt ones can often be overwritten.
Recent developments in neuroscience have uncovered a method through which the brain avoids its own form of memory interference.
This paper presents new state-of-the-art performance on the classification of early classes in the class-incremental learning dataset CIFAR100.
arXiv Detail & Related papers (2021-10-26T10:49:02Z)
- Adaptive Forgetting Curves for Spaced Repetition Language Learning [6.396596455749813]
We explore a variety of forgetting curve models incorporating psychological and linguistic features.
We use these models to predict the probability of word recall by learners of English as a second language.
We find that word complexity is a highly informative feature which may be successfully learned by a neural network model.
arXiv Detail & Related papers (2020-04-23T17:22:38Z)
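The last entry above predicts word-recall probability with forgetting-curve models. As a hedged illustration of the general idea only (not the authors' model, which additionally incorporates psychological and linguistic features), the sketch below evaluates a simple exponential forgetting curve with a per-item half-life; the half-life value and the `recall_probability` function name are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's model): an exponential forgetting
# curve of the kind used in spaced-repetition systems, where recall probability
# decays with the time elapsed since the last review and a per-item half-life.
import math

def recall_probability(elapsed_days: float, half_life_days: float) -> float:
    """p = 2 ** (-elapsed / half_life): chance the learner still recalls the word."""
    return 2.0 ** (-elapsed_days / half_life_days)

# Hypothetical example: a word whose memory half-life is 3 days.
for day in (1, 3, 7, 14):
    print(f"day {day:2d}: p(recall) = {recall_probability(day, half_life_days=3.0):.2f}")
```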
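The "Counterfactual Memorization in Neural Language Models" entry defines memorization through the change in a model's predictions when a document is left out of training. The sketch below is a simplified, hedged rendering of that idea under the assumption that several models have been trained on random data subsets; the function name, subsets, and scores are illustrative placeholders, not the authors' implementation.

```python
# Minimal sketch (an assumption-laden paraphrase, not the authors' code): estimate
# counterfactual memorization of a document as the gap between the average score
# models assign to it when it was in their training subset versus when it was held out.
from statistics import mean
from typing import Sequence, Set

def counterfactual_memorization(
    doc_id: int,
    subset_ids: Sequence[Set[int]],  # training subset used for each trained model
    scores: Sequence[float],         # each model's score (e.g. accuracy) on the document
) -> float:
    seen = [s for ids, s in zip(subset_ids, scores) if doc_id in ids]
    unseen = [s for ids, s in zip(subset_ids, scores) if doc_id not in ids]
    return mean(seen) - mean(unseen)

# Hypothetical numbers purely for illustration:
subsets = [{0, 1}, {0, 2}, {1, 2}, {2, 3}]
doc_scores = [0.9, 0.85, 0.4, 0.35]  # score of document 0 under each model
print(counterfactual_memorization(0, subsets, doc_scores))  # 0.875 - 0.375 = 0.5
```

A large gap suggests the model's good performance on the document depends on having seen that specific document, rather than on generalization from the rest of the data.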
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.