BERTnesia: Investigating the capture and forgetting of knowledge in BERT
- URL: http://arxiv.org/abs/2106.02902v1
- Date: Sat, 5 Jun 2021 14:23:49 GMT
- Title: BERTnesia: Investigating the capture and forgetting of knowledge in BERT
- Authors: Jonas Wallat, Jaspreet Singh, Avishek Anand
- Abstract summary: We probe BERT specifically to understand and measure the relational knowledge it captures in its parametric memory.
Our findings show that knowledge is not just contained in BERT's final layers.
When BERT is fine-tuned, relational knowledge is forgotten.
- Score: 7.304523502384361
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Probing complex language models has recently revealed several insights into
linguistic and semantic patterns found in the learned representations. In this
article, we probe BERT specifically to understand and measure the relational
knowledge it captures in its parametric memory. While probing for linguistic
understanding is commonly applied to all layers of BERT as well as fine-tuned
models, this has not been done for factual knowledge. We utilize existing
knowledge base completion tasks (LAMA) to probe every layer of pre-trained as
well as fine-tuned BERT models (ranking, question answering, NER). Our findings
show that knowledge is not just contained in BERT's final layers. Intermediate
layers contribute a significant amount (17-60%) to the total knowledge found.
Probing intermediate layers also reveals how different types of knowledge
emerge at varying rates. When BERT is fine-tuned, relational knowledge is
forgotten. The extent of forgetting is impacted by the fine-tuning objective
and the training data. We found that ranking models forget the least and retain
more knowledge in their final layer compared to masked language modeling and
question answering. However, masked language modeling performed the best at
acquiring new knowledge from the training data. When it comes to learning
facts, we found that capacity and fact density are key factors. We hope this
initial work will spur further research into understanding the parametric
memory of language models and the effect of training objectives on factual
knowledge. The code to repeat the experiments is publicly available on GitHub.
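To make the layer-wise probing idea concrete, the sketch below applies a LAMA-style cloze query to every layer of a pre-trained BERT by reusing the masked-language-modeling head on each intermediate hidden state. It is a minimal illustration assuming the Hugging Face transformers library and the bert-base-uncased checkpoint; the paper's actual probing templates, relations, and evaluation metrics differ in detail.

```python
# Minimal layer-wise LAMA-style probe for BERT (illustrative sketch only).
# Assumes the Hugging Face transformers library and bert-base-uncased; the
# paper's actual probing setup and metrics are not reproduced here.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# A cloze-style relational query, as in the LAMA benchmark.
query = "The capital of France is [MASK]."
inputs = tokenizer(query, return_tensors="pt")
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    # Run the encoder once and keep the hidden states of every layer.
    outputs = model.bert(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output; 1..12 are the transformer layers.
for layer, hidden in enumerate(outputs.hidden_states[1:], start=1):
    logits = model.cls(hidden)  # reuse the pre-trained MLM head as the probe
    predicted_id = logits[0, mask_index].argmax().item()
    print(f"layer {layer:2d}: {tokenizer.decode([predicted_id])}")
```

Swapping in a fine-tuned checkpoint and comparing its per-layer predictions against the pre-trained model is one simple way to observe the kind of forgetting the abstract describes.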
Related papers
- Does Knowledge Localization Hold True? Surprising Differences Between Entity and Relation Perspectives in Language Models [20.157061521694096]
This study investigates the differences between entity and relational knowledge through knowledge editing.
To further elucidate the differences between entity and relational knowledge, we employ causal analysis to investigate how relational knowledge is stored in pre-trained models.
This insight highlights the multifaceted nature of knowledge storage in language models, underscoring the complexity of manipulating specific types of knowledge within these models.
arXiv Detail & Related papers (2024-09-01T05:09:11Z) - How Large Language Models Encode Context Knowledge? A Layer-Wise Probing Study [27.23388511249688]
This paper investigates the layer-wise capability of large language models to encode knowledge.
We leverage the powerful generative capability of ChatGPT to construct probing datasets.
Experiments on conflicting and newly acquired knowledge show that LLMs prefer to encode more context knowledge in the upper layers.
arXiv Detail & Related papers (2024-02-25T11:15:42Z) - Decouple knowledge from parameters for plug-and-play language modeling [77.5601135412186]
We introduce PlugLM, a pre-training model with a differentiable plug-in memory (DPM).
The key intuition is to decouple the knowledge storage from model parameters with an editable and scalable key-value memory.
PlugLM obtains 3.95 F1 improvements across four domains on average without any in-domain pre-training.
arXiv Detail & Related papers (2023-05-19T10:01:55Z) - Knowledge Rumination for Pre-trained Language Models [77.55888291165462]
We propose a new paradigm dubbed Knowledge Rumination to help the pre-trained language model utilize related latent knowledge without retrieving it from an external corpus.
We apply the proposed knowledge rumination to various language models, including RoBERTa, DeBERTa, and GPT-3.
arXiv Detail & Related papers (2023-05-15T15:47:09Z) - Can BERT Refrain from Forgetting on Sequential Tasks? A Probing Study [68.75670223005716]
We find that pre-trained language models like BERT have the potential to learn sequentially, even without any sparse memory replay.
Our experiments reveal that BERT can generate high-quality representations for previously learned tasks over the long term, under extremely sparse replay or even no replay at all.
arXiv Detail & Related papers (2023-03-02T09:03:43Z) - Knowledge Graph Fusion for Language Model Fine-tuning [0.0]
We investigate the benefits of knowledge incorporation into the fine-tuning stages of BERT.
An existing K-BERT model, which enriches sentences with triplets from a Knowledge Graph, is adapted for the English language.
Changes made to K-BERT for accommodating the English language also extend to other word-based languages.
arXiv Detail & Related papers (2022-06-21T08:06:22Z) - Finding patterns in Knowledge Attribution for Transformers [1.52292571922932]
We use a 12-layer multi-lingual BERT model for our experiments.
We observe that factual knowledge can mostly be attributed to the middle and higher layers of the network.
Applying the attribution scheme for grammatical knowledge, we find that grammatical knowledge is far more dispersed among the neurons than factual knowledge.
arXiv Detail & Related papers (2022-05-03T08:30:51Z) - Towards a Universal Continuous Knowledge Base [49.95342223987143]
We propose a method for building a continuous knowledge base that can store knowledge imported from multiple neural networks.
Experiments on text classification show promising results.
We import the knowledge from multiple models to the knowledge base, from which the fused knowledge is exported back to a single model.
arXiv Detail & Related papers (2020-12-25T12:27:44Z) - BERTnesia: Investigating the capture and forgetting of knowledge in BERT [5.849736173068868]
We probe BERT specifically to understand and measure the relational knowledge it captures.
Intermediate layers contribute a significant amount (17-60%) to the total knowledge found.
When BERT is fine-tuned, relational knowledge is forgotten but the extent of forgetting is impacted by the fine-tuning objective.
arXiv Detail & Related papers (2020-10-19T08:46:30Z) - CoLAKE: Contextualized Language and Knowledge Embedding [81.90416952762803]
We propose the Contextualized Language and Knowledge Embedding (CoLAKE).
CoLAKE jointly learns contextualized representations for both language and knowledge with an extended masked language modeling objective.
We conduct experiments on knowledge-driven tasks, knowledge probing tasks, and language understanding tasks.
arXiv Detail & Related papers (2020-10-01T11:39:32Z) - Common Sense or World Knowledge? Investigating Adapter-Based Knowledge Injection into Pretrained Transformers [54.417299589288184]
We investigate models for complementing the distributional knowledge of BERT with conceptual knowledge from ConceptNet and its corresponding Open Mind Common Sense (OMCS) corpus.
Our adapter-based models substantially outperform BERT on inference tasks that require the type of conceptual knowledge explicitly present in ConceptNet and OMCS.
arXiv Detail & Related papers (2020-05-24T15:49:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.