Protoknowledge Shapes Behaviour of LLMs in Downstream Tasks: Memorization and Generalization with Knowledge Graphs
- URL: http://arxiv.org/abs/2505.15501v1
- Date: Wed, 21 May 2025 13:22:34 GMT
- Title: Protoknowledge Shapes Behaviour of LLMs in Downstream Tasks: Memorization and Generalization with Knowledge Graphs
- Authors: Federico Ranaldi, Andrea Zugarini, Leonardo Ranaldi, Fabio Massimo Zanzotto
- Abstract summary: We introduce the concept of protoknowledge to formalize and measure how sequences of tokens encoding Knowledge Graphs are internalized during pretraining. We categorize protoknowledge into lexical, hierarchical, and topological forms, varying on the type of knowledge that needs to be activated.
- Score: 1.9249287163937978
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce the concept of protoknowledge to formalize and measure how sequences of tokens encoding Knowledge Graphs are internalized during pretraining and utilized at inference time by Large Language Models (LLMs). Indeed, LLMs have demonstrated the ability to memorize vast amounts of token sequences during pretraining, and a central open question is how they leverage this memorization as reusable knowledge through generalization. We then categorize protoknowledge into lexical, hierarchical, and topological forms, varying on the type of knowledge that needs to be activated. We measure protoknowledge through Knowledge Activation Tasks (KATs), analyzing its general properties such as semantic bias. We then investigate the impact of protoknowledge on Text-to-SPARQL performance by varying prompting strategies depending on input conditions. To this end, we adopt a novel analysis framework that assesses whether model predictions align with the successful activation of the relevant protoknowledge for each query. This methodology provides a practical tool to explore Semantic-Level Data Contamination and serves as an effective strategy for Closed-Pretraining models.
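As an illustration of how a Knowledge Activation Task might be probed in practice, here is a minimal Python sketch: it masks one element of a KG triple and treats a correct completion as evidence that the corresponding protoknowledge was activated. The `Triple` type, the `query_llm` callable, the prompt format, and the substring-match criterion are illustrative assumptions, not the paper's exact KAT protocol.

```python
# Hedged sketch of a lexical Knowledge Activation Task (KAT)-style probe.
# Assumptions: `query_llm` is any callable mapping a prompt string to a text
# completion; the prompt format and the matching criterion are illustrative only.

from dataclasses import dataclass
from typing import Callable


@dataclass
class Triple:
    subject: str
    predicate: str
    obj: str


def build_kat_prompt(triple: Triple) -> str:
    """Mask the object of a KG triple and ask the model to restore it."""
    return (
        "Complete the missing element of this Knowledge Graph triple.\n"
        f"({triple.subject}, {triple.predicate}, [MASK])\n"
        "Answer with the missing element only."
    )


def protoknowledge_activated(triple: Triple, query_llm: Callable[[str], str]) -> bool:
    """Treat a correct completion as evidence that the triple's
    protoknowledge was internalized during pretraining."""
    prediction = query_llm(build_kat_prompt(triple)).strip().lower()
    return triple.obj.lower() in prediction


# Usage with a stub model (swap in a real LLM client to run an actual probe):
if __name__ == "__main__":
    triple = Triple("Rome", "capital_of", "Italy")
    stub_llm = lambda prompt: "Italy"
    print(protoknowledge_activated(triple, stub_llm))  # True
```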
Related papers
- Training Plug-n-Play Knowledge Modules with Deep Context Distillation [52.94830874557649]
In this paper, we propose a way of modularizing knowledge by training document-level Knowledge Modules (KMs). KMs are lightweight components implemented as parameter-efficient LoRA modules, which are trained to store information about new documents. Our method outperforms standard next-token prediction and pre-instruction training techniques across two datasets.
arXiv Detail & Related papers (2025-03-11T01:07:57Z) - Detecting Memorization in Large Language Models [0.0]
Large language models (LLMs) have achieved impressive results in natural language processing but are prone to memorizing portions of their training data. Traditional methods for detecting memorization rely on output probabilities or loss functions. We introduce an analytical method that precisely detects memorization by examining neuron activations within the LLM. (A generic activation-probe sketch, not this paper's method, appears after this list.)
arXiv Detail & Related papers (2024-12-02T00:17:43Z) - Exploiting the Semantic Knowledge of Pre-trained Text-Encoders for Continual Learning [63.48785461956983]
Continual learning allows models to learn from new data while retaining previously learned knowledge. The semantic knowledge available in the label information of the images offers important semantic information that can be related to previously acquired knowledge of semantic classes. We propose integrating semantic guidance within and across tasks by capturing semantic similarity using text embeddings.
arXiv Detail & Related papers (2024-08-02T07:51:44Z) - Everything is Editable: Extend Knowledge Editing to Unstructured Data in Large Language Models [65.10456412127405]
We propose a novel Unstructured Knowledge Editing method, namely UnKE. In the layer dimension, we propose non-local block key-value storage to replace local layer key-value storage. In the token dimension, we replace "term-driven optimization" with "cause-driven optimization", which edits the last token directly while preserving context.
arXiv Detail & Related papers (2024-05-24T08:42:40Z) - Do LLMs Dream of Ontologies? [13.776194387957617]
Large Language Models (LLMs) have demonstrated remarkable memorization across diverse natural language processing tasks. This paper investigates the extent to which general-purpose LLMs correctly reproduce concept identifier (ID)-label associations from publicly available resources.
arXiv Detail & Related papers (2024-01-26T15:10:23Z) - Prompting Language Models for Linguistic Structure [73.11488464916668]
We present a structured prompting approach for linguistic structured prediction tasks.
We evaluate this approach on part-of-speech tagging, named entity recognition, and sentence chunking.
We find that while PLMs contain significant prior knowledge of task labels due to task leakage into the pretraining corpus, structured prompting can also retrieve linguistic structure with arbitrary labels.
arXiv Detail & Related papers (2022-11-15T01:13:39Z) - Schema-aware Reference as Prompt Improves Data-Efficient Knowledge Graph Construction [57.854498238624366]
We propose a retrieval-augmented approach, which retrieves schema-aware Reference As Prompt (RAP) for data-efficient knowledge graph construction.
RAP can dynamically leverage schema and knowledge inherited from human-annotated and weak-supervised data as a prompt for each sample.
arXiv Detail & Related papers (2022-10-19T16:40:28Z) - Knowledgeable Salient Span Mask for Enhancing Language Models as Knowledge Base [51.55027623439027]
We develop two solutions to help the model learn more knowledge from unstructured text in a fully self-supervised manner.
To our best knowledge, we are the first to explore fully self-supervised learning of knowledge in continual pre-training.
arXiv Detail & Related papers (2022-04-17T12:33:34Z) - Does Pre-training Induce Systematic Inference? How Masked Language Models Acquire Commonsense Knowledge [91.15301779076187]
We introduce verbalized knowledge into the minibatches of a BERT model during pre-training and evaluate how well the model generalizes to supported inferences.
We find generalization does not improve over the course of pre-training, suggesting that commonsense knowledge is acquired from surface-level co-occurrence patterns rather than induced, systematic reasoning.
arXiv Detail & Related papers (2021-12-16T03:13:04Z) - REALM: Retrieval-Augmented Language Model Pre-Training [37.3178586179607]
We augment language model pre-training with a latent knowledge retriever, which allows the model to retrieve and attend over documents from a large corpus such as Wikipedia.
For the first time, we show how to pre-train such a knowledge retriever in an unsupervised manner.
We demonstrate the effectiveness of Retrieval-Augmented Language Model pre-training (REALM) by fine-tuning on the challenging task of Open-domain Question Answering (Open-QA).
arXiv Detail & Related papers (2020-02-10T18:40:59Z)
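The following is a generic sketch of the activation-based memorization probing idea referenced in the "Detecting Memorization in Large Language Models" entry above: a simple linear probe is trained on hidden-state features to separate memorized from non-memorized sequences. The random features, labels, and probe choice are placeholders and assumptions, not that paper's method.

```python
# Generic, assumption-laden illustration of activation-based memorization
# probing (not the method of the paper listed above). In practice the feature
# matrix would hold hidden states from a chosen transformer layer and the
# labels would mark sequences known to be memorized.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

activations = rng.normal(size=(200, 768))    # placeholder activation vectors
is_memorized = rng.integers(0, 2, size=200)  # placeholder binary labels

X_train, X_test, y_train, y_test = train_test_split(
    activations, is_memorized, test_size=0.25, random_state=0
)

# Fit a linear probe and report how well it separates the two classes.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```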
This list is automatically generated from the titles and abstracts of the papers in this site.