Memory Retrieval and Consolidation in Large Language Models through Function Tokens
- URL: http://arxiv.org/abs/2510.08203v1
- Date: Thu, 09 Oct 2025 13:31:20 GMT
- Title: Memory Retrieval and Consolidation in Large Language Models through Function Tokens
- Authors: Shaohua Zhang, Yuan Lin, Hang Li
- Abstract summary: We propose the function token hypothesis to explain the workings of large language models (LLMs). During inference, function tokens activate the most predictive features from context. We find that during pre-training, the training loss is dominated by predicting the next content tokens following function tokens.
- Score: 9.46824580067366
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The remarkable success of large language models (LLMs) stems from their ability to consolidate vast amounts of knowledge into memory during pre-training and to retrieve it from memory during inference, enabling advanced capabilities such as knowledge memorization, instruction-following and reasoning. However, the mechanisms of memory retrieval and consolidation in LLMs remain poorly understood. In this paper, we propose the function token hypothesis to explain the workings of LLMs: During inference, function tokens activate the most predictive features from context and govern next token prediction (memory retrieval). During pre-training, predicting the next tokens (usually content tokens) that follow function tokens increases the number of learned features of LLMs and updates the model parameters (memory consolidation). Function tokens here roughly correspond to function words in linguistics, including punctuation marks, articles, prepositions, and conjunctions, in contrast to content tokens. We provide extensive experimental evidence supporting this hypothesis. Using bipartite graph analysis, we show that a small number of function tokens activate the majority of features. Case studies further reveal how function tokens activate the most predictive features from context to direct next token prediction. We also find that during pre-training, the training loss is dominated by predicting the next content tokens following function tokens, which forces the function tokens to select the most predictive features from context.
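As a toy illustration of the loss decomposition claimed in the abstract, the sketch below partitions per-token next-token losses by whether the preceding token is a function token. The function-token set and the loss values are illustrative placeholders, not the paper's vocabulary or measurements.

```python
# Minimal sketch (not the authors' code): split next-token losses by
# whether the *preceding* token is a function token. The function-token
# set and the loss values below are toy placeholders.

FUNCTION_TOKENS = {"the", "a", "an", "of", "in", "on", "and", "or", "to",
                   ",", ".", ";", ":"}

def mean(xs):
    return sum(xs) / len(xs) if xs else float("nan")

def loss_by_predecessor(tokens, losses):
    """losses[i] = cross-entropy of predicting tokens[i] from tokens[:i]."""
    after_function = [losses[i] for i in range(1, len(tokens))
                      if tokens[i - 1] in FUNCTION_TOKENS]
    after_content = [losses[i] for i in range(1, len(tokens))
                     if tokens[i - 1] not in FUNCTION_TOKENS]
    return mean(after_function), mean(after_content)

tokens = ["the", "cat", "sat", "on", "the", "mat", "."]
losses = [5.1, 4.2, 3.0, 0.9, 1.1, 3.8, 0.4]  # illustrative values
print(loss_by_predecessor(tokens, losses))    # (after function, after content)
```

Under the hypothesis, the first average (loss on tokens that follow function tokens, which are usually content tokens) would dominate the overall training loss.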
Related papers
- SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodal LLMs [59.415473779171315]
We propose a novel visual token pruning strategy called Saliency-Coverage Oriented token Pruning for Efficient MLLMs (SCOPE).
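The one-line summary gives no algorithmic detail; purely as a hypothetical illustration of a joint saliency-plus-coverage selection objective, here is a greedy sketch. The scoring rule, the cosine-similarity coverage term, and the trade-off weight `alpha` are all assumptions, not SCOPE's published method.

```python
import numpy as np

# Hypothetical greedy sketch of saliency + coverage token selection.
# The objective, coverage term, and weight `alpha` are assumptions
# for illustration, not SCOPE's published algorithm.
def greedy_prune(features, saliency, keep, alpha=0.5):
    """features: (N, d) visual-token embeddings; saliency: (N,) scores."""
    n = features.shape[0]
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = feats @ feats.T              # pairwise cosine similarity
    covered = np.zeros(n)              # best similarity to the kept set so far
    selected = []
    for _ in range(keep):
        # coverage gain: how much each candidate would raise the kept
        # set's maximum similarity to every token in the image
        gains = np.maximum(sim - covered, 0.0).sum(axis=1) / n
        score = alpha * saliency + (1.0 - alpha) * gains
        score[selected] = -np.inf      # never re-select a kept token
        best = int(np.argmax(score))
        selected.append(best)
        covered = np.maximum(covered, sim[best])
    return sorted(selected)

rng = np.random.default_rng(0)
print(greedy_prune(rng.normal(size=(16, 8)), rng.random(16), keep=4))
```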
arXiv Detail & Related papers (2025-10-28T09:29:37Z) - A circuit for predicting hierarchical structure in-context in Large Language Models [19.35678318316516]
Large Language Models (LLMs) excel at in-context learning, the ability to use information provided as context to improve prediction of future tokens. In this study, we design a synthetic in-context learning task, where tokens are repeated with hierarchical dependencies. We find adaptive induction heads that support prediction by learning what to attend to in-context.
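The summary does not specify how the synthetic task is built; a toy generator in the same spirit, where whole blocks of tokens repeat with nesting so that the second copy of each block is predictable in-context, might look like this (purely illustrative, not the paper's construction).

```python
import random

# Toy sketch (not the paper's task): build a sequence in which token
# blocks repeat with nested, hierarchical dependencies, so a model
# must track the structure in-context to predict the repeats.
def hierarchical_sequence(vocab, depth, width, rng):
    if depth == 0:
        return [rng.choice(vocab)]
    block = []
    for _ in range(width):
        block += hierarchical_sequence(vocab, depth - 1, width, rng)
    return block + block  # the second copy is predictable from context

rng = random.Random(0)
print("".join(hierarchical_sequence(list("abcdefgh"), depth=2, width=2, rng=rng)))
```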
arXiv Detail & Related papers (2025-09-25T20:20:23Z) - Improving Large Language Models with Concept-Aware Fine-Tuning [55.59287380665864]
Concept-Aware Fine-Tuning (CAFT) is a novel multi-token training method for large language models (LLMs). CAFT enables the learning of sequences that span multiple tokens, fostering stronger concept-aware learning. Experiments demonstrate significant improvements compared to conventional next-token fine-tuning methods.
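As a hedged sketch of a multi-token objective in the spirit described above (the auxiliary-head design and equal weighting are our assumptions, not necessarily CAFT's formulation), a model can be trained to predict the next k tokens jointly instead of only the immediate next token:

```python
import torch
import torch.nn.functional as F

# Sketch of a multi-token training loss: head j predicts token t+j from
# the hidden state at position t. The head design and equal weighting
# are assumptions for illustration, not necessarily CAFT's objective.
def multi_token_loss(hidden, heads, targets):
    """hidden: (B, T, d); heads: list of k Linear(d, V); targets: (B, T)."""
    T = hidden.size(1)
    losses = []
    for j, head in enumerate(heads, start=1):
        logits = head(hidden[:, : T - j])      # (B, T-j, V)
        losses.append(F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets[:, j:].reshape(-1)))
    return torch.stack(losses).mean()

d, vocab, k = 16, 100, 3
heads = [torch.nn.Linear(d, vocab) for _ in range(k)]
hidden = torch.randn(2, 10, d)                 # stand-in for model states
targets = torch.randint(0, vocab, (2, 10))     # stand-in for token ids
print(multi_token_loss(hidden, heads, targets).item())
```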
arXiv Detail & Related papers (2025-06-09T14:55:00Z) - Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning [53.57895922042783]
Large Language Models (LLMs) excel at reasoning and planning when trained on chain-of-thought (CoT) data. We propose a hybrid representation of the reasoning process, where we partially abstract away the initial reasoning steps using latent discrete tokens.
arXiv Detail & Related papers (2025-02-05T15:33:00Z) - SEP: Self-Enhanced Prompt Tuning for Visual-Language Model [93.94454894142413]
We introduce a novel approach named Self-Enhanced Prompt Tuning (SEP).
SEP explicitly incorporates discriminative prior knowledge to enhance both textual-level and visual-level embeddings.
Comprehensive evaluations across various benchmarks and tasks confirm SEP's efficacy in prompt tuning.
arXiv Detail & Related papers (2024-05-24T13:35:56Z) - Tokenization Is More Than Compression [14.939912120571728]
Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression.
We introduce PathPiece, a new tokenizer that segments a document's text into the minimum number of tokens for a given vocabulary.
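Finding the minimum-token segmentation for a fixed vocabulary is a shortest-path problem over character positions; a straightforward dynamic-programming reconstruction (ours, not the PathPiece implementation) is sketched below.

```python
# DP sketch of minimum-token segmentation (our reconstruction, not the
# PathPiece codebase): dp[i] = fewest tokens covering text[:i].
def min_token_segment(text, vocab, max_len=16):
    n = len(text)
    INF = float("inf")
    dp = [0] + [INF] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for length in range(1, min(max_len, i) + 1):
            piece = text[i - length:i]
            if piece in vocab and dp[i - length] + 1 < dp[i]:
                dp[i] = dp[i - length] + 1
                back[i] = i - length
    if dp[n] == INF:
        raise ValueError("text cannot be covered by the vocabulary")
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

vocab = {"un", "believ", "able", "u", "n", "b", "e", "l", "i", "v", "a"}
print(min_token_segment("unbelievable", vocab))  # ['un', 'believ', 'able']
```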
arXiv Detail & Related papers (2024-02-28T14:52:15Z) - Identifying and Analyzing Performance-Critical Tokens in Large Language Models [52.404072802235234]
We study how large language models learn to perform tasks from demonstrations. Our work sheds light on this learning process and deepens our understanding of the roles different types of tokens play in large language models.
arXiv Detail & Related papers (2024-01-20T20:55:21Z) - Understanding the Role of Input Token Characters in Language Models: How Does Information Loss Affect Performance? [45.53600782873268]
We study how information loss in input token characters affects the performance of pre-training language models.
Surprisingly, we find that even when pre-training under extreme settings, i.e., using only one character of each token, performance retention on standard NLU benchmarks and probing tasks remains high.
For instance, a model pre-trained only on the single first characters of tokens retains approximately 90% and 77% of the full-token model's performance on SuperGLUE and GLUE tasks, respectively.
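The extreme setting above amounts to a simple corpus transformation; a one-function sketch of the "first character of each token" view (illustrative only, not the paper's exact pipeline):

```python
# Sketch of the extreme "single first character" setting: every token
# in the pre-training corpus is replaced by its first character. This
# illustrates the setup only, not the paper's exact pipeline.
def first_char_view(tokens):
    return [t[0] for t in tokens if t]

print(first_char_view(["language", "models", "are", "robust", "."]))
# -> ['l', 'm', 'a', 'r', '.']
```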
arXiv Detail & Related papers (2023-10-26T09:47:50Z) - Predictive Representation Learning for Language Modeling [33.08232449211759]
Correlates of secondary information appear in LSTM representations even though they are not part of an explicitly supervised prediction task.
We propose Predictive Representation Learning (PRL), which explicitly constrains LSTMs to encode specific predictions.
arXiv Detail & Related papers (2021-05-29T05:03:47Z) - Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all listed content) and is not responsible for any consequences arising from its use.