Memory in Large Language Models: Mechanisms, Evaluation and Evolution
- URL: http://arxiv.org/abs/2509.18868v1
- Date: Tue, 23 Sep 2025 10:06:58 GMT
- Title: Memory in Large Language Models: Mechanisms, Evaluation and Evolution
- Authors: Dianxing Zhang, Wendong Li, Kani Song, Jiaye Lu, Gang Li, Liuchun Yang, Sheng Li,
- Abstract summary: We propose a four-part taxonomy (parametric, contextual, external, procedural/episodic) and a memory quadruple (location, persistence, write/access path, controllability). For updating and forgetting, we present DMM Gov: coordinating DAPT/TAPT, PEFT, model editing (ROME, MEND, MEMIT, SERAC), and RAG to form an auditable loop. This yields a reproducible, comparable, and governable coordinate system for research and deployment.
- Score: 8.158439933515131
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Under a unified operational definition, we define LLM memory as a persistent state written during pretraining, finetuning, or inference that can later be addressed and that stably influences outputs. We propose a four-part taxonomy (parametric, contextual, external, procedural/episodic) and a memory quadruple (location, persistence, write/access path, controllability). We link mechanism, evaluation, and governance via the chain write -> read -> inhibit/update. To avoid distorted comparisons across heterogeneous setups, we adopt a three-setting protocol (parametric only, offline retrieval, online retrieval) that decouples capability from information availability on the same data and timeline. On this basis we build a layered evaluation: parametric (closed-book recall, edit differential, memorization/privacy), contextual (position curves and the mid-sequence drop), external (answer correctness vs snippet attribution/faithfulness), and procedural/episodic (cross-session consistency and timeline replay, E MARS+). The framework integrates temporal governance and leakage auditing (freshness hits, outdated answers, refusal slices) and uncertainty reporting via inter-rater agreement plus paired tests with multiple-comparison correction. For updating and forgetting, we present DMM Gov: coordinating DAPT/TAPT, PEFT, model editing (ROME, MEND, MEMIT, SERAC), and RAG to form an auditable loop covering admission thresholds, rollout, monitoring, rollback, and change audits, with specs for timeliness, conflict handling, and long-horizon consistency. Finally, we give four testable propositions: minimum identifiability; a minimal evaluation card; causally constrained editing with verifiable forgetting; and when retrieval with small-window replay outperforms ultra-long-context reading. This yields a reproducible, comparable, and governable coordinate system for research and deployment.
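As a concrete reading of the taxonomy and the three-setting protocol above, the following Python sketch encodes the memory quadruple as a data structure and runs the same question set under each of the three settings. All names here (MemoryQuadruple, Setting, evaluate, and the toy answer/retrieve stand-ins) are illustrative assumptions, not artifacts from the paper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional

# Hypothetical encoding of the four-part taxonomy and the memory quadruple
# (location, persistence, write/access path, controllability).
class MemoryKind(Enum):
    PARAMETRIC = "parametric"
    CONTEXTUAL = "contextual"
    EXTERNAL = "external"
    PROCEDURAL_EPISODIC = "procedural/episodic"

@dataclass
class MemoryQuadruple:
    kind: MemoryKind
    location: str           # e.g. "model weights", "KV cache", "vector store"
    persistence: str        # e.g. "permanent", "per session", "until reindex"
    write_access_path: str  # how the state is written and later addressed
    controllability: str    # e.g. "editable via model editing", "delete from index"

# The three-setting protocol decouples capability from information availability:
# the same questions and timeline are evaluated under each setting.
class Setting(Enum):
    PARAMETRIC_ONLY = "closed-book, no retrieval"
    OFFLINE_RETRIEVAL = "retrieval over a frozen snapshot"
    ONLINE_RETRIEVAL = "retrieval over a live, time-stamped index"

def evaluate(answer_fn: Callable[[str, Optional[str]], str],
             retrieve_fn: Callable[[str, bool], str],
             questions: list[str],
             setting: Setting) -> list[str]:
    """Run one pass over the question set under a single setting (sketch only)."""
    answers = []
    for q in questions:
        context = None
        if setting is not Setting.PARAMETRIC_ONLY:
            context = retrieve_fn(q, setting is Setting.OFFLINE_RETRIEVAL)
        answers.append(answer_fn(q, context))
    return answers

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    answer_fn = lambda q, ctx: f"answer({q}, ctx={'yes' if ctx else 'no'})"
    retrieve_fn = lambda q, frozen: f"snippet for {q} ({'frozen' if frozen else 'live'})"
    for setting in Setting:
        print(setting.name, evaluate(answer_fn, retrieve_fn, ["Who is X?"], setting))
```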
Related papers
- Training Memory in Deep Neural Networks: Mechanisms, Evidence, and Measurement Gaps [1.078600700827543]
This is a protocol for portable, causal, uncertainty-aware measurement of how much training history matters across models, data queues, and audit artifacts.
arXiv Detail & Related papers (2026-01-29T12:26:52Z) - Gated Differentiable Working Memory for Long-Context Language Modeling [80.27483324685434]
We propose Gdwm (Gated Differentiable Working Memory), a framework that introduces a write controller to gate the consolidation process. Experiments on ZeroSCROLLS and LongBench v2 demonstrate that Gdwm achieves comparable or superior performance with 4$\times$ fewer gradient steps than uniform baselines.
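The idea of gating what gets consolidated into memory can be illustrated with a generic sigmoid write gate. The module below is an assumption-laden sketch (the names GatedMemoryWrite and d_model are ours), not the Gdwm architecture itself.

```python
import torch
import torch.nn as nn

class GatedMemoryWrite(nn.Module):
    """Illustrative write gate: a learned scalar gate decides how much of each
    new hidden state is consolidated into a persistent memory slot."""
    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, 1)

    def forward(self, memory: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # memory, hidden: (batch, d_model)
        g = torch.sigmoid(self.gate(torch.cat([memory, hidden], dim=-1)))  # (batch, 1)
        return (1 - g) * memory + g * hidden  # gated consolidation of the new state

if __name__ == "__main__":
    writer = GatedMemoryWrite(d_model=8)
    mem = torch.zeros(2, 8)
    for h in torch.randn(5, 2, 8):  # stream of hidden states
        mem = writer(mem, h)
    print(mem.shape)  # torch.Size([2, 8])
```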
arXiv Detail & Related papers (2026-01-19T10:00:33Z) - Verifiable Fine-Tuning for LLMs: Zero-Knowledge Training Proofs Bound to Data Provenance and Policy [0.0]
We present Verifiable Fine-Tuning, a protocol and system that produces succinct zero-knowledge proofs. We show that the system composes with probabilistic audits and bandwidth constraints. Results indicate that the system is feasible today for real parameter-efficient pipelines.
arXiv Detail & Related papers (2025-10-19T13:33:27Z) - Probing Pre-trained Language Models on Code Changes: Insights from ReDef, a High-Confidence Just-in-Time Defect Prediction Dataset [0.0]
We present ReDef, a high-confidence benchmark of function-level modifications curated from 22 large-scale C/C++ projects. Defective cases are anchored by revert commits, while clean cases are validated through post-hoc history checks. This pipeline yields 3,164 defective and 10,268 clean modifications, offering substantially more reliable labels than existing resources.
arXiv Detail & Related papers (2025-09-11T07:07:11Z) - Unlearning at Scale: Implementing the Right to be Forgotten in Large Language Models [0.0]
Our approach treats the training run as a minimal program and logs a per-microbatch record. Under a pinned stack and deterministic kernels, replaying the training tail yields the same parameters as training on the retain set.
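The replay idea can be illustrated with a toy deterministic SGD loop. This is a generic sketch of exact unlearning via checkpoint-and-replay under stated assumptions (plain SGD, no optimizer state, fully deterministic arithmetic), not the paper's system; all function names are illustrative.

```python
def sgd_step(w, batch, lr=0.1):
    # One deterministic SGD step on a toy squared loss: mean of 2*(w - x) over the batch.
    grad = sum(2.0 * (w - x) for x in batch) / len(batch)
    return w - lr * grad

def train(w0, microbatches):
    """Train and log one record per microbatch (parameters after each step)."""
    w, log = w0, []
    for i, mb in enumerate(microbatches):
        w = sgd_step(w, mb)
        log.append({"step": i, "microbatch": mb, "params": w})
    return w, log

def unlearn_by_replay(w0, log, forget_step):
    """Roll back to the state before the forgotten microbatch, then replay the tail."""
    w = w0 if forget_step == 0 else log[forget_step - 1]["params"]
    for rec in log[forget_step + 1:]:
        w = sgd_step(w, rec["microbatch"])
    return w

if __name__ == "__main__":
    batches = [[1.0, 2.0], [10.0, 12.0], [3.0], [4.0, 5.0]]
    w_full, log = train(0.0, batches)
    w_unlearned = unlearn_by_replay(0.0, log, forget_step=1)
    # With deterministic steps, replaying the tail matches retraining on the retain set.
    w_retrain, _ = train(0.0, batches[:1] + batches[2:])
    assert abs(w_unlearned - w_retrain) < 1e-12
    print(w_full, w_unlearned, w_retrain)
```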
arXiv Detail & Related papers (2025-08-17T03:29:22Z) - Gumbel Reranking: Differentiable End-to-End Reranker Optimization [61.16471123356738]
RAG systems rely on rerankers to identify relevant documents. Fine-tuning these models remains challenging due to the scarcity of annotated query-document pairs. We propose Gumbel Reranking, an end-to-end training framework for rerankers aimed at minimizing the training-inference gap.
arXiv Detail & Related papers (2025-02-16T13:23:39Z) - End-to-End Page-Level Assessment of Handwritten Text Recognition [69.55992406968495]
HTR systems increasingly face the end-to-end page-level transcription of a document.
Standard metrics do not take into account the inconsistencies that might appear.
We propose a two-fold evaluation, where the transcription accuracy and the reading-order (RO) goodness are considered separately.
arXiv Detail & Related papers (2023-01-14T15:43:07Z) - CoP: Factual Inconsistency Detection by Controlling the Preference [45.4045488637761]
We propose CoP, an unsupervised framework that controls the preference of the generation model with the help of a prompt.
With the properly designed prompt, our framework could evaluate specific preferences and serve as measurements for fine-grained categories of inconsistency.
Experiments show that our framework achieves new SOTA results on three factual inconsistency detection tasks.
arXiv Detail & Related papers (2022-12-03T13:05:24Z) - Conformance Checking with Uncertainty via SMT (Extended Version) [66.58864135810981]
We show how to solve the problem of checking conformance of uncertain logs against data-aware reference processes.
Our approach is modular, in that it homogeneously accommodates different types of uncertainty.
We show the correctness of our approach and witness feasibility through a proof-of-concept implementation.
arXiv Detail & Related papers (2022-06-15T11:39:45Z) - Relation Extraction as Open-book Examination: Retrieval-enhanced Prompt Tuning [109.7767515627765]
We propose a new semiparametric paradigm of retrieval-enhanced prompt tuning for relation extraction.
Our model infers relations through the knowledge stored in the weights during training.
Our method achieves state-of-the-art results in both standard supervised and few-shot settings.
arXiv Detail & Related papers (2022-05-04T23:38:37Z) - CoCoMoT: Conformance Checking of Multi-Perspective Processes via SMT (Extended Version) [62.96267257163426]
We introduce the CoCoMoT (Computing Conformance Modulo Theories) framework.
First, we show how SAT-based encodings studied in the pure control-flow setting can be lifted to our data-aware case.
Second, we introduce a novel preprocessing technique based on a notion of property-preserving clustering.
arXiv Detail & Related papers (2021-03-18T20:22:50Z)