MeKi: Memory-based Expert Knowledge Injection for Efficient LLM Scaling
- URL: http://arxiv.org/abs/2602.03359v1
- Date: Tue, 03 Feb 2026 10:32:04 GMT
- Title: MeKi: Memory-based Expert Knowledge Injection for Efficient LLM Scaling
- Authors: Ning Ding, Fangcheng Liu, Kyungrae Kim, Linji Hao, Kyeng-Hun Lee, Hyeonmok Ko, Yehui Tang,
- Abstract summary: Scaling Large Language Models (LLMs) typically relies on increasing the number of parameters or test-time computations to boost performance. MeKi (Memory-based Expert Knowledge Injection) is a novel system that scales LLM capacity via storage space rather than FLOPs. MeKi significantly outperforms dense LLM baselines with identical inference speed.
- Score: 29.784396745475835
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Scaling Large Language Models (LLMs) typically relies on increasing the number of parameters or test-time computations to boost performance. However, these strategies are impractical for edge device deployment due to limited RAM and NPU resources. Despite hardware constraints, deploying performant LLMs on edge devices such as smartphones remains crucial for user experience. To address this, we propose MeKi (Memory-based Expert Knowledge Injection), a novel system that scales LLM capacity via storage space rather than FLOPs. MeKi equips each Transformer layer with token-level memory experts that inject pre-stored semantic knowledge into the generation process. To bridge the gap between training capacity and inference efficiency, we employ a re-parameterization strategy to fold the parameter matrices used during training into a compact static lookup table. By offloading this knowledge to ROM, MeKi decouples model capacity from computational cost, introducing zero inference latency overhead. Extensive experiments demonstrate that MeKi significantly outperforms dense LLM baselines while maintaining identical inference speed, validating the effectiveness of the memory-based scaling paradigm for on-device LLMs. The project homepage is at https://github.com/ningding-o/MeKi.
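To make the described mechanism concrete, below is a minimal PyTorch sketch of a token-level memory expert that is trained as a small parametric branch and then folded into a static per-token lookup table for inference. The class names, shapes, and the additive injection point are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch of a token-level memory expert (assumed design, not the
# authors' exact implementation): trained as a small parametric branch, then
# re-parameterized into a static per-token lookup table that can be stored in
# ROM/flash and injected additively with no extra matrix multiplications.
import torch
import torch.nn as nn


class MemoryExpert(nn.Module):
    """Training-time expert: a token-indexed memory followed by a projection."""

    def __init__(self, vocab_size: int, expert_dim: int, hidden_dim: int):
        super().__init__()
        self.key_table = nn.Embedding(vocab_size, expert_dim)
        self.proj = nn.Linear(expert_dim, hidden_dim, bias=False)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # (batch, seq) -> (batch, seq, hidden_dim)
        return self.proj(self.key_table(token_ids))

    @torch.no_grad()
    def reparameterize(self) -> torch.Tensor:
        # Fold embedding and projection into one (vocab_size, hidden_dim) table,
        # so inference needs only a single lookup per token.
        return self.key_table.weight @ self.proj.weight.T


class LayerWithMemoryInjection(nn.Module):
    """Wraps a Transformer layer and adds the looked-up knowledge to its input."""

    def __init__(self, layer: nn.Module, folded_table: torch.Tensor):
        super().__init__()
        self.layer = layer
        # A buffer, not a parameter: no gradients, and it can be memory-mapped
        # from storage instead of occupying RAM.
        self.register_buffer("memory", folded_table)

    def forward(self, hidden: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        hidden = hidden + self.memory[token_ids]  # no extra FLOPs beyond the add
        return self.layer(hidden)
```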
Related papers
- Adaptive Memory Admission Control for LLM Agents [9.04001220868675]
We propose Adaptive Memory Admission Control (A-MAC), a framework that treats memory admission as a structured decision problem. A-MAC decomposes memory value into five complementary and interpretable factors: future utility, factual confidence, semantic novelty, temporal recency, and content type prior. A-MAC achieves a superior precision-recall tradeoff, improving F1 to 0.583 while reducing latency by 31% compared to state-of-the-art LLM-native memory systems.
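The factor decomposition lends itself to a simple illustration. The sketch below combines the five named factors into one admission score with a weighted sum and a threshold; the weights, threshold, and aggregation rule are placeholders, since the summary does not state how A-MAC combines the factors.

```python
# Illustration only: A-MAC scores each candidate memory along five factors;
# the weighted-sum aggregation, weights, and threshold below are assumptions.
from dataclasses import dataclass


@dataclass
class MemoryCandidate:
    future_utility: float      # expected usefulness for later turns, in [0, 1]
    factual_confidence: float  # how likely the content is correct, in [0, 1]
    semantic_novelty: float    # distance from what is already stored, in [0, 1]
    temporal_recency: float    # freshness of the information, in [0, 1]
    type_prior: float          # prior for the content type (fact, preference, ...)


def admit(c: MemoryCandidate,
          weights=(0.3, 0.25, 0.2, 0.15, 0.1),
          threshold=0.5) -> bool:
    """Return True if the candidate should be written to long-term memory."""
    factors = (c.future_utility, c.factual_confidence, c.semantic_novelty,
               c.temporal_recency, c.type_prior)
    score = sum(w * f for w, f in zip(weights, factors))
    return score >= threshold
```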
arXiv Detail & Related papers (2026-03-04T19:32:02Z) - MemSifter: Offloading LLM Memory Retrieval via Outcome-Driven Proxy Reasoning [78.46301394559903]
Large Language Models (LLMs) are increasingly used for long-duration tasks. Current methods face a trade-off between cost and accuracy. MemSifter is a novel framework that offloads the memory retrieval process to a small-scale proxy model.
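As a rough illustration of offloading retrieval to a proxy, the sketch below lets a small scoring function rank stored memories and forwards only the top-k entries into the large model's prompt. The function names and scoring interface are hypothetical and not MemSifter's actual procedure.

```python
# Hypothetical proxy-offloaded retrieval: a small model scores stored memories
# against the query; only the top-k are placed in the large model's context.
from typing import Callable, List, Tuple


def sift_memories(query: str,
                  memories: List[str],
                  proxy_score: Callable[[str, str], float],  # small proxy model
                  top_k: int = 4) -> List[str]:
    scored: List[Tuple[float, str]] = [(proxy_score(query, m), m) for m in memories]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:top_k]]


def build_prompt(query: str, selected: List[str]) -> str:
    context = "\n".join(f"- {m}" for m in selected)
    return f"Relevant memories:\n{context}\n\nUser: {query}"
```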
arXiv Detail & Related papers (2026-03-03T02:57:38Z) - MemCtrl: Using MLLMs as Active Memory Controllers on Embodied Agents [53.44122827359892]
We propose MemCtrl, a framework that uses Multimodal Large Language Models (MLLMs) to prune memory online. MemCtrl-augmented MLLMs show an improvement of around 16% on average, and over 20% on specific instruction subsets.
arXiv Detail & Related papers (2026-01-28T18:31:17Z) - MemLoRA: Distilling Expert Adapters for On-Device Memory Systems [71.32550994522738]
Memory-augmented Large Language Models (LLMs) have demonstrated remarkable consistency during dialogues. MemLoRA is a novel memory system that integrates small Vision-Language Models. The VLM-integrated MemLoRA-V shows massive improvements over caption-based approaches.
arXiv Detail & Related papers (2025-12-04T12:56:30Z) - Reversing Large Language Models for Efficient Training and Fine-Tuning [24.232966507637673]
Large Language Models (LLMs) are known for their expensive and time-consuming training. We introduce memory-efficient, reversible architectures for LLMs inspired by symmetric and symplectic differential equations. Our results show comparable or improved performance on several datasets and benchmarks.
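A generic RevNet-style reversible coupling block, sketched below, illustrates why reversible architectures save training memory: inputs can be reconstructed exactly from outputs, so activations need not be cached for the backward pass. The paper's symmetric/symplectic formulation may differ from this minimal sketch.

```python
# Generic reversible coupling block (illustration; the paper's symplectic
# scheme may differ). Inputs are exactly recoverable from outputs, so the
# intermediate activations do not have to be stored for backpropagation.
import torch
import torch.nn as nn


class ReversibleBlock(nn.Module):
    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f, self.g = f, g

    def forward(self, x1: torch.Tensor, x2: torch.Tensor):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    @torch.no_grad()
    def inverse(self, y1: torch.Tensor, y2: torch.Tensor):
        # Recompute the inputs from the outputs instead of storing them.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2
```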
arXiv Detail & Related papers (2025-11-27T19:32:15Z) - Mixture-of-Channels: Exploiting Sparse FFNs for Efficient LLMs Pre-Training and Inference [16.71963410333802]
Large language models (LLMs) have demonstrated remarkable success across diverse artificial intelligence tasks. MoC substantially reduces activation memory during pre-training. MoC delivers significant memory savings and throughput gains while maintaining competitive model performance.
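The sketch below shows one generic form of a channel-sparse FFN: per token, only the top-k gate-selected channels carry signal, so only a small slice of the intermediate activation has to be kept. The selection rule and kernel-level details here are illustrative assumptions, not necessarily MoC's recipe.

```python
# Generic channel-sparse FFN sketch (illustrative, not necessarily MoC's recipe):
# per token, only the top-k gate-selected channels contribute, so only a
# (..., k) slice of the intermediate activation needs to be retained.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelSparseFFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int, k: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = F.silu(self.w_gate(x))                    # (..., d_ff)
        topk_vals, topk_idx = gate.topk(self.k, dim=-1)  # keep k channels per token
        up = self.w_up(x).gather(-1, topk_idx)           # values of the kept channels
        sparse_hidden = topk_vals * up                   # (..., k)
        # Scatter back to d_ff before the down projection; written densely here
        # for readability, an optimized kernel would index w_down's rows directly.
        hidden = torch.zeros_like(gate).scatter(-1, topk_idx, sparse_hidden)
        return self.w_down(hidden)
```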
arXiv Detail & Related papers (2025-11-12T13:30:57Z) - MemOS: A Memory OS for AI System [116.87568350346537]
Large Language Models (LLMs) have become an essential infrastructure for Artificial General Intelligence (AGI). Existing models mainly rely on static parameters and short-lived contextual states, limiting their ability to track user preferences or update knowledge over extended periods. MemOS is a memory operating system that treats memory as a manageable system resource.
arXiv Detail & Related papers (2025-07-04T17:21:46Z) - eFedLLM: Efficient LLM Inference Based on Federated Learning [1.6179784294541053]
Large Language Models (LLMs) herald a transformative era in artificial intelligence (AI).
This paper introduces an effective approach that enhances the operational efficiency and affordability of LLM inference.
arXiv Detail & Related papers (2024-11-24T22:50:02Z) - Search for Efficient Large Language Models [52.98684997131108]
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research.
Weight pruning, quantization, and distillation have been embraced to compress LLMs, targeting memory reduction and inference acceleration.
Most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.
arXiv Detail & Related papers (2024-09-25T21:32:12Z) - MobiZO: Enabling Efficient LLM Fine-Tuning at the Edge via Inference Engines [28.18421624702502]
We introduce MobiZO, a resource-efficient fine-tuning framework for Large Language Models (LLMs) specifically designed for edge devices. Experiments demonstrate that MobiZO achieves substantial runtime speedups and memory savings while improving fine-tuning accuracy.
arXiv Detail & Related papers (2024-09-23T20:14:09Z) - MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training [24.066283519769968]
Large Language Models (LLMs) have been trained using extended context lengths to foster more creative applications. We propose MEMO, a novel framework for fine-grained activation memory management. MEMO achieves an average of 1.97x and 1.80x MFU compared to Megatron-LM and DeepSpeed, respectively.
arXiv Detail & Related papers (2024-07-16T18:59:49Z) - Low-Rank Quantization-Aware Training for LLMs [8.535254310145005]
Large language models (LLMs) are omnipresent; however, their practical deployment is challenging due to their ever-increasing computational and memory demands.
We propose LR-QAT -- a lightweight and memory-efficient QAT algorithm for LLMs.
Our method outperforms common post-training quantization (PTQ) approaches and reaches the same model performance as full-model QAT at a fraction of its memory usage.
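As a hedged sketch of the general idea, the code below trains only low-rank factors on top of a frozen base weight while fake-quantizing the combined weight with a straight-through estimator, so full-weight gradients and optimizer states are avoided. The adapter placement, bit-width, and quantizer are assumptions, not LR-QAT's exact scheme.

```python
# Hedged sketch of combining low-rank adapters with quantization-aware training:
# only small factors train; the effective weight is fake-quantized with a
# straight-through estimator (STE). Details are assumed, not LR-QAT's exact design.
import torch
import torch.nn as nn


class FakeQuantSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, scale):
        # Symmetric 4-bit fake quantization.
        return torch.clamp(torch.round(w / scale), -8, 7) * scale

    @staticmethod
    def backward(ctx, grad_out):
        # Straight-through: pass gradients to w unchanged, none for scale.
        return grad_out, None


class LowRankQATLinear(nn.Module):
    def __init__(self, weight: torch.Tensor, rank: int = 16):
        super().__init__()
        out_features, in_features = weight.shape
        self.register_buffer("w0", weight.detach().clone())       # frozen base weight
        self.a = nn.Parameter(torch.zeros(out_features, rank))    # trainable factors
        self.b = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.register_buffer("scale", weight.detach().abs().max() / 7)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_eff = FakeQuantSTE.apply(self.w0 + self.a @ self.b, self.scale)
        return x @ w_eff.T
```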
arXiv Detail & Related papers (2024-06-10T15:44:22Z) - MemLLM: Finetuning LLMs to Use An Explicit Read-Write Memory [49.96019697955383]
We introduce MemLLM, a novel method of enhancing large language models (LLMs) by integrating a structured and explicit read-and-write memory module. Our experiments indicate that MemLLM enhances the LLM's performance and interpretability in language modeling in general and in knowledge-intensive tasks in particular.
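A toy version of an explicit read-write memory, in the spirit of the summary above, is sketched below as a store of (subject, relation, object) triples with write and read calls the model could emit. The interface and retrieval logic are illustrative assumptions rather than MemLLM's actual API.

```python
# Toy explicit read-write memory: facts live as (subject, relation, object)
# triples written and read via explicit calls. Illustrative interface only.
from collections import defaultdict
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]


class TripleMemory:
    def __init__(self):
        self._by_subject = defaultdict(list)

    def write(self, subject: str, relation: str, obj: str) -> None:
        self._by_subject[subject].append((subject, relation, obj))

    def read(self, subject: str, relation: Optional[str] = None) -> List[Triple]:
        triples = self._by_subject.get(subject, [])
        if relation is None:
            return list(triples)
        return [t for t in triples if t[1] == relation]


# Usage: write facts while reading text, query them later during generation.
mem = TripleMemory()
mem.write("MeKi", "scales_via", "storage space")
print(mem.read("MeKi", "scales_via"))  # [('MeKi', 'scales_via', 'storage space')]
```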
arXiv Detail & Related papers (2024-04-17T18:13:16Z) - In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z)