Least Squares Maximum and Weighted Generalization-Memorization Machines
- URL: http://arxiv.org/abs/2308.16456v1
- Date: Thu, 31 Aug 2023 04:48:59 GMT
- Title: Least Squares Maximum and Weighted Generalization-Memorization Machines
- Authors: Shuai Wang, Zhen Wang and Yuan-Hai Shao
- Abstract summary: We propose a new way of remembering by introducing a memory influence mechanism for the least squares support vector machine (LSSVM).
The maximum memory impact model (MIMM) and the weighted impact memory model (WIMM) are then proposed.
- Score: 14.139758779594667
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose a new way of remembering by introducing a memory
influence mechanism for the least squares support vector machine (LSSVM).
Without changing the equality constraints of the original LSSVM, this
mechanism allows an accurate partitioning of the training set without
overfitting. The maximum memory impact model (MIMM) and the weighted impact
memory model (WIMM) are then proposed. It is demonstrated that these models can
be reduced to the LSSVM. Furthermore, we propose several different memory impact
functions for the MIMM and WIMM. The experimental results show that our
MIMM and WIMM have better generalization performance than the LSSVM and
a significant advantage in time cost over other memory models.
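For context, the baseline that MIMM and WIMM build on is the standard LSSVM primal with equality constraints, shown below as a minimal sketch; the memory-influence term itself is defined in the paper and is not reproduced here.

  \min_{w,\,b,\,e}\ \frac{1}{2}\lVert w\rVert^{2} + \frac{\gamma}{2}\sum_{i=1}^{n} e_i^{2}
  \quad \text{s.t.}\quad y_i\big(w^{\top}\phi(x_i) + b\big) = 1 - e_i,\qquad i = 1,\dots,n,

  with decision function f(x) = \operatorname{sign}\big(w^{\top}\phi(x) + b\big).

Since dropping any added memory-influence term leaves exactly this problem, this baseline is consistent with the abstract's statement that MIMM and WIMM can be reduced to the LSSVM.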
Related papers
- Sparse Gradient Compression for Fine-Tuning Large Language Models [58.44973963468691]
Fine-tuning large language models (LLMs) for downstream tasks has become increasingly crucial due to their widespread use and the growing availability of open-source models.
High memory costs associated with fine-tuning remain a significant challenge, especially as models increase in size.
We propose Sparse Gradient Compression (SGC) to address these limitations.
arXiv Detail & Related papers (2025-02-01T04:18:28Z) - SMMF: Square-Matricized Momentum Factorization for Memory-Efficient Optimization [0.5755004576310332]
SMMF is a memory-efficient optimizer that reduces the memory requirement of widely used adaptive learning rate optimizers, such as Adam, by up to 96%.
We conduct a regret bound analysis of SMMF, which shows that it converges similarly to non-memory-efficient adaptive learning rate optimizers, such as AdamNC.
In our experiments, SMMF uses up to 96% less memory than state-of-the-art memory-efficient optimizers, e.g., Adafactor, CAME, and SM3, while achieving comparable model performance.
arXiv Detail & Related papers (2024-12-12T03:14:50Z) - Hermes: Memory-Efficient Pipeline Inference for Large Models on Edge Devices [19.96064012736243]
This paper introduces PIPELOAD, a memory-efficient pipeline execution mechanism.
It reduces memory usage by incorporating dynamic memory management and minimizes inference latency.
We present Hermes, a framework optimized for large model inference on edge devices.
arXiv Detail & Related papers (2024-09-06T12:55:49Z) - MemLLM: Finetuning LLMs to Use An Explicit Read-Write Memory [49.96019697955383]
We introduce MemLLM, a novel method of enhancing large language models (LLMs) by integrating a structured and explicit read-and-write memory module.
Our experiments indicate that MemLLM enhances the LLM's performance and interpretability in language modeling in general and in knowledge-intensive tasks in particular.
arXiv Detail & Related papers (2024-04-17T18:13:16Z) - Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models [73.48675708831328]
We propose a novel parameter- and computation-efficient tuning method for Multi-modal Large Language Models (MLLMs).
The Efficient Attention Skipping (EAS) method evaluates the attention redundancy and skips the less important MHAs to speed up inference.
The experiments show that EAS not only retains high performance and parameter efficiency, but also greatly speeds up inference.
arXiv Detail & Related papers (2024-03-22T14:20:34Z) - MEMORYLLM: Towards Self-Updatable Large Language Models [101.3777486749529]
Existing Large Language Models (LLMs) usually remain static after deployment.
We introduce MEMORYLLM, a model that comprises a transformer and a fixed-size memory pool.
MEMORYLLM can self-update with text knowledge and memorize the knowledge injected earlier.
arXiv Detail & Related papers (2024-02-07T07:14:11Z) - Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference [23.207326766883405]
Mixture-of-Experts (MoE) is able to scale its model size without proportionally scaling up its computational requirements.
Pre-gated MoE employs our novel pre-gating function which alleviates the dynamic nature of sparse expert activation.
We demonstrate that Pre-gated MoE improves performance and reduces GPU memory consumption, while maintaining the same level of model quality.
arXiv Detail & Related papers (2023-08-23T11:25:37Z) - A Model or 603 Exemplars: Towards Memory-Efficient Class-Incremental Learning [56.450090618578]
Class-Incremental Learning (CIL) aims to train a model that adapts to new classes without forgetting old ones under a limited memory size.
We show that when the model size is counted into the total budget and methods are compared at aligned memory sizes, saving models does not consistently work.
We propose a simple yet effective baseline, denoted as MEMO for Memory-efficient Expandable MOdel.
arXiv Detail & Related papers (2022-05-26T08:24:01Z) - Master memory function for delay-based reservoir computers with single-variable dynamics [0.0]
We show that many delay-based reservoir computers can be characterized by a universal master memory function (MMF).
Once computed for two independent parameters, this function provides linear memory capacity for any delay-based single-variable reservoir with small inputs.
arXiv Detail & Related papers (2021-08-28T13:17:24Z) - Semantically Constrained Memory Allocation (SCMA) for Embedding in Efficient Recommendation Systems [27.419109620575313]
A key challenge for deep learning models is to work with millions of categorical classes or tokens.
We propose a novel formulation of memory shared embedding, where memory is shared in proportion to the overlap in semantic information.
We demonstrate a significant reduction in the memory footprint while maintaining performance.
arXiv Detail & Related papers (2021-02-24T19:55:49Z) - Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.