Memory Layers at Scale
- URL: http://arxiv.org/abs/2412.09764v2
- Date: Fri, 20 Dec 2024 17:36:52 GMT
- Title: Memory Layers at Scale
- Authors: Vincent-Pierre Berges, Barlas Oğuz, Daniel Haziza, Wen-tau Yih, Luke Zettlemoyer, Gargi Ghosh,
- Abstract summary: This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale.
On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the budget, as well as mixture-of-expert models when matched for both compute and parameters.
We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained to 1 trillion tokens, comparing to base models with up to 8B parameters.
- Score: 67.00854080570979
- License:
- Abstract: Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, sparsely activated memory layers complement compute-heavy dense feed-forward layers, providing dedicated capacity to store and retrieve information cheaply. This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale. On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the computation budget, as well as mixture-of-expert models when matched for both compute and parameters. We find gains are especially pronounced for factual tasks. We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained to 1 trillion tokens, comparing to base models with up to 8B parameters.
Related papers
- Memory-Efficient Training for Deep Speaker Embedding Learning in Speaker Verification [50.596077598766975]
We explore a memory-efficient training strategy for deep speaker embedding learning in resource-constrained scenarios.
For activations, we design two types of reversible neural networks which eliminate the need to store intermediate activations.
For states, we introduce a dynamic quantization approach that replaces the original 32-bit floating-point values with a dynamic tree-based 8-bit data type.
arXiv Detail & Related papers (2024-12-02T06:57:46Z) - SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z) - B'MOJO: Hybrid State Space Realizations of Foundation Models with Eidetic and Fading Memory [91.81390121042192]
We develop a class of models called B'MOJO to seamlessly combine eidetic and fading memory within an composable module.
B'MOJO's ability to modulate eidetic and fading memory results in better inference on longer sequences tested up to 32K tokens.
arXiv Detail & Related papers (2024-07-08T18:41:01Z) - $\text{Memory}^3$: Language Modeling with Explicit Memory [22.572376536612015]
We equip large language models (LLMs) with explicit memory, a memory format cheaper than model parameters and text retrieval-augmented generation (RAG)
As a preliminary proof of concept, we train from scratch a 2.4B LLM, which achieves better performance than much larger LLMs and RAG models.
We introduce a memory circuitry theory to support the externalization of knowledge, and present novel techniques including a memory sparsification mechanism that makes storage tractable.
arXiv Detail & Related papers (2024-07-01T11:07:23Z) - MEMORY-VQ: Compression for Tractable Internet-Scale Memory [45.7528997281282]
Memory-based methods like LUMEN pre-compute token representations for retrieved passages to drastically speed up inference.
We propose MEMORY-VQ, a new method to reduce storage requirements of memory-augmented models without sacrificing performance.
arXiv Detail & Related papers (2023-08-28T21:11:18Z) - A Model or 603 Exemplars: Towards Memory-Efficient Class-Incremental
Learning [56.450090618578]
Class-Incremental Learning (CIL) aims to train a model with limited memory size to meet this requirement.
We show that when counting the model size into the total budget and comparing methods with aligned memory size, saving models do not consistently work.
We propose a simple yet effective baseline, denoted as MEMO for Memory-efficient Expandable MOdel.
arXiv Detail & Related papers (2022-05-26T08:24:01Z) - Semantically Constrained Memory Allocation (SCMA) for Embedding in
Efficient Recommendation Systems [27.419109620575313]
A key challenge for deep learning models is to work with millions of categorical classes or tokens.
We propose a novel formulation of memory shared embedding, where memory is shared in proportion to the overlap in semantic information.
We demonstrate a significant reduction in the memory footprint while maintaining performance.
arXiv Detail & Related papers (2021-02-24T19:55:49Z) - Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.