Semantically Constrained Memory Allocation (SCMA) for Embedding in
Efficient Recommendation Systems
- URL: http://arxiv.org/abs/2103.06124v1
- Date: Wed, 24 Feb 2021 19:55:49 GMT
- Title: Semantically Constrained Memory Allocation (SCMA) for Embedding in
Efficient Recommendation Systems
- Authors: Aditya Desai, Yanzhou Pan, Kuangyuan Sun, Li Chou, Anshumali
Shrivastava
- Abstract summary: A key challenge for deep learning models is to work with millions of categorical classes or tokens.
We propose a novel formulation of memory shared embedding, where memory is shared in proportion to the overlap in semantic information.
We demonstrate a significant reduction in the memory footprint while maintaining performance.
- Score: 27.419109620575313
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning-based models are utilized to achieve state-of-the-art
performance for recommendation systems. A key challenge for these models is to
work with millions of categorical classes or tokens. The standard approach is
to learn end-to-end, dense latent representations or embeddings for each token.
The resulting embeddings require large amounts of memory that blow up with the
number of tokens. Training and inference with these models create storage and
memory bandwidth bottlenecks, leading to significant computing and energy
consumption when deployed in practice. To this end, we present the problem of
\textit{Memory Allocation} under budget for embeddings and propose a novel
formulation of memory shared embedding, where memory is shared in proportion to
the overlap in semantic information. Our formulation admits a practical and
efficient randomized solution with Locality sensitive hashing based Memory
Allocation (LMA). We demonstrate a significant reduction in the memory
footprint while maintaining performance. In particular, our LMA embeddings
achieve the same performance as standard embeddings with a 16$\times$ reduction
in memory footprint. Moreover, LMA achieves an average improvement of over 0.003
AUC over standard DLRM models across different memory regimes on the Criteo and
Avazu datasets.
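As a rough illustration of the LSH-based memory sharing idea described in the abstract, the sketch below (PyTorch) hashes each token's semantic feature vector with random hyperplanes (SimHash) into a small shared parameter pool, so that tokens with similar features collide in more rows and therefore share more memory. The class name, pool size, number of hash tables, and the mean-pooling of addressed rows are illustrative assumptions, not the paper's exact LMA construction.

```python
# Minimal sketch of LSH-shared embedding memory (assumed names/parameters).
import torch
import torch.nn as nn


class SharedLSHEmbedding(nn.Module):
    """Hypothetical sketch: tokens address a shared memory pool via SimHash."""

    def __init__(self, sem_dim: int, emb_dim: int, pool_rows: int,
                 num_hashes: int = 4, bits_per_hash: int = 8):
        super().__init__()
        self.pool_rows = pool_rows
        # One shared parameter pool replaces the per-token embedding table.
        self.pool = nn.Parameter(torch.randn(pool_rows, emb_dim) * 0.01)
        # Random hyperplanes define SimHash signatures over each token's
        # semantic feature vector (assumed given, e.g. content features).
        self.register_buffer("planes",
                             torch.randn(num_hashes, bits_per_hash, sem_dim))
        self.register_buffer("powers", 2 ** torch.arange(bits_per_hash))

    def forward(self, sem_feats: torch.Tensor) -> torch.Tensor:
        # sem_feats: (batch, sem_dim) semantic features of the looked-up tokens.
        bits = (torch.einsum("hbd,nd->hnb", self.planes, sem_feats) > 0).long()
        # Each bit pattern becomes a bucket id, i.e. a row of the shared pool.
        rows = (bits * self.powers).sum(-1) % self.pool_rows  # (num_hashes, batch)
        # Similar tokens collide in more buckets and thus share more rows;
        # the token embedding is the mean of the addressed rows.
        return self.pool[rows].mean(dim=0)  # (batch, emb_dim)
```

With a fixed pool (e.g. a few hundred thousand rows shared across millions of tokens), the memory footprint is set by the budgeted pool size rather than by the vocabulary size, which is the memory-allocation-under-budget setting the abstract describes.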
Related papers
- Stable Hadamard Memory: Revitalizing Memory-Augmented Agents for Reinforcement Learning [64.93848182403116]
Current deep-learning memory models struggle in partially observable, long-horizon reinforcement learning environments.
We introduce the Stable Hadamard Memory, a novel memory model for reinforcement learning agents.
Our approach significantly outperforms state-of-the-art memory-based methods on challenging partially observable benchmarks.
arXiv Detail & Related papers (2024-10-14T03:50:17Z)
- CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which substantially reduces its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z)
- CAMELoT: Towards Large Language Models with Training-Free Consolidated Associative Memory [38.429707659685974]
Large Language Models (LLMs) struggle to handle long input sequences due to high memory and runtime costs.
We introduce an associative memory module which can be coupled to any pre-trained (frozen) attention-based LLM without re-training.
This architecture, which we call CAMELoT, demonstrates superior performance even with a tiny context window of 128 tokens.
arXiv Detail & Related papers (2024-02-21T01:00:17Z)
- HEAM: Hashed Embedding Acceleration using Processing-In-Memory [17.66751227197112]
In today's data centers, personalized recommendation systems face challenges such as the need for large memory capacity and high bandwidth.
Previous approaches have relied on DIMM-based near-memory processing techniques or introduced 3D-stacked DRAM to address memory-bound issues.
This paper introduces HEAM, a heterogeneous memory architecture that integrates 3D-stacked DRAM with DIMM to accelerate recommendation systems.
arXiv Detail & Related papers (2024-02-06T14:26:22Z)
- Topology-aware Embedding Memory for Continual Learning on Expanding Networks [63.35819388164267]
We present a framework to tackle the memory explosion problem that arises when applying memory replay techniques to continually expanding networks.
PDGNNs with Topology-aware Embedding Memory (TEM) significantly outperform state-of-the-art techniques.
arXiv Detail & Related papers (2024-01-24T03:03:17Z)
- Frustratingly Simple Memory Efficiency for Pre-trained Language Models via Dynamic Embedding Pruning [42.652021176354644]
The memory footprint of pre-trained language models (PLMs) can hinder deployment in memory-constrained settings.
We propose a simple yet effective dynamic embedding pruning approach that minimizes the memory footprint of the embedding matrix.
We show that this approach provides substantial reductions in memory usage across a wide range of models and tasks.
arXiv Detail & Related papers (2023-09-15T19:00:00Z)
- A Model or 603 Exemplars: Towards Memory-Efficient Class-Incremental Learning [56.450090618578]
Class-Incremental Learning (CIL) aims to train a model that continually learns new classes under a limited memory budget.
We show that when counting the model size into the total budget and comparing methods with aligned memory size, saving models does not consistently work.
We propose a simple yet effective baseline, denoted as MEMO for Memory-efficient Expandable MOdel.
arXiv Detail & Related papers (2022-05-26T08:24:01Z)
- Kanerva++: extending the Kanerva Machine with differentiable, locally block allocated latent memory [75.65949969000596]
Episodic and semantic memory are critical components of the human memory model.
We develop a new principled Bayesian memory allocation scheme that bridges the gap between episodic and semantic memory.
We demonstrate that this allocation scheme improves performance in memory conditional image generation.
arXiv Detail & Related papers (2021-02-20T18:40:40Z)
- Neural Network Compression for Noisy Storage Devices [71.4102472611862]
Conventionally, model compression and physical storage are decoupled.
This approach forces the storage to treat each bit of the compressed model equally, and to dedicate the same amount of resources to each bit.
We propose a radically different approach that (i) employs analog memories to maximize the capacity of each memory cell, and (ii) jointly optimizes model compression and physical storage to maximize memory utility.
arXiv Detail & Related papers (2021-02-15T18:19:07Z)
- Distributed Associative Memory Network with Memory Refreshing Loss [5.5792083698526405]
We introduce a novel Distributed Associative Memory architecture (DAM) with Memory Refreshing Loss (MRL).
Inspired by how the human brain works, our framework encodes data with distributed representation across multiple memory blocks.
MRL enables MANN to reinforce an association between input data and task objective by reproducing input data from stored memory contents.
arXiv Detail & Related papers (2020-07-21T07:34:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.