Ultra-Sparse Memory Network
- URL: http://arxiv.org/abs/2411.12364v1
- Date: Tue, 19 Nov 2024 09:24:34 GMT
- Title: Ultra-Sparse Memory Network
- Authors: Zihao Huang, Qiyang Min, Hongzhi Huang, Defa Zhu, Yutao Zeng, Ran Guo, Xun Zhou
- Abstract summary: This work introduces UltraMem, which incorporates a large-scale, ultra-sparse memory layer to address the high memory-access costs that limit MoE-style models at inference.
We show that our method achieves state-of-the-art inference speed and model performance within a given computational budget.
- Score: 8.927205198458994
- License:
- Abstract: It is widely acknowledged that the performance of Transformer models is exponentially related to their number of parameters and computational complexity. While approaches like Mixture of Experts (MoE) decouple parameter count from computational complexity, they still face challenges in inference due to high memory access costs. This work introduces UltraMem, which incorporates a large-scale, ultra-sparse memory layer to address these limitations. Our approach significantly reduces inference latency while maintaining model performance. We also investigate the scaling laws of this new architecture, demonstrating that it not only exhibits favorable scaling properties but also outperforms traditional models. In our experiments, we train networks with up to 20 million memory slots. The results show that our method achieves state-of-the-art inference speed and model performance within a given computational budget.
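To make the memory-layer idea concrete, below is a minimal PyTorch sketch of a sparse memory layer in which each token scores and reads only a handful of slots out of a very large table, using a product-key-style factorization so that scoring never touches all slots. The module name, dimensions, slot count, and the retrieval/fusion scheme are illustrative assumptions, not UltraMem's exact design (which also includes further optimizations for the 20-million-slot regime).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMemoryLayer(nn.Module):
    """Sketch of a large memory layer where each token reads only top-k slots.

    Product-key factorization: n_half ** 2 slots are indexed by pairs of
    half-keys, so scoring touches 2 * n_half keys instead of all slots.
    """
    def __init__(self, d_model=512, d_key=128, n_half=1024, topk=16):
        super().__init__()
        self.n_half, self.topk = n_half, topk
        self.query_proj = nn.Linear(d_model, 2 * d_key)          # two half-queries
        self.half_keys = nn.Parameter(torch.randn(2, n_half, d_key) * 0.02)
        # n_half ** 2 value slots (~1M here; the paper scales to ~20M slots)
        self.values = nn.Embedding(n_half * n_half, d_model)

    def forward(self, x):                        # x: (batch, seq, d_model)
        q1, q2 = self.query_proj(x).chunk(2, dim=-1)
        s1 = q1 @ self.half_keys[0].T            # (batch, seq, n_half)
        s2 = q2 @ self.half_keys[1].T
        v1, i1 = s1.topk(self.topk, dim=-1)      # top-k per half-key table
        v2, i2 = s2.topk(self.topk, dim=-1)
        # combine the two top-k sets into topk*topk candidate slots
        scores = (v1.unsqueeze(-1) + v2.unsqueeze(-2)).flatten(-2)
        slot_ids = (i1.unsqueeze(-1) * self.n_half + i2.unsqueeze(-2)).flatten(-2)
        best, idx = scores.topk(self.topk, dim=-1)               # final top-k slots
        chosen = torch.gather(slot_ids, -1, idx)                 # (batch, seq, k)
        weights = F.softmax(best, dim=-1).unsqueeze(-1)
        return (weights * self.values(chosen)).sum(dim=-2)       # (batch, seq, d_model)
```

A residual connection around such a layer would let it replace or augment the dense FFN in each Transformer block, which is where parameter count can grow without a matching growth in per-token compute or memory traffic.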
Related papers
- LaMamba-Diff: Linear-Time High-Fidelity Diffusion Models Based on Local Attention and Mamba [54.85262314960038]
Local Attentional Mamba blocks capture both global contexts and local details with linear complexity.
Our model exhibits exceptional scalability and surpasses the performance of DiT across various model scales on ImageNet at 256x256 resolution.
Compared to state-of-the-art diffusion models on ImageNet 256x256 and 512x512, our largest model presents notable advantages, such as a reduction of up to 62% GFLOPs.
arXiv Detail & Related papers (2024-08-05T16:39:39Z)
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [4.674454841332859]
Transformer-based models have emerged as one of the most widely used architectures for natural language processing.
These huge models are memory-hungry and incur significant inference latency even on cutting-edge AI accelerators.
We propose LeanAttention, a scalable technique for computing self-attention in the token-generation (decode) phase.
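As a point of reference for what the decode phase computes, here is a small sketch of single-token attention over a cached context, processed in chunks with the associative (online-softmax) merge that makes such partitioning possible. This is not LeanAttention itself; it only illustrates the kind of partitioned computation a hardware-aware decode-phase kernel builds on, and the function name, shapes, and chunk count are assumptions for illustration.

```python
import torch

def decode_step_attention(q, k_cache, v_cache, num_splits=4):
    """Single-token (decode-phase) attention over a cached context.

    q:        (n_heads, d_head)            query for the newly generated token
    k_cache:  (n_heads, ctx_len, d_head)   cached keys
    v_cache:  (n_heads, ctx_len, d_head)   cached values
    The context is processed in chunks and the partial softmaxes are merged,
    the associative trick that lets the work be spread across compute units.
    """
    scale = q.shape[-1] ** -0.5
    m = torch.full((q.shape[0], 1), float("-inf"))   # running max per head
    denom = torch.zeros(q.shape[0], 1)               # running softmax denominator
    out = torch.zeros_like(q)                        # running weighted sum
    for k_c, v_c in zip(k_cache.chunk(num_splits, dim=1), v_cache.chunk(num_splits, dim=1)):
        s = torch.einsum("hd,hld->hl", q, k_c) * scale          # (n_heads, chunk)
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        correction = torch.exp(m - m_new)                        # rescale old partials
        p = torch.exp(s - m_new)
        denom = denom * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + torch.einsum("hl,hld->hd", p, v_c)
        m = m_new
    return out / denom
```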
arXiv Detail & Related papers (2024-05-17T00:52:39Z)
- AI and Memory Wall [81.06494558184049]
We show how memory bandwidth can become the dominant bottleneck for decoder models.
We argue for a redesign in model architecture, training, and deployment strategies to overcome this memory limitation.
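A quick back-of-the-envelope calculation shows why decoding is bandwidth-limited: at batch size 1, every weight must be streamed from memory for each generated token, so arithmetic intensity is roughly 1 FLOP per byte, far below what modern accelerators need to stay compute-bound. The model size and hardware numbers below are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope check of why single-stream decoding is memory-bound.
params = 7e9          # model parameters (assumed)
bytes_per_param = 2   # fp16 weights
flops_per_token = 2 * params                 # ~2 FLOPs per parameter per token
bytes_per_token = params * bytes_per_param   # every weight is read once per token

peak_flops = 312e12   # e.g. A100 fp16 tensor-core peak, FLOP/s
peak_bw = 2.0e12      # e.g. A100 HBM bandwidth, bytes/s

arithmetic_intensity = flops_per_token / bytes_per_token   # = 1 FLOP/byte
machine_balance = peak_flops / peak_bw                     # ~156 FLOP/byte
print(f"intensity {arithmetic_intensity:.1f} vs balance {machine_balance:.0f}: "
      "decoding is limited by memory bandwidth, not compute")
print(f"bandwidth-bound ceiling ~ {peak_bw / bytes_per_token:.0f} tokens/s")
```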
arXiv Detail & Related papers (2024-03-21T04:31:59Z)
- Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference [23.207326766883405]
Mixture-of-Experts (MoE) is able to scale its model size without proportionally scaling up its computational requirements.
Pre-gated MoE employs our novel pre-gating function which alleviates the dynamic nature of sparse expert activation.
We demonstrate that Pre-gated MoE improves performance and reduces GPU memory consumption, while maintaining the same level of model quality.
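The sketch below illustrates the pre-gating idea in a stripped-down form: block l computes the routing decision for block l+1, so the selected expert's weights can be fetched ahead of time instead of on demand. Sequence-level top-1 routing, the class name, and all hyperparameters are simplifications assumed here; the paper routes at a finer granularity and co-designs the system side.

```python
import torch
import torch.nn as nn

class PreGatedMoEBlock(nn.Module):
    """Sketch of pre-gating: this block picks the expert for the NEXT block."""
    def __init__(self, d_model, d_ff, n_experts):
        super().__init__()
        self.next_gate = nn.Linear(d_model, n_experts)   # scores experts of the next block
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x, expert_id: int):
        # expert_id was chosen (and its weights could be prefetched) by the previous block
        y = x + self.experts[expert_id](x)
        # decide the expert for the next block now, so its weights can be loaded early
        next_expert = int(self.next_gate(y.mean(dim=(0, 1))).argmax())
        return y, next_expert
```

In a full pipeline, the returned next_expert would trigger an asynchronous copy of that expert's weights toward GPU memory while the current block's remaining computation runs, hiding the transfer latency that otherwise stalls sparse-expert inference.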
arXiv Detail & Related papers (2023-08-23T11:25:37Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
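The following toy decomposition shows the flavor of the dense-and-sparse idea: a small fraction of large-magnitude outliers is kept exactly in a sparse matrix, while the dense remainder is mapped to a small non-uniform codebook (8 levels, i.e. 3-bit). The quantile-based codebook here stands in for SqueezeLLM's sensitivity-weighted clustering, and the outlier fraction is an assumed value.

```python
import torch

def dense_and_sparse_decompose(w, outlier_frac=0.005, n_levels=8):
    """Toy dense-and-sparse split followed by non-uniform (codebook) quantization."""
    flat = w.abs().flatten()
    k = max(1, int(outlier_frac * flat.numel()))
    threshold = flat.topk(k).values.min()
    outlier_mask = w.abs() >= threshold
    sparse = (w * outlier_mask).to_sparse()          # few full-precision outliers

    dense = w * ~outlier_mask
    # quantile-based codebook: one centroid per quantization level
    qs = torch.linspace(0, 1, 2 * n_levels + 1)[1::2]
    codebook = torch.quantile(dense.flatten(), qs)   # (n_levels,)
    idx = torch.argmin((dense.unsqueeze(-1) - codebook).abs(), dim=-1)
    dense_q = codebook[idx]                          # dequantized dense part
    return dense_q, sparse, codebook, idx

w = torch.randn(256, 256)
dense_q, sparse, codebook, idx = dense_and_sparse_decompose(w)
w_hat = dense_q + sparse.to_dense()
print("relative reconstruction error:", ((w - w_hat).norm() / w.norm()).item())
```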
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
- DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale [20.558091867632445]
DeepSpeed Inference is a comprehensive system solution for transformer model inference.
It reduces latency by up to 7.3x over the state-of-the-art for latency-oriented scenarios and increases throughput by over 1.5x for throughput-oriented scenarios.
It can run inference on models 25x larger than GPU-only solutions allow, while delivering a high throughput of 84 TFLOPS (over 50% of A6000 peak).
arXiv Detail & Related papers (2022-06-30T18:01:08Z)
- A Model or 603 Exemplars: Towards Memory-Efficient Class-Incremental Learning [56.450090618578]
Class-Incremental Learning (CIL) aims to train a model that can adapt to new classes within a limited memory budget.
We show that when counting the model size into the total budget and comparing methods with aligned memory size, saving models does not consistently work.
We propose a simple yet effective baseline, denoted as MEMO for Memory-efficient Expandable MOdel.
arXiv Detail & Related papers (2022-05-26T08:24:01Z)
- Learned Queries for Efficient Local Attention [11.123272845092611]
The self-attention mechanism in vision transformers suffers from high latency and inefficient memory utilization.
We propose a new shift-invariant local attention layer, called query and attend (QnA), that aggregates the input locally in an overlapping manner.
We show improvements in speed and memory complexity while achieving comparable accuracy with state-of-the-art models.
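A minimal 1-D version of the learned-query idea is sketched below: a fixed set of learned query vectors attends to the keys and values of each local window, so per-window cost is fixed and independent of window content. QnA itself operates on 2-D feature maps with overlapping, shift-invariant windows; the non-overlapping 1-D windows and names here are simplifications assumed for illustration.

```python
import torch
import torch.nn as nn

class LearnedQueryLocalAttention(nn.Module):
    """Toy 1-D local attention with learned (input-independent) queries."""
    def __init__(self, dim, window=8, n_queries=1):
        super().__init__()
        self.window = window
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)  # learned queries
        self.kv = nn.Linear(dim, 2 * dim)

    def forward(self, x):                        # x: (batch, seq, dim), seq % window == 0
        b, s, d = x.shape
        k, v = self.kv(x).chunk(2, dim=-1)
        k = k.view(b, s // self.window, self.window, d)
        v = v.view(b, s // self.window, self.window, d)
        attn = torch.einsum("qd,bwld->bwql", self.queries, k) * d ** -0.5
        attn = attn.softmax(dim=-1)
        out = torch.einsum("bwql,bwld->bwqd", attn, v)   # (b, n_windows, n_queries, d)
        return out.flatten(1, 2)                          # one output per (window, query)
```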
arXiv Detail & Related papers (2021-12-21T18:52:33Z)
- Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
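To show how constant-memory sequence processing can work, here is a toy segment-recurrent module: the sequence is consumed in fixed-size segments, each segment cross-attends to a fixed pool of memory slots and then updates them, so the state carried between segments never grows with sequence length. The slot count, module names, and read/write scheme are assumptions for illustration; Memformer's actual memory writing and forgetting mechanisms are more elaborate.

```python
import torch
import torch.nn as nn

class SegmentRecurrentMemory(nn.Module):
    """Toy segment-level recurrence with a fixed pool of memory slots."""
    def __init__(self, dim=256, n_slots=16, n_heads=4):
        super().__init__()
        self.init_memory = nn.Parameter(torch.randn(n_slots, dim) * 0.02)
        self.read = nn.MultiheadAttention(dim, n_heads, batch_first=True)   # tokens read memory
        self.write = nn.MultiheadAttention(dim, n_heads, batch_first=True)  # memory reads tokens
        self.encoder = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)

    def forward(self, x, segment_len=128):       # x: (batch, seq, dim)
        memory = self.init_memory.expand(x.size(0), -1, -1)
        outputs = []
        for segment in x.split(segment_len, dim=1):
            h = self.encoder(segment)                                   # local self-attention
            h = h + self.read(h, memory, memory, need_weights=False)[0]
            memory = memory + self.write(memory, h, h, need_weights=False)[0]
            outputs.append(h)
        return torch.cat(outputs, dim=1), memory
```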
arXiv Detail & Related papers (2020-10-14T09:03:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it lists and is not responsible for any consequences arising from its use.