FlashMem: Supporting Modern DNN Workloads on Mobile with GPU Memory Hierarchy Optimizations
- URL: http://arxiv.org/abs/2602.15379v1
- Date: Tue, 17 Feb 2026 06:23:40 GMT
- Title: FlashMem: Supporting Modern DNN Workloads on Mobile with GPU Memory Hierarchy Optimizations
- Authors: Zhihao Shu, Md Musfiqur Rahman Sanim, Hangyu Zheng, Kunxiong Zhu, Miao Yin, Gagan Agrawal, Wei Niu
- Abstract summary: FlashMem is a memory streaming framework designed to efficiently execute large-scale modern deep neural networks and multi-DNN workloads. Results on 11 models demonstrate that FlashMem achieves 2.0x to 8.4x memory reduction and 1.7x to 75.0x speedup compared to existing frameworks.
- Score: 10.92493656178845
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The increasing size and complexity of modern deep neural networks (DNNs) pose significant challenges for on-device inference on mobile GPUs, with limited memory and computational resources. Existing DNN acceleration frameworks primarily deploy a weight preloading strategy, where all model parameters are loaded into memory before execution on mobile GPUs. We posit that this approach is not adequate for modern DNN workloads that comprise very large model(s) and possibly execution of several distinct models in succession. In this work, we introduce FlashMem, a memory streaming framework designed to efficiently execute large-scale modern DNNs and multi-DNN workloads while minimizing memory consumption and reducing inference latency. Instead of fully preloading weights, FlashMem statically determines model loading schedules and dynamically streams them on demand, leveraging 2.5D texture memory to minimize data transformations and improve execution efficiency. Experimental results on 11 models demonstrate that FlashMem achieves 2.0x to 8.4x memory reduction and 1.7x to 75.0x speedup compared to existing frameworks, enabling efficient execution of large-scale models and multi-DNN support on resource-constrained mobile GPUs.
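The streaming idea can be pictured with a short sketch. The code below only illustrates on-demand, double-buffered weight loading under an assumed static schedule; the function names (load_weights, run_layer), the file-per-layer layout, and the threading scheme are hypothetical and do not reflect FlashMem's actual API or its 2.5D texture-memory layout.

```python
# Minimal sketch of on-demand weight streaming (not FlashMem's actual implementation).
# Assumptions: each layer's weights live in a per-layer file, a static schedule lists
# the layer order, and compute and I/O can overlap via a background thread.
import threading
import numpy as np

def load_weights(path):
    """Simulate streaming one layer's weights from storage into memory."""
    return np.fromfile(path, dtype=np.float32)

def run_layer(x, w):
    """Placeholder for a mobile-GPU kernel launch using this layer's weights."""
    return x + w.mean()

def stream_inference(x, schedule):
    """Execute layers in schedule order, prefetching layer i+1 while running layer i."""
    buf = {"next": None}

    def prefetch(path):
        buf["next"] = load_weights(path)

    current = load_weights(schedule[0])
    for i, path in enumerate(schedule):
        t = None
        if i + 1 < len(schedule):                      # start loading the next layer early
            t = threading.Thread(target=prefetch, args=(schedule[i + 1],))
            t.start()
        x = run_layer(x, current)                      # compute overlaps with the prefetch
        if t is not None:
            t.join()
            current = buf["next"]                      # only one extra layer is ever resident
    return x
```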
Related papers
- MNN-LLM: A Generic Inference Engine for Fast Large Language Model Deployment on Mobile Devices [4.385815629175844]
MNN-LLM is a framework specifically designed to accelerate the deployment of large language models on mobile devices. It addresses the runtime characteristics of LLMs through model quantization and DRAM-Flash hybrid storage. Notably, MNN-LLM achieves up to an 8.6x speed increase compared to current mainstream LLM-specific frameworks.
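A rough picture of the DRAM-Flash hybrid storage idea (not MNN-LLM's implementation): keep a small set of hot tensors resident in DRAM and leave the rest memory-mapped on flash so the OS pages them in only when touched. The HybridWeightStore class and manifest format below are invented for illustration.

```python
# Sketch of a DRAM-Flash hybrid weight store (illustrative only).
import numpy as np

class HybridWeightStore:
    def __init__(self, manifest, hot_names):
        # manifest: {tensor_name: (path, shape)}; hot_names: tensors kept resident in DRAM.
        self.hot = {}
        self.cold = {}
        for name, (path, shape) in manifest.items():
            mapped = np.memmap(path, dtype=np.float16, mode="r", shape=shape)
            if name in hot_names:
                self.hot[name] = np.array(mapped)   # materialize in DRAM
            else:
                self.cold[name] = mapped            # stay on flash, demand-paged by the OS

    def get(self, name):
        if name in self.hot:
            return self.hot[name]
        return self.cold[name]
```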
arXiv Detail & Related papers (2025-06-12T07:45:29Z) - 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float [52.079202872069835]
Large-scale AI models, such as Large Language Models (LLMs) and Diffusion Models (DMs), have grown rapidly in size. We introduce Dynamic-Length Float (DFloat11), a compression framework that reduces LLM and DM size by 30% while preserving outputs that are bit-for-bit identical to the original model.
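The lossless angle can be sketched as follows. DFloat11 itself targets bfloat16 weights with a custom dynamic-length code; the toy below only demonstrates the same general principle, splitting float16 weights into a low-entropy exponent stream compressed with an off-the-shelf codec and verifying a bit-for-bit round trip.

```python
# Illustrative sketch of lossless weight compression by coding the low-entropy float
# fields separately; uses float16 + zlib, not the paper's actual DFloat11 format.
import zlib
import numpy as np

def compress(w16):
    bits = w16.view(np.uint16)
    exponent = ((bits >> 10) & 0x1F).astype(np.uint8)   # 5-bit exponent field, low entropy
    rest = bits & 0x83FF                                 # sign + 10-bit mantissa
    return zlib.compress(exponent.tobytes(), 9), rest.tobytes(), w16.shape

def decompress(blob):
    exp_z, rest_bytes, shape = blob
    exponent = np.frombuffer(zlib.decompress(exp_z), dtype=np.uint8).astype(np.uint16)
    rest = np.frombuffer(rest_bytes, dtype=np.uint16)
    bits = rest | (exponent << 10)
    return bits.view(np.float16).reshape(shape)

w = np.random.randn(1024).astype(np.float16)
assert np.array_equal(w, decompress(compress(w)))        # bit-for-bit identical round trip
```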
arXiv Detail & Related papers (2025-04-15T22:38:38Z) - AI and Memory Wall [81.06494558184049]
We show how memory bandwidth can become the dominant bottleneck for decoder models.
We argue for a redesign in model architecture, training, and deployment strategies to overcome this memory limitation.
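A back-of-the-envelope calculation makes the argument concrete for batch-1 decoding, where every generated token must stream all weights once. The hardware numbers below are illustrative assumptions, not figures from the paper.

```python
# Rough "memory wall" check for batch-1 decoding. All hardware numbers are assumptions.
params = 7e9                 # 7B-parameter decoder
bytes_per_param = 2          # fp16 weights
flops_per_token = 2 * params # one multiply-add per parameter per generated token

bandwidth = 100e9            # assumed 100 GB/s memory bandwidth (mobile-class)
peak_flops = 2e12            # assumed 2 TFLOP/s fp16 compute

t_memory = params * bytes_per_param / bandwidth   # time to stream all weights once
t_compute = flops_per_token / peak_flops          # time to do the arithmetic

print(f"memory-bound time per token : {t_memory*1e3:.1f} ms")   # ~140 ms
print(f"compute-bound time per token: {t_compute*1e3:.1f} ms")  # ~7 ms
# Under these assumptions weight streaming dwarfs compute, so decoding speed is set
# by memory bandwidth, not FLOPs.
```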
arXiv Detail & Related papers (2024-03-21T04:31:59Z) - Demand Layering for Real-Time DNN Inference with Minimized Memory Usage [2.5768647103950357]
Deep neural network (DNN) model parameters are loaded into GPU memory before execution.
We present Demand Layering, which exploits the layer-by-layer execution of DNNs.
Our implementation shows a 96.5% memory reduction with just 14.8% delay overhead on average.
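The memory saving can be approximated with a quick calculation: keep only the current layer plus one prefetched layer resident instead of the whole model. The layer sizes below are placeholders; the paper's 96.5% figure comes from its actual pipelined implementation rather than this coarse layer-level model.

```python
# Rough comparison of peak weight memory: full preloading vs. demand (layer-by-layer)
# loading with one layer prefetched ahead. Layer sizes are made-up placeholders.
layer_bytes = [12e6, 25e6, 25e6, 50e6, 50e6, 100e6, 8e6]

preload_peak = sum(layer_bytes)                             # everything resident at once
demand_peak = max(
    layer_bytes[i] + layer_bytes[i + 1]                     # current layer + prefetched next
    for i in range(len(layer_bytes) - 1)
)

print(f"preload peak : {preload_peak/1e6:.0f} MB")
print(f"demand peak  : {demand_peak/1e6:.0f} MB")
print(f"reduction    : {(1 - demand_peak/preload_peak)*100:.1f} %")
```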
arXiv Detail & Related papers (2022-10-08T13:38:48Z) - Harmony: Overcoming the hurdles of GPU memory capacity to train massive DNN models on commodity servers [13.620650014358413]
Deep neural networks (DNNs) have grown exponentially in complexity and size over the past decade.
One of the main challenges for researchers with access to only limited resources is that GPU memory capacity is small compared to model size.
arXiv Detail & Related papers (2022-02-02T22:16:27Z) - Phantom: A High-Performance Computational Core for Sparse Convolutional Neural Networks [3.198144010381572]
Sparse convolutional neural networks (CNNs) have gained significant traction over the past few years.
If exploited appropriately, they can drastically decrease model size and computation compared to their dense counterparts.
Recently proposed sparse accelerators like SCNN, Eyeriss v2, and SparTen, actively exploit the two-sided or full sparsity, that is, sparsity in both weights and activations, for performance gains.
These accelerators either have an inefficient micro-architecture, which limits their performance, lack support for non-unit-stride convolutions and fully-connected layers, or suffer
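As a software analogy (not the Phantom micro-architecture), two-sided sparsity means a multiply is issued only where both the weight and the activation are nonzero, as in the toy below.

```python
# Toy illustration of two-sided (weight AND activation) sparsity: only positions where
# both operands are nonzero contribute, so a sparse engine can skip everything else.
import numpy as np

def two_sided_sparse_dot(w, a):
    # Iterate over the sparser operand's nonzeros, then keep only positions where
    # the other operand is also nonzero.
    nz = np.flatnonzero(w) if np.count_nonzero(w) < np.count_nonzero(a) else np.flatnonzero(a)
    useful = [i for i in nz if w[i] != 0 and a[i] != 0]
    return sum(w[i] * a[i] for i in useful), len(useful)

w = np.array([0, 0.5, 0, -1.0, 0, 0, 2.0, 0])   # pruned weights
a = np.array([1.0, 0, 0, 3.0, 0, 0, 4.0, 0])    # post-ReLU activations
result, macs = two_sided_sparse_dot(w, a)
print(result, "using", macs, "MACs instead of", len(w))   # -> 5.0 using 2 MACs instead of 8
```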
arXiv Detail & Related papers (2021-11-09T08:43:03Z) - Towards Memory-Efficient Neural Networks via Multi-Level in situ Generation [10.563649948220371]
Deep neural networks (DNNs) have shown superior performance in a variety of tasks.
As they rapidly evolve, their escalating computation and memory demands make it challenging to deploy them on resource-constrained edge devices.
We propose a general and unified framework to trade expensive memory transactions with ultra-fast on-chip computations.
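The trade-off can be illustrated with a plain low-rank stand-in for the paper's multi-level in-situ generation: store small factors and regenerate the full weight tile with cheap compute instead of fetching it from DRAM. The shapes and rank below are arbitrary.

```python
# Sketch of trading memory transactions for compute: store small factors and regenerate
# the full weights on the fly. A low-rank factorization stands in for the paper's scheme.
import numpy as np

out_dim, in_dim, rank = 512, 512, 16
U = np.random.randn(out_dim, rank).astype(np.float32)   # stored: 512*16 values
V = np.random.randn(rank, in_dim).astype(np.float32)    # stored: 16*512 values

def forward(x):
    # Equivalent to (U @ V) @ x, but the 512x512 matrix is never stored or fetched.
    return U @ (V @ x)

y = forward(np.random.randn(in_dim).astype(np.float32))
stored, full = U.size + V.size, out_dim * in_dim
print(f"stored params: {stored}  vs  full: {full}  ({full/stored:.0f}x fewer weight fetches)")
```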
arXiv Detail & Related papers (2021-08-25T18:50:24Z) - SmartDeal: Re-Modeling Deep Network Weights for Efficient Inference and Training [82.35376405568975]
Deep neural networks (DNNs) come with heavy parameterization, leading to external dynamic random-access memory (DRAM) for storage.
We present SmartDeal (SD), an algorithm framework to trade higher-cost memory storage/access for lower-cost computation.
We show that SD leads to 10.56x and 4.48x reductions in storage and training energy, respectively, with negligible accuracy loss compared to state-of-the-art training baselines.
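A sketch of the storage-for-compute trade follows, with synthetic placeholders rather than SmartDeal's actual decomposition algorithm: a tiny dense basis plus a sparse coefficient matrix whose nonzeros are signed powers of two is stored, and the full weight matrix is rebuilt at use time.

```python
# Illustrative storage-for-compute trade (Ce and B are synthetic, not a learned decomposition).
import numpy as np

out_dim, in_dim, basis = 256, 256, 8
B = np.random.randn(basis, in_dim).astype(np.float32)        # small dense basis

# Sparse coefficients whose nonzeros are signed powers of two (cheap to store and decode).
Ce = np.zeros((out_dim, basis), dtype=np.float32)
mask = np.random.rand(out_dim, basis) < 0.25                  # ~25% nonzero
Ce[mask] = np.sign(np.random.randn(mask.sum())) * 2.0 ** np.random.randint(-3, 2, mask.sum())

W = Ce @ B                                                    # full weights regenerated on the fly
dense_storage = out_dim * in_dim * 4                          # fp32 bytes for the full matrix
smart_storage = B.nbytes + int(mask.sum())                    # ~1 byte/nonzero, ignoring index overhead
print(f"dense: {dense_storage} B   re-modeled: ~{smart_storage} B")
```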
arXiv Detail & Related papers (2021-01-04T18:54:07Z) - Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
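A toy version of the memory-augmented idea: process the sequence segment by segment while carrying a fixed number of memory slots, so state stays constant regardless of length. The update rule below is a generic attention-style write, not Memformer's architecture.

```python
# Fixed-size memory slots over a long sequence: O(num_slots * dim) state, not O(seq_len).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def process(sequence, num_slots=8, dim=16, segment=32):
    memory = np.zeros((num_slots, dim))
    for start in range(0, len(sequence), segment):
        seg = sequence[start:start + segment]            # only one segment in RAM at a time
        attn = softmax(memory @ seg.T)                   # which tokens each slot attends to
        memory = 0.9 * memory + 0.1 * (attn @ seg)       # gated write into the fixed slots
    return memory

tokens = np.random.randn(10_000, 16)
print(process(tokens).shape)   # (8, 16) no matter how long the input is
```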
arXiv Detail & Related papers (2020-10-14T09:03:36Z) - Towards Real-Time DNN Inference on Mobile Platforms with Model Pruning and Compiler Optimization [56.3111706960878]
High-end mobile platforms serve as primary computing devices for a wide range of Deep Neural Network (DNN) applications.
However, constrained computation and storage resources on these devices pose significant challenges for real-time inference execution.
We propose a set of hardware-friendly structured model pruning and compiler optimization techniques to accelerate DNN executions on mobile devices.
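A minimal sketch of one hardware-friendly structured pruning step: remove whole convolution filters by L1 norm so the surviving weights stay dense and regular for the compiler. The ratio and scoring rule are illustrative, not the paper's exact method.

```python
# Structured (filter-level) pruning: drop whole output channels, keep a dense tensor.
import numpy as np

def prune_filters(w, keep_ratio=0.5):
    # w: (out_channels, in_channels, kh, kw)
    scores = np.abs(w).sum(axis=(1, 2, 3))                  # L1 norm per output filter
    keep = np.sort(np.argsort(scores)[::-1][: int(len(scores) * keep_ratio)])
    return w[keep], keep                                     # smaller dense tensor + index map

w = np.random.randn(64, 32, 3, 3).astype(np.float32)
pruned, kept = prune_filters(w, keep_ratio=0.5)
print(w.shape, "->", pruned.shape)                           # (64, 32, 3, 3) -> (32, 32, 3, 3)
```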
arXiv Detail & Related papers (2020-04-22T03:18:23Z) - PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in design space.
With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to re-gain and guarantee high hardware efficiency.
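A toy rendering of pattern-based kernel pruning: every 3x3 kernel keeps four weights chosen from a small library of fixed patterns, which a compiler can then specialize code for. The pattern library below is made up for illustration.

```python
# Pattern-based pruning of 3x3 kernels: keep exactly 4 weights per kernel, chosen from
# a small set of fixed patterns (patterns here are illustrative, not PatDNN's).
import numpy as np

PATTERNS = [                               # each pattern = 4 allowed positions in a 3x3 kernel
    [(0, 1), (1, 0), (1, 1), (1, 2)],
    [(1, 0), (1, 1), (1, 2), (2, 1)],
    [(0, 1), (1, 1), (2, 1), (1, 0)],
    [(0, 1), (1, 1), (2, 1), (1, 2)],
]

def apply_best_pattern(kernel):
    best, best_score = None, -1.0
    for pat in PATTERNS:
        score = sum(abs(kernel[r, c]) for r, c in pat)     # magnitude preserved by this pattern
        if score > best_score:
            best, best_score = pat, score
    pruned = np.zeros_like(kernel)
    for r, c in best:
        pruned[r, c] = kernel[r, c]
    return pruned

k = np.random.randn(3, 3)
print(apply_best_pattern(k))       # only 4 of the 9 weights survive, in a regular pattern
```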
arXiv Detail & Related papers (2020-01-01T04:52:07Z)