Resource-Efficient Transformer Architecture: Optimizing Memory and Execution Time for Real-Time Applications
- URL: http://arxiv.org/abs/2501.00042v1
- Date: Wed, 25 Dec 2024 14:41:23 GMT
- Title: Resource-Efficient Transformer Architecture: Optimizing Memory and Execution Time for Real-Time Applications
- Authors: Krisvarish V, Priyadarshini T, K P Abhishek Sri Saai, Vaidehi Vijayakumar
- Abstract summary: This paper describes a memory-efficient transformer model designed to substantially reduce memory usage and execution time while keeping performance close to that of the original model. Results include a 52% reduction in memory usage and a 33% decrease in execution time, yielding better efficiency than state-of-the-art models.
- Score: 0.1874930567916036
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper describes a memory-efficient transformer model designed to substantially reduce memory usage and execution time while keeping performance close to that of the original model. Recently, new transformer architectures have been presented that focus on parameter efficiency and computational optimization; however, such models usually require considerable hardware resources when deployed in real-world applications on edge devices. Our approach addresses this concern by halving the embedding size and applying targeted techniques such as parameter pruning and quantization to optimize the memory footprint with minimal sacrifice in accuracy. Experimental results include a 52% reduction in memory usage and a 33% decrease in execution time, yielding better efficiency than state-of-the-art models. This work compares our model with existing compact architectures such as MobileBERT and DistilBERT and demonstrates its feasibility for resource-friendly deep learning, particularly for real-time and resource-constrained applications.
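The abstract names three concrete levers: a halved embedding size, parameter pruning, and quantization. The sketch below is a rough illustration of those levers using off-the-shelf PyTorch utilities (L1 magnitude pruning and post-training dynamic int8 quantization) on a toy encoder whose embedding width of 384 is half of a 768-dimensional baseline; the model structure, 30% pruning ratio, and layer choices are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only (not the paper's code): shrink the embedding width,
# apply L1 magnitude pruning, then dynamically quantize linear layers to int8.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune


class TinyBlock(nn.Module):
    """A deliberately simple post-norm transformer block."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ff(x))


class TinyEncoder(nn.Module):
    def __init__(self, vocab_size=30522, d_model=384, n_layers=4, n_heads=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # 384: half of a 768-dim baseline
        self.blocks = nn.ModuleList([TinyBlock(d_model, n_heads) for _ in range(n_layers)])

    def forward(self, ids):
        x = self.embed(ids)
        for block in self.blocks:
            x = block(x)
        return x


model = TinyEncoder().eval()

# Parameter pruning: zero out 30% of each linear layer's weights by magnitude.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # bake the zeroed weights in permanently

# Quantization: post-training dynamic int8 for nn.Linear (weights stored as int8,
# activations quantized on the fly at inference time).
model_q = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

with torch.no_grad():
    out = model_q(torch.randint(0, 30522, (1, 128)))
print(out.shape)  # torch.Size([1, 128, 384])
```

The pruning step here is unstructured (individual weights are zeroed), so it mainly reduces stored model size after compression; structured pruning or a smaller architecture would be needed to cut dense compute as well.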
Related papers
- ZeroLM: Data-Free Transformer Architecture Search for Language Models [54.83882149157548]
Current automated proxy discovery approaches suffer from extended search times, susceptibility to data overfitting, and structural complexity.
This paper introduces a novel zero-cost proxy methodology that quantifies model capacity through efficient weight statistics.
Our evaluation demonstrates the superiority of this approach, achieving a Spearman's rho of 0.76 and Kendall's tau of 0.53 on the FlexiBERT benchmark.
arXiv Detail & Related papers (2025-03-24T13:11:22Z) - SegResMamba: An Efficient Architecture for 3D Medical Image Segmentation [2.979183050755201]
We propose an efficient 3D segmentation model for medical imaging called SegResMamba.
Our model uses less than half the memory during training compared to other state-of-the-art (SOTA) architectures.
arXiv Detail & Related papers (2025-03-10T18:40:28Z) - Sparse Gradient Compression for Fine-Tuning Large Language Models [58.44973963468691]
Fine-tuning large language models (LLMs) for downstream tasks has become increasingly crucial due to their widespread use and the growing availability of open-source models.
High memory costs associated with fine-tuning remain a significant challenge, especially as models increase in size.
We propose sparse gradient compression (SGC) to address these limitations.
arXiv Detail & Related papers (2025-02-01T04:18:28Z) - Numerical Pruning for Efficient Autoregressive Models [87.56342118369123]
This paper focuses on compressing decoder-only transformer-based autoregressive models through structural weight pruning. Specifically, we propose a training-free pruning method that calculates a numerical score with Newton's method for the attention and MLP modules, respectively. To verify the effectiveness of our method, we provide both theoretical support and extensive experiments.
arXiv Detail & Related papers (2024-12-17T01:09:23Z) - FluidML: Fast and Memory Efficient Inference Optimization [3.7676096626244986]
We present FluidML, a generic runtime memory management and optimization framework.
We show that FluidML can consistently reduce the end-to-end inference latency by up to 25.38% for popular language models.
We also show that FluidML can reduce peak memory usage by up to 41.47%, compared to state-of-the-art approaches.
arXiv Detail & Related papers (2024-11-14T07:16:23Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.
This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.
We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels (see the sketch after this list).
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking [19.65897437342896]
Vision Transformer (ViT) architectures are becoming increasingly popular and widely employed to tackle computer vision applications.
This paper mathematically defines the strategies used to make Vision Transformer efficient, describes and discusses state-of-the-art methodologies, and analyzes their performances over different application scenarios.
arXiv Detail & Related papers (2023-09-05T08:21:16Z) - TransCODE: Co-design of Transformers and Accelerators for Efficient Training and Inference [6.0093441900032465]
We propose a framework that simulates transformer inference and training on a design space of accelerators.
We use this simulator in conjunction with the proposed co-design technique, called TransCODE, to obtain the best-performing models.
The obtained transformer-accelerator pair achieves 0.3% higher accuracy than the state-of-the-art pair.
arXiv Detail & Related papers (2023-03-27T02:45:18Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
We propose adapter-ALBERT, an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z) - Efficient Fine-Tuning of BERT Models on the Edge [12.768368718187428]
We propose Freeze And Reconfigure (FAR), a memory-efficient training regime for BERT-like models.
FAR reduces fine-tuning time on the DistilBERT model and CoLA dataset by 30%, and time spent on memory operations by 47%.
More broadly, reductions in metric performance on the GLUE and SQuAD datasets are around 1% on average.
arXiv Detail & Related papers (2022-05-03T14:51:53Z) - Data-Driven Offline Optimization For Architecting Hardware Accelerators [89.68870139177785]
We develop a data-driven offline optimization method for designing hardware accelerators, dubbed PRIME.
PRIME improves performance upon state-of-the-art simulation-driven methods by about 1.54x and 1.20x, while considerably reducing the required total simulation time by 93% and 99%, respectively.
In addition, PRIME also architects effective accelerators for unseen applications in a zero-shot setting, outperforming simulation-based methods by 1.26x.
arXiv Detail & Related papers (2021-10-20T17:06:09Z) - Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z)
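The ThinK entry above describes query-driven pruning of KV-cache channels; as referenced there, the following is a minimal sketch of that idea under assumed details (the channel score, the `keep_ratio` parameter, and the `prune_key_channels` helper are illustrative inventions, not the authors' algorithm).

```python
# Hedged sketch of query-driven key-cache channel pruning (not the ThinK code):
# score each key channel by how strongly recent queries use it, keep the top ones.
import torch


def prune_key_channels(keys: torch.Tensor, queries: torch.Tensor, keep_ratio: float = 0.5):
    """keys: (seq_len, d_head) cached keys; queries: (q_len, d_head) recent queries.
    Returns the pruned keys and the indices of the channels that were kept."""
    # Assumed channel importance: per-channel query magnitude times key magnitude.
    scores = queries.abs().mean(dim=0) * keys.abs().mean(dim=0)  # (d_head,)
    k = max(1, int(keep_ratio * keys.shape[-1]))
    kept = torch.topk(scores, k).indices.sort().values
    return keys[:, kept], kept


keys = torch.randn(128, 64)    # toy key cache for one attention head
queries = torch.randn(16, 64)  # recent queries for the same head
pruned_keys, kept = prune_key_channels(keys, queries, keep_ratio=0.5)
print(pruned_keys.shape)       # torch.Size([128, 32])
```

A complete implementation would also have to select the matching query channels when computing attention scores, since queries and keys must share the same channel subset; that bookkeeping is omitted here.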
This list is automatically generated from the titles and abstracts of the papers in this site.