Energy-efficient Task Adaptation for NLP Edge Inference Leveraging
Heterogeneous Memory Architectures
- URL: http://arxiv.org/abs/2303.16100v2
- Date: Wed, 12 Apr 2023 20:05:43 GMT
- Title: Energy-efficient Task Adaptation for NLP Edge Inference Leveraging
Heterogeneous Memory Architectures
- Authors: Zirui Fu, Aleksandre Avaliani, Marco Donato
- Abstract summary: adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
- Score: 68.91874045918112
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Executing machine learning inference tasks on resource-constrained edge
devices requires careful hardware-software co-design optimizations. Recent
examples have shown how transformer-based deep neural network models such as
ALBERT can be used to enable the execution of natural language processing (NLP)
inference on mobile systems-on-chip housing custom hardware accelerators.
However, while these existing solutions are effective in alleviating the
latency, energy, and area costs of running single NLP tasks, achieving
multi-task inference requires running computations over multiple variants of
the model parameters, which are tailored to each of the targeted tasks. This
approach leads to either prohibitive on-chip memory requirements or paying the
cost of off-chip memory access. This paper proposes adapter-ALBERT, an
efficient model optimization for maximal data reuse across different tasks. The
proposed model's performance and robustness to data compression methods are
evaluated across several language tasks from the GLUE benchmark. Additionally,
we demonstrate the advantage of mapping the model to a heterogeneous on-chip
memory architecture by performing simulations on a validated NLP edge
accelerator to extrapolate performance, power, and area improvements over the
execution of a traditional ALBERT model on the same hardware platform.
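The abstract gives no implementation details, but the cross-task data-reuse idea can be illustrated with a standard bottleneck-adapter sketch: the large ALBERT backbone is frozen and stored once, and each task contributes only a small adapter and classifier head, so switching tasks swaps a few small tensors instead of reloading a full set of model parameters. Everything below (class names, bottleneck size, pooling choice, the stand-in backbone) is an assumption for illustration, not the authors' adapter-ALBERT implementation.

```python
import torch
import torch.nn as nn


class AdapterBlock(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, plus a residual connection."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class MultiTaskAdapterModel(nn.Module):
    """Frozen shared encoder plus per-task adapters and classification heads."""

    def __init__(self, backbone: nn.Module, hidden_dim: int, task_num_labels: dict):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # the shared weights are reused verbatim across all tasks
        self.adapters = nn.ModuleDict({t: AdapterBlock(hidden_dim) for t in task_num_labels})
        self.heads = nn.ModuleDict({t: nn.Linear(hidden_dim, n) for t, n in task_num_labels.items()})

    def forward(self, hidden_states: torch.Tensor, task: str) -> torch.Tensor:
        h = self.backbone(hidden_states)    # shared computation, identical for every task
        h = self.adapters[task](h)          # small task-specific correction
        return self.heads[task](h.mean(dim=1))  # pool over the sequence and classify


if __name__ == "__main__":
    hidden = 768
    # Stand-in for the frozen ALBERT encoder (a single shared feed-forward layer here).
    backbone = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU())
    tasks = {"sst2": 2, "mnli": 3, "qqp": 2}  # a few GLUE tasks, as in the paper's evaluation
    model = MultiTaskAdapterModel(backbone, hidden, tasks)

    x = torch.randn(1, 16, hidden)  # (batch, sequence length, hidden size) toy features
    for t in tasks:
        print(t, model(x, t).shape)

    shared = sum(p.numel() for p in backbone.parameters())
    per_task = sum(p.numel() for p in model.adapters["sst2"].parameters()) + \
               sum(p.numel() for p in model.heads["sst2"].parameters())
    print(f"shared parameters: {shared}, extra parameters per task: {per_task}")
```

On a heterogeneous on-chip memory system, the large frozen tensors could be pinned in the dense region while the small per-task tensors live in a fast buffer; the printed parameter counts give a rough sense of that split. This is only a reading of the abstract, not a claim about the paper's actual memory mapping.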
Related papers
- AsCAN: Asymmetric Convolution-Attention Networks for Efficient Recognition and Generation [48.82264764771652]
We introduce AsCAN -- a hybrid architecture, combining both convolutional and transformer blocks.
AsCAN supports a variety of tasks: recognition, segmentation, and class-conditional image generation.
We then scale the same architecture to solve a large-scale text-to-image task and show state-of-the-art performance.
arXiv Detail & Related papers (2024-11-07T18:43:17Z)
- Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses the resource constraints of IoVT devices by shifting data analysis to the edge.
Existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z)
- POMONAG: Pareto-Optimal Many-Objective Neural Architecture Generator [4.09225917049674]
Transferable NAS has emerged, generalizing the search process from dataset-dependent to task-dependent.
This paper introduces POMONAG, extending DiffusionNAG via a many-objective diffusion process.
Results were validated on two search spaces -- NASBench201 and MobileNetV3 -- and evaluated across 15 image classification datasets.
arXiv Detail & Related papers (2024-09-30T16:05:29Z)
- Resource Management for Low-latency Cooperative Fine-tuning of Foundation Models at the Network Edge [35.40849522296486]
Large-scale foundation models (FoMos) can exhibit human-like intelligence.
FoMos need to be adapted to specialized downstream tasks through fine-tuning techniques.
We advocate multi-device cooperation within the device-edge cooperative fine-tuning paradigm.
arXiv Detail & Related papers (2024-07-13T12:47:14Z)
- Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach [58.57026686186709]
We introduce the Convolutional Transformer layer (ConvFormer) and propose a ConvFormer-based Super-Resolution network (CFSR).
CFSR inherits the advantages of both convolution-based and transformer-based approaches.
Experiments demonstrate that CFSR strikes an optimal balance between computational cost and performance.
arXiv Detail & Related papers (2024-01-11T03:08:00Z)
- A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical Computation Offloading [62.34538208323411]
We propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs); a minimal sketch of this layout appears after the list.
MEMTL outperforms benchmark methods in both the inference accuracy and mean square error without requiring additional training data.
arXiv Detail & Related papers (2023-09-02T11:01:16Z)
- Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs [66.30706841821123]
Large language models (LLMs) power many state-of-the-art systems in natural language processing.
LLMs are extremely computationally expensive, even at inference time.
We propose a new metric for comparing inference efficiency across models.
arXiv Detail & Related papers (2023-05-03T21:51:42Z)
- Attention-Based Model and Deep Reinforcement Learning for Distribution of Event Processing Tasks [0.0]
Event processing is a cornerstone of the dynamic and responsive Internet of Things (IoT).
This article investigates the use of deep learning to fairly distribute the tasks.
An attention-based neural network model is proposed to generate efficient load balancing solutions.
arXiv Detail & Related papers (2021-12-07T17:16:35Z)
- ExPAN(N)D: Exploring Posits for Efficient Artificial Neural Network Design in FPGA-based Systems [4.2612881037640085]
This paper analyzes and compares the efficacy of the Posit number representation scheme and the efficiency of fixed-point arithmetic implementations for ANNs.
We propose a novel Posit to fixed-point converter for enabling high-performance and energy-efficient hardware implementations for ANNs.
arXiv Detail & Related papers (2020-10-24T11:02:25Z)
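Several of the related papers above also share computation across tasks; the MEMTL entry, for instance, describes a shared backbone feeding multiple prediction heads whose outputs are combined at inference time. A minimal sketch of that layout follows, using the same illustrative style as the adapter example above; the class names, head count, and averaging rule are assumptions, not the MEMTL paper's code.

```python
import torch
import torch.nn as nn


class MultiHeadEnsemble(nn.Module):
    """Shared backbone with several prediction heads; outputs are ensembled by averaging."""

    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int, num_heads: int = 3):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(hidden_dim, out_dim) for _ in range(num_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.backbone(x)  # computed once, shared by all heads
        return torch.stack([head(h) for head in self.heads]).mean(dim=0)  # ensemble by averaging


# Toy usage: an offloading decision over 4 candidate targets from a 10-dimensional system state.
model = MultiHeadEnsemble(in_dim=10, hidden_dim=32, out_dim=4)
decision = model(torch.randn(1, 10)).argmax(dim=-1)
print(decision)
```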
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.