Exploring the Performance and Efficiency of Transformer Models for NLP on Mobile Devices
- URL: http://arxiv.org/abs/2306.11426v1
- Date: Tue, 20 Jun 2023 10:15:01 GMT
- Title: Exploring the Performance and Efficiency of Transformer Models for NLP on Mobile Devices
- Authors: Ioannis Panopoulos, Sokratis Nikolaidis, Stylianos I. Venieris, Iakovos S. Venieris
- Abstract summary: New deep neural network (DNN) architectures and approaches are emerging every few years, driving the field's advancement.
Transformers are a relatively new model family that has achieved new levels of accuracy across AI tasks, but poses significant computational challenges.
This work aims to take steps towards bridging this gap by examining the current state of Transformers' on-device execution.
- Score: 3.809702129519641
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning (DL) is characterised by its dynamic nature, with new deep
neural network (DNN) architectures and approaches emerging every few years,
driving the field's advancement. At the same time, the ever-increasing use of
mobile devices (MDs) has resulted in a surge of DNN-based mobile applications.
Although traditional architectures, like CNNs and RNNs, have been successfully
integrated into MDs, this is not the case for Transformers, a relatively new
model family that has achieved new levels of accuracy across AI tasks, but
poses significant computational challenges. In this work, we aim to take steps
towards bridging this gap by examining the current state of Transformers'
on-device execution. To this end, we construct a benchmark of representative
models and thoroughly evaluate their performance across MDs with different
computational capabilities. Our experimental results show that Transformers are
not accelerator-friendly and indicate the need for software and hardware
optimisations to achieve efficient deployment.
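As a rough illustration of the kind of on-device measurement the abstract describes, the sketch below times single-batch inference of a small Transformer encoder in PyTorch. It is a minimal sketch only: the benchmark models, inference frameworks, and devices actually used in the paper are not specified in this summary, and the stand-in torch.nn.TransformerEncoder, the layer dimensions, and the repetition counts are illustrative assumptions.
```python
# Minimal sketch: measuring single-batch inference latency of a Transformer
# encoder, as one might do when benchmarking on-device execution.
# The model, sequence length, and repetition counts are illustrative choices,
# not the configuration used in the paper.
import time
import torch
import torch.nn as nn

def build_model(d_model=256, n_heads=4, n_layers=4):
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                       batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=n_layers).eval()

@torch.no_grad()
def measure_latency(model, seq_len=128, d_model=256, warmup=10, runs=50):
    x = torch.randn(1, seq_len, d_model)          # a batch of one "sentence"
    for _ in range(warmup):                       # warm-up runs, not timed
        model(x)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        model(x)
        timings.append((time.perf_counter() - start) * 1000.0)  # milliseconds
    t = torch.tensor(timings)
    return t.mean().item(), t.std().item()

if __name__ == "__main__":
    model = build_model()
    mean_ms, std_ms = measure_latency(model)
    print(f"latency: {mean_ms:.2f} ms +/- {std_ms:.2f} ms")
```
On an actual mobile target, the same measurement loop would typically run through a mobile inference runtime rather than desktop PyTorch, and would be repeated across devices, back-ends, and model sizes, in the spirit of the cross-device evaluation described above.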
Related papers
- Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge.
Existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z)
- Auto-Train-Once: Controller Network Guided Automatic Network Pruning from Scratch [72.26822499434446]
Auto-Train-Once (ATO) is an innovative network pruning algorithm designed to automatically reduce the computational and storage costs of DNNs.
We provide a comprehensive convergence analysis as well as extensive experiments, and the results show that our approach achieves state-of-the-art performance across various model architectures.
arXiv Detail & Related papers (2024-03-21T02:33:37Z)
- TransAxx: Efficient Transformers with Approximate Computing [4.347898144642257]
Vision Transformer (ViT) models have proven to be very competitive and have become a popular alternative to Convolutional Neural Networks (CNNs).
We propose TransAxx, a framework based on the popular PyTorch library that enables fast inherent support for approximate arithmetic.
Our approach uses a Monte Carlo Tree Search (MCTS) algorithm to efficiently search the space of possible configurations.
arXiv Detail & Related papers (2024-02-12T10:16:05Z)
- AutoST: Training-free Neural Architecture Search for Spiking Transformers [14.791412391584064]
Spiking Transformers achieve both the energy efficiency of Spiking Neural Networks (SNNs) and the high capacity of Transformers.
Existing Spiking Transformer architectures exhibit a notable architectural gap, resulting in suboptimal performance.
We introduce AutoST, a training-free NAS method for Spiking Transformers, to rapidly identify high-performance Spiking Transformer architectures.
arXiv Detail & Related papers (2023-07-01T10:19:52Z)
- Optimizing Non-Autoregressive Transformers with Contrastive Learning [74.46714706658517]
Non-autoregressive Transformers (NATs) reduce the inference latency of Autoregressive Transformers (ATs) by predicting words all at once rather than in sequential order (a toy decoding sketch illustrating this difference appears after this list).
In this paper, we propose to ease the difficulty of modality learning via sampling from the model distribution instead of the data distribution.
arXiv Detail & Related papers (2023-05-23T04:20:13Z)
- RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs (a simplified recurrence sketch illustrating this style of inference appears after this list).
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
- Full Stack Optimization of Transformer Inference: a Survey [58.55475772110702]
Transformer models achieve superior accuracy across a wide range of applications.
The amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate.
There has been an increased focus on making Transformer models more efficient.
arXiv Detail & Related papers (2023-02-27T18:18:13Z)
- Lightweight Transformers for Human Activity Recognition on Mobile Devices [0.5505634045241288]
Human Activity Recognition (HAR) on mobile devices has been shown to be achievable with lightweight neural models.
We present Human Activity Recognition Transformer (HART), a lightweight, sensor-wise transformer architecture.
Our experiments on HAR tasks with several publicly available datasets show that HART uses fewer Floating-point Operations Per Second (FLOPS) and parameters while outperforming current state-of-the-art results.
arXiv Detail & Related papers (2022-09-22T09:42:08Z)
- Efficient Sparsely Activated Transformers [0.34410212782758054]
Transformer-based neural networks have achieved state-of-the-art task performance in a number of machine learning domains.
Recent work has explored the integration of dynamic behavior into these networks in the form of mixture-of-expert layers (a generic sketch of such a layer appears after this list).
We introduce a novel system named PLANER that takes an existing Transformer-based network and a user-defined latency target.
arXiv Detail & Related papers (2022-08-31T00:44:27Z)
- Exploring Transformers for Behavioural Biometrics: A Case Study in Gait Recognition [0.7874708385247353]
This article intends to explore and propose novel gait biometric recognition systems based on Transformers.
Several state-of-the-art architectures (Vanilla, Informer, Autoformer, Block-Recurrent Transformer, and THAT) are considered in the experimental framework.
Experiments are carried out using the two popular public databases whuGAIT and OU-ISIR.
arXiv Detail & Related papers (2022-06-03T08:08:40Z)
- EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z)
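The "Optimizing Non-Autoregressive Transformers" entry above contrasts autoregressive and non-autoregressive decoding. The toy sketch below shows only that control-flow difference: the decoders are hypothetical stand-ins that return random logits, and nothing here reflects the contrastive-learning method proposed in that paper.
```python
# Minimal sketch contrasting autoregressive (AT) and non-autoregressive (NAT)
# decoding. The toy "decoders" below are stand-ins that return random logits;
# only the decoding control flow is the point of the example.
import torch
import torch.nn as nn

VOCAB, MAX_LEN = 1000, 16

class ToyARDecoder(nn.Module):
    """Predicts the next token given the prefix (one forward pass per step)."""
    def forward(self, prefix):                    # prefix: (batch, t)
        return torch.randn(prefix.size(0), VOCAB) # logits for position t

class ToyNATDecoder(nn.Module):
    """Predicts all target positions in a single forward pass."""
    def forward(self, length):
        return torch.randn(1, length, VOCAB)      # logits for every position

@torch.no_grad()
def decode_autoregressive(model, bos_id=1, steps=MAX_LEN):
    tokens = torch.tensor([[bos_id]])
    for _ in range(steps):                        # one forward pass per token
        next_id = model(tokens).argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_id], dim=1)
    return tokens

@torch.no_grad()
def decode_non_autoregressive(model, length=MAX_LEN):
    return model(length).argmax(dim=-1)           # one forward pass in total

print(decode_autoregressive(ToyARDecoder()).shape)       # (1, MAX_LEN + 1)
print(decode_non_autoregressive(ToyNATDecoder()).shape)  # (1, MAX_LEN)
```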
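The RWKV entry above describes pairing Transformer-style parallel training with RNN-style inference. One generic way to picture constant-cost recurrent inference is a linear-attention recurrence that keeps a fixed-size running state instead of a growing key/value cache. The sketch below is a simplified illustration of that general idea under assumed decay and feature-map choices; it is not RWKV's actual WKV formulation.
```python
# Simplified linear-attention recurrence: the running state is a fixed-size
# (d x d) matrix plus a (d,) normaliser, so per-token inference cost and
# memory stay constant regardless of sequence length. Illustration of the
# general idea behind RNN-style Transformer inference, not RWKV's exact update.
import torch

def recurrent_step(state, norm, k, v, decay):
    """Fold one token's key/value pair into the running summary."""
    state = decay * state + torch.outer(k, v)   # accumulated outer products
    norm = decay * norm + k                     # accumulated keys
    return state, norm

def recurrent_readout(state, norm, q, eps=1e-6):
    """Attention-like readout for the current query from the running state."""
    return (q @ state) / (q @ norm + eps)       # (d,) output vector

d, decay = 8, 0.9
state, norm = torch.zeros(d, d), torch.zeros(d)
for t in range(32):                             # token-by-token inference
    q, k, v = (torch.randn(d).softmax(-1) for _ in range(3))
    state, norm = recurrent_step(state, norm, k, v, decay)
    out = recurrent_readout(state, norm, q)
print(out.shape)                                # torch.Size([8])
```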
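The "Efficient Sparsely Activated Transformers" entry above refers to mixture-of-experts layers. The sketch below is a generic top-k routed MoE feed-forward block, written to show the routing idea only; the layer sizes and routing choices are assumptions, and this is not the PLANER system described in that paper.
```python
# Generic top-k mixture-of-experts feed-forward layer: a router scores each
# token, only the top-k experts are evaluated per token, and their outputs are
# combined with renormalised router weights. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=128, num_experts=4, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                          # x: (tokens, d_model)
        scores = self.router(x)                    # (tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)   # renormalise chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                 # accumulate chosen experts
            idx = topk_idx[:, slot]
            w = weights[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e                    # tokens routed to expert e
                if mask.any():
                    out[mask] += w[mask] * expert(x[mask])
        return out

tokens = torch.randn(10, 64)
print(TopKMoE()(tokens).shape)                     # torch.Size([10, 64])
```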
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.