Related papers: Efficient Sparsely Activated Transformers

URL: http://arxiv.org/abs/2208.14580v1
Date: Wed, 31 Aug 2022 00:44:27 GMT
Title: Efficient Sparsely Activated Transformers
Authors: Salar Latifi, Saurav Muralidharan, Michael Garland
Abstract summary: Transformer-based neural networks have achieved state-of-the-art task performance in a number of machine learning domains. Recent work has explored the integration of dynamic behavior into these networks in the form of mixture-of-expert layers. We introduce a novel system named PLANER that takes an existing Transformer-based network and a user-defined latency target.
Score: 0.34410212782758054
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transformer-based neural networks have achieved state-of-the-art task performance in a number of machine learning domains including natural language processing and computer vision. To further improve their accuracy, recent work has explored the integration of dynamic behavior into these networks in the form of mixture-of-expert (MoE) layers. In this paper, we explore the introduction of MoE layers to optimize a different metric: inference latency. We introduce a novel system named PLANER that takes an existing Transformer-based network and a user-defined latency target and produces an optimized, sparsely-activated version of the original network that tries to meet the latency target while maintaining baseline accuracy. We evaluate PLANER on two real-world language modeling tasks using the Transformer-XL network and achieve inference latency reductions of over 2x at iso-accuracy.

Related papers

CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks [57.95170323315603]
We introduce CollaPipe, a distributed learning framework that integrates collaborative pipeline parallelism with federated aggregation to support self-evolving networks. In CollaPipe, the encoder part is adaptively partitioned into variable-sized segments and deployed across mobile devices for pipeline-parallel training, while the decoder is deployed on edge servers to handle generative tasks. To enhance training efficiency, we formulate a joint optimization problem that adaptively allocates model segments, micro-batches, bandwidth, and transmission power.
arXiv Detail & Related papers (2025-09-24T07:54:01Z)
BHViT: Binarized Hybrid Vision Transformer [53.38894971164072]
Model binarization has made significant progress in enabling real-time and energy-efficient computation for convolutional neural networks (CNNs). We propose BHViT, a binarization-friendly hybrid ViT architecture, and its fully binarized model, guided by three important observations. Our proposed algorithm achieves SOTA performance among binary ViT methods.
arXiv Detail & Related papers (2025-03-04T08:35:01Z)
Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge. Existing methods struggle to balance high model performance with low resource consumption. We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z)
Unifying Dimensions: A Linear Adaptive Approach to Lightweight Image Super-Resolution [6.857919231112562]
Window-based transformers have demonstrated outstanding performance in super-resolution tasks. They exhibit higher computational complexity and inference latency than convolutional neural networks. We construct a convolution-based Transformer framework named the linear adaptive mixer network (LAMNet).
arXiv Detail & Related papers (2024-09-26T07:24:09Z)
ELGC-Net: Efficient Local-Global Context Aggregation for Remote Sensing Change Detection [65.59969454655996]
We propose an efficient change detection framework, ELGC-Net, which leverages rich contextual information to precisely estimate change regions. Our proposed ELGC-Net sets a new state-of-the-art performance in remote sensing change detection benchmarks. We also introduce ELGC-Net-LW, a lighter variant with significantly reduced computational complexity, suitable for resource-constrained settings.
arXiv Detail & Related papers (2024-03-26T17:46:25Z)
Device Sampling and Resource Optimization for Federated Learning in Cooperative Edge Networks [17.637761046608]
Federated learning (FedL) distributes machine learning (ML) across worker devices by having them train local models that are periodically aggregated by a server. FedL ignores two important characteristics of contemporary wireless networks: (i) the network may contain heterogeneous communication/computation resources, and (ii) there may be significant overlaps in devices' local data distributions. We develop a novel optimization methodology that jointly accounts for these factors via intelligent device sampling complemented by device-to-device (D2D) offloading.
arXiv Detail & Related papers (2023-11-07T21:17:59Z)
Accelerating Deep Neural Networks via Semi-Structured Activation Sparsity [0.0]
Exploiting sparsity in the network's feature maps is one of the ways to reduce its inference latency. We propose a solution to induce semi-structured activation sparsity exploitable through minor runtime modifications. Our approach yields a speed improvement of $1.25\times$ with a minimal accuracy drop of $1.1\%$ for the ResNet18 model on the ImageNet dataset.
arXiv Detail & Related papers (2023-09-12T22:28:53Z)
Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction [126.34551436845133]
CNNs and Transformers have their own advantages, and both have been widely used for dense prediction in multi-task learning (MTL). We present a novel MTL model that combines the merits of deformable CNNs and query-based Transformers with shared gating for multi-task learning of dense prediction.
arXiv Detail & Related papers (2023-08-10T17:37:49Z)
Exploring the Performance and Efficiency of Transformer Models for NLP on Mobile Devices [3.809702129519641]
New deep neural network (DNN) architectures and approaches are emerging every few years, driving the field's advancement. Transformers are a relatively new model family that has achieved new levels of accuracy across AI tasks, but poses significant computational challenges. This work aims to make steps towards bridging this gap by examining the current state of Transformers' on-device execution.
arXiv Detail & Related papers (2023-06-20T10:15:01Z)
Multi-Exit Semantic Segmentation Networks [78.44441236864057]
We propose a framework for converting state-of-the-art segmentation models to MESS networks: specially trained CNNs that employ parametrised early exits along their depth to save computation during inference on easier samples. We co-optimise the number, placement and architecture of the attached segmentation heads, along with the exit policy, to adapt to the device capabilities and application-specific requirements.
arXiv Detail & Related papers (2021-06-07T11:37:03Z)
Dynamic Slimmable Network [105.74546828182834]
We develop a dynamic network slimming regime named Dynamic Slimmable Network (DS-Net). Our DS-Net is empowered with the ability of dynamic inference by the proposed double-headed dynamic gate. It consistently outperforms its static counterparts as well as state-of-the-art static and dynamic model compression methods.
arXiv Detail & Related papers (2021-03-24T15:25:20Z)
Device Sampling for Heterogeneous Federated Learning: Theory, Algorithms, and Implementation [24.084053136210027]
We develop a sampling methodology based on graph sequential convolutional networks (GCNs). We find that our methodology, while sampling less than 5% of all devices, substantially outperforms conventional federated learning (FedL) in terms of both trained model accuracy and required resource utilization.
arXiv Detail & Related papers (2021-01-04T05:59:50Z)
An Image Enhancing Pattern-based Sparsity for Real-time Inference on Mobile Devices [58.62801151916888]
We introduce a new sparsity dimension, pattern-based sparsity, which comprises pattern and connectivity sparsity and is both highly accurate and hardware friendly. Our pattern-based sparsity approach fits naturally into compiler optimization for highly efficient DNN execution on mobile platforms.
arXiv Detail & Related papers (2020-01-20T16:17:36Z)

This list is automatically generated from the titles and abstracts of the papers on this site.