SpeedLimit: Neural Architecture Search for Quantized Transformer Models
- URL: http://arxiv.org/abs/2209.12127v3
- Date: Fri, 13 Oct 2023 17:21:46 GMT
- Title: SpeedLimit: Neural Architecture Search for Quantized Transformer Models
- Authors: Yuji Chai, Luke Bailey, Yunho Jin, Matthew Karle, Glenn G. Ko, David
Brooks, Gu-Yeon Wei, H. T. Kung
- Abstract summary: We introduce SpeedLimit, a novel Neural Architecture Search (NAS) technique that optimizes accuracy whilst adhering to an upper-bound latency constraint.
Our results underline the feasibility and efficacy of seeking an optimal balance between performance and latency, providing new avenues for deploying state-of-the-art transformer models in latency-sensitive environments.
- Score: 6.491305435530359
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While research in the field of transformer models has primarily focused on
enhancing performance metrics such as accuracy and perplexity, practical
applications in industry often necessitate a rigorous consideration of
inference latency constraints. Addressing this challenge, we introduce
SpeedLimit, a novel Neural Architecture Search (NAS) technique that optimizes
accuracy whilst adhering to an upper-bound latency constraint. Our method
incorporates 8-bit integer quantization in the search process to outperform the
current state-of-the-art technique. Our results underline the feasibility and
efficacy of seeking an optimal balance between performance and latency,
providing new avenues for deploying state-of-the-art transformer models in
latency-sensitive environments.
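As a concrete illustration of the core idea, the sketch below runs a toy latency-constrained search: candidate transformer encoders are dynamically quantized to int8, timed, rejected if they exceed the latency bound, and otherwise scored. The search space, the plain random-search strategy, the latency bound, and the proxy accuracy are illustrative assumptions, not SpeedLimit's actual implementation.

```python
import random
import time

import torch
import torch.nn as nn

# Assumed toy search space; SpeedLimit's real space is defined in the paper.
SEARCH_SPACE = {
    "num_layers": [2, 4, 6],
    "d_model": [128, 256, 512],
    "num_heads": [2, 4, 8],
}
LATENCY_BOUND_MS = 20.0  # assumed upper-bound latency constraint

def build_candidate(cfg):
    layer = nn.TransformerEncoderLayer(
        d_model=cfg["d_model"], nhead=cfg["num_heads"], batch_first=True
    )
    return nn.TransformerEncoder(layer, num_layers=cfg["num_layers"])

def int8_latency_ms(model, cfg, seq_len=128, trials=10):
    # Dynamic int8 quantization of the linear layers, then wall-clock timing.
    qmodel = torch.ao.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )
    x = torch.randn(1, seq_len, cfg["d_model"])
    with torch.inference_mode():
        qmodel(x)  # warm-up
        start = time.perf_counter()
        for _ in range(trials):
            qmodel(x)
    return (time.perf_counter() - start) / trials * 1000.0

def proxy_accuracy(cfg):
    # Placeholder: a real search would train (or estimate) each candidate's
    # task accuracy; a random score keeps this sketch self-contained.
    return random.random()

best = None
for _ in range(20):  # plain random search; SpeedLimit's strategy may differ
    cfg = {k: random.choice(v) for k, v in SEARCH_SPACE.items()}
    model = build_candidate(cfg).eval()
    if int8_latency_ms(model, cfg) > LATENCY_BOUND_MS:
        continue  # reject candidates that violate the latency bound
    score = proxy_accuracy(cfg)
    if best is None or score > best[0]:
        best = (score, cfg)
print("best config under the latency bound:", best)
```

Measuring latency on the quantized model, rather than the float one, is the point of putting quantization inside the loop: the constraint is checked against the artifact that would actually be deployed.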
Related papers
- Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses the resource constraints of IoVT devices by shifting data analysis to the edge.
Existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z) - QIANets: Quantum-Integrated Adaptive Networks for Reduced Latency and Improved Inference Times in CNN Models [2.6663666678221376]
Convolutional neural networks (CNNs) have made significant advances in computer vision tasks, yet their high inference times and latency limit real-world applicability.
We introduce QIANets: a novel approach that redesigns the traditional GoogLeNet, DenseNet, and ResNet-18 architectures to process more parameters and computations whilst maintaining low inference times.
Despite experimental limitations, the method was tested and evaluated, demonstrating reductions in inference times while effectively preserving accuracy.
arXiv Detail & Related papers (2024-10-14T09:24:48Z) - PNAS-MOT: Multi-Modal Object Tracking with Pareto Neural Architecture Search [64.28335667655129]
Multiple object tracking is a critical task in autonomous driving.
As tracking accuracy improves, neural networks become increasingly complex, posing challenges for practical application in real driving scenarios due to their high latency.
In this paper, we explore neural architecture search (NAS) methods to find efficient architectures for tracking, aiming for low real-time latency while maintaining relatively high accuracy.
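As a minimal illustration of the Pareto-based selection such a search relies on, the sketch below keeps only the (latency, accuracy) points that no other candidate dominates; the numbers are invented for the example.

```python
# Pareto-front selection over (latency_ms, accuracy) pairs, the trade-off a
# Pareto NAS navigates. Candidate values below are made up for illustration.
candidates = [(12.0, 0.71), (15.0, 0.74), (18.0, 0.73), (25.0, 0.80), (30.0, 0.79)]

def pareto_front(points):
    # Keep points that no other point dominates, i.e. no other point has
    # latency at most as high AND accuracy at least as high.
    front = []
    for p in points:
        dominated = any(q[0] <= p[0] and q[1] >= p[1] and q != p for q in points)
        if not dominated:
            front.append(p)
    return front

print(pareto_front(candidates))  # [(12.0, 0.71), (15.0, 0.74), (25.0, 0.80)]
```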
arXiv Detail & Related papers (2024-03-23T04:18:49Z) - DOCTOR: Dynamic On-Chip Temporal Variation Remediation Toward Self-Corrected Photonic Tensor Accelerators [5.873308516576125]
Photonic tensor accelerators offer unparalleled speed and energy efficiency.
Off-chip noise-aware training and on-chip training have been proposed to enhance the variation tolerance of optical neural accelerators.
We propose a lightweight dynamic on-chip framework, dubbed DOCTOR, providing adaptive, in-situ accuracy recovery against temporally drifting noise.
arXiv Detail & Related papers (2024-03-05T06:17:13Z) - Accelerating Deep Neural Networks via Semi-Structured Activation
Sparsity [0.0]
Exploiting sparsity in the network's feature maps is one of the ways to reduce its inference latency.
We propose a solution to induce semi-structured activation sparsity exploitable through minor runtime modifications.
Our approach yields a speed improvement of $1.25\times$ with a minimal accuracy drop of $1.1\%$ for the ResNet18 model on the ImageNet dataset.
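To make the notion of semi-structured activation sparsity concrete, here is a hedged sketch of a 2:4 (N:M) pattern applied to an activation tensor; the paper's technique for inducing such sparsity during training is more involved than this post-hoc masking.

```python
import torch

def nm_sparsify(x, n=2, m=4):
    # Within every group of m consecutive values, keep the n largest
    # magnitudes and zero the rest, yielding a hardware-friendly pattern.
    assert x.numel() % m == 0, "tensor size must be divisible by the group size"
    groups = x.reshape(-1, m)
    keep = groups.abs().topk(n, dim=1).indices
    mask = torch.zeros_like(groups).scatter_(1, keep, 1.0)
    return (groups * mask).reshape(x.shape)

x = torch.randn(2, 8)
print(nm_sparsify(x))  # exactly 2 nonzeros in every group of 4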
arXiv Detail & Related papers (2023-09-12T22:28:53Z) - Towards Long-Term Time-Series Forecasting: Feature, Pattern, and
Distribution [57.71199089609161]
Long-term time-series forecasting (LTTF) has become a pressing demand in many applications, such as wind power supply planning.
Transformer models have been adopted for their high prediction capacity, although the self-attention mechanism is computationally expensive.
We propose an efficient Transformer-based model, named Conformer, which differentiates itself from existing methods for LTTF in three aspects.
arXiv Detail & Related papers (2023-01-05T13:59:29Z) - Neural Networks with Quantization Constraints [111.42313650830248]
We present a constrained learning approach to quantization training.
We show that the resulting problem is strongly dual and does away with gradient estimations.
We demonstrate that the proposed approach exhibits competitive performance in image classification tasks.
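A rough picture of constrained learning for quantization can be sketched as a primal-dual loop: minimize the task loss subject to the weights staying near an int8 grid, with dual ascent on a Lagrange multiplier. The tolerance, step size, and toy model below are assumptions; the paper's exact formulation and its strong-duality argument are not reproduced here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
lam, eps, dual_lr = 0.0, 1e-3, 0.5
scale = 0.05  # assumed fixed int8 quantization step

def quantization_gap(m):
    # Mean squared distance between weights and the nearest int8 grid point.
    gaps = [((p - (p / scale).round().clamp(-128, 127) * scale) ** 2).mean()
            for p in m.parameters()]
    return torch.stack(gaps).mean()

x, y = torch.randn(64, 8), torch.randn(64, 1)
for _ in range(100):
    loss = nn.functional.mse_loss(model(x), y)
    gap = quantization_gap(model)
    opt.zero_grad()
    (loss + lam * (gap - eps)).backward()  # Lagrangian primal step
    opt.step()
    lam = max(0.0, lam + dual_lr * (gap.item() - eps))  # dual ascent step
print("final gap:", quantization_gap(model).item(), "multiplier:", lam)
```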
arXiv Detail & Related papers (2022-10-27T17:12:48Z) - FreeREA: Training-Free Evolution-based Architecture Search [17.202375422110553]
FreeREA is a custom cell-based evolution NAS algorithm that exploits an optimised combination of training-free metrics to rank architectures.
Our experiments, carried out on the common benchmarks NAS-Bench-101 and NATS-Bench, demonstrate that FreeREA is a fast, efficient, and effective search method for the automatic design of models.
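For flavor, the sketch below ranks two toy candidates with a single training-free proxy (SynFlow-style saliency); FreeREA itself combines several such metrics inside an evolutionary search, so this is only the general recipe.

```python
import torch
import torch.nn as nn

def synflow(model, input_shape):
    # SynFlow-style score: take absolute weights, feed an all-ones input,
    # and sum |grad * weight| over parameters. Destructive to the weights;
    # a real implementation would save and restore them.
    for p in model.parameters():
        p.data = p.data.abs()
    x = torch.ones(1, *input_shape)
    model.zero_grad()
    model(x).sum().backward()
    return sum((p.grad * p).abs().sum().item() for p in model.parameters())

candidates = {
    "small": nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2)),
    "wide": nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2)),
}
ranking = sorted(candidates, key=lambda n: synflow(candidates[n], (8,)), reverse=True)
print("training-free ranking:", ranking)
```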
arXiv Detail & Related papers (2022-06-17T11:16:28Z) - Architecture Aware Latency Constrained Sparse Neural Networks [35.50683537052815]
In this paper, we design an architecture aware latency constrained sparse framework to prune and accelerate CNN models.
We also propose a novel sparse convolution algorithm for efficient computation.
Our system-algorithm co-design framework achieves a much better frontier between network accuracy and latency on resource-constrained mobile devices.
arXiv Detail & Related papers (2021-09-01T03:41:31Z) - Amortized Auto-Tuning: Cost-Efficient Transfer Optimization for
Hyperparameter Recommendation [83.85021205445662]
We propose amortized auto-tuning (AT2), an instantiation of this idea, to speed up the tuning of machine learning models.
We conduct a thorough analysis of the multi-task multi-fidelity Bayesian optimization framework, which leads to AT2 as its best instantiation.
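The multi-fidelity idea can be sketched without the Bayesian surrogate: evaluate many configurations cheaply, keep the best half, and re-evaluate survivors at a higher budget (successive halving). Everything below, including the toy objective, is an illustrative assumption rather than AT2's algorithm.

```python
import random

random.seed(0)

def evaluate(config, budget):
    # Hypothetical objective: validation score improves with budget and
    # peaks near lr = 0.1, with a little noise.
    return -((config["lr"] - 0.1) ** 2) + 0.01 * budget + random.gauss(0, 0.01)

configs = [{"lr": random.uniform(0.001, 0.5)} for _ in range(16)]
budget = 1
while len(configs) > 1:
    scored = sorted(configs, key=lambda c: evaluate(c, budget), reverse=True)
    configs = scored[: len(scored) // 2]  # keep the top half of candidates
    budget *= 2                           # raise the fidelity for survivors
print("selected config:", configs[0])
```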
arXiv Detail & Related papers (2021-06-17T00:01:18Z) - Ps and Qs: Quantization-aware pruning for efficient low latency neural
network inference [56.24109486973292]
We study the interplay between pruning and quantization during the training of neural networks for ultra low latency applications.
We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task.
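A minimal sketch of combining the two compressions during training: fix a magnitude-based sparsity mask, then train with straight-through fake-quantized weights so the network adapts to sparsity and low precision jointly. This is an assumed recipe for illustration, not the paper's exact method.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
w = torch.randn(4, 16, requires_grad=True)
b = torch.zeros(4, requires_grad=True)

# Magnitude pruning: zero out roughly the 50% smallest-magnitude weights.
threshold = w.detach().abs().flatten().median()
mask = (w.detach().abs() >= threshold).float()

def fake_quant_ste(t, bits=8):
    # Uniform symmetric fake quantization with a straight-through estimator:
    # forward uses the quantized values, backward sees the identity.
    qmax = 2 ** (bits - 1) - 1
    scale = t.detach().abs().max().clamp(min=1e-8) / qmax
    q = (t / scale).round().clamp(-qmax, qmax) * scale
    return t + (q - t).detach()

opt = torch.optim.SGD([w, b], lr=0.1)
x, y = torch.randn(32, 16), torch.randn(32, 4)
for _ in range(100):
    out = nn.functional.linear(x, fake_quant_ste(w * mask), b)
    loss = nn.functional.mse_loss(out, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
print("sparsity:", (mask == 0).float().mean().item(), "loss:", loss.item())
```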
arXiv Detail & Related papers (2021-02-22T19:00:05Z)