Vector-Vector-Matrix Architecture: A Novel Hardware-Aware Framework for
Low-Latency Inference in NLP Applications
- URL: http://arxiv.org/abs/2010.08412v1
- Date: Tue, 6 Oct 2020 16:54:08 GMT
- Title: Vector-Vector-Matrix Architecture: A Novel Hardware-Aware Framework for
Low-Latency Inference in NLP Applications
- Authors: Matthew Khoury and Rumen Dangovski and Longwu Ou and Preslav Nakov and
Yichen Shen and Li Jing
- Abstract summary: Deep neural networks have become the standard approach to building reliable Natural Language Processing (NLP) applications.
We propose a novel vector-vector-matrix architecture (VVMA) which greatly reduces the latency at inference time for NMT.
We present empirical results suggesting that our framework can reduce the latency of sequence-to-sequence and Transformer models used for NMT by a factor of four.
- Score: 23.37992621844846
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural networks have become the standard approach to building reliable
Natural Language Processing (NLP) applications, ranging from Neural Machine
Translation (NMT) to dialogue systems. However, improving accuracy by
increasing the model size requires a large number of hardware computations,
which can slow down NLP applications significantly at inference time. To
address this issue, we propose a novel vector-vector-matrix architecture
(VVMA), which greatly reduces the latency at inference time for NMT. This
architecture takes advantage of specialized hardware that has low-latency
vector-vector operations and higher-latency vector-matrix operations. It also
reduces the number of parameters and FLOPs for virtually all models that rely
on efficient matrix multipliers without significantly impacting accuracy. We
present empirical results suggesting that our framework can reduce the latency
of sequence-to-sequence and Transformer models used for NMT by a factor of
four. Finally, we show evidence suggesting that our VVMA extends to other
domains, and we discuss novel hardware for its efficient use.
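As a rough illustration of the decomposition described in the abstract, here is a minimal NumPy sketch in which each layer's dense weight matrix is replaced by two per-layer vectors wrapped around a single matrix shared across layers; the dimensions, names, and the choice of one shared matrix are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512  # hidden size (illustrative)

# Shared matrix: in a VVMA-style design this is the part handled by the
# efficient (but slower-to-reload) hardware matrix multiplier.
M = rng.standard_normal((d, d)) / np.sqrt(d)

class VVMALayer:
    """One layer computing y = u * (M @ (v * x)).

    Only the two d-dimensional vectors u and v are layer-specific, so each
    layer adds two O(d) vector-vector products around one pass through the
    shared multiplier, instead of storing and applying a full d x d matrix.
    """
    def __init__(self):
        self.v = rng.standard_normal(d)  # input-side scaling vector
        self.u = rng.standard_normal(d)  # output-side scaling vector

    def __call__(self, x):
        return self.u * (M @ (self.v * x))

x = rng.standard_normal(d)
for layer in [VVMALayer() for _ in range(6)]:
    x = layer(x)
print(x.shape)  # (512,)
```

Under this factorization a layer stores 2d parameters rather than d^2, which is consistent with the abstract's claim of reduced parameters and FLOPs whenever an efficient matrix multiplier is available.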
Related papers
- Few-Shot Testing: Estimating Uncertainty of Memristive Deep Neural Networks Using One Bayesian Test Vector [0.0]
We propose a test vector generation framework that can estimate the model uncertainty of NNs implemented on memristor-based compute-in-memory (CIM) hardware.
Our method is evaluated across model dimensions, tasks, fault rates, and variation noise, and consistently achieves 100% coverage with only 0.024 MB of memory overhead.
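For intuition about the problem this addresses, the sketch below estimates the output uncertainty of a memristor-style layer by plain Monte-Carlo noise injection on the weights, probed with a single fixed test vector. This is a generic baseline, not the paper's one-shot Bayesian test-vector method, and the noise model and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out = 64, 10
W = rng.standard_normal((d_out, d_in)) * 0.1  # ideal programmed weights
test_vec = rng.standard_normal(d_in)          # one fixed test vector

def noisy_forward(W, x, sigma=0.05):
    # Multiplicative Gaussian noise mimicking memristor conductance variation.
    W_noisy = W * (1.0 + sigma * rng.standard_normal(W.shape))
    return W_noisy @ x

# Monte-Carlo estimate of per-output deviation under device noise.
samples = np.stack([noisy_forward(W, test_vec) for _ in range(1000)])
print("per-output std under device noise:", samples.std(axis=0).round(3))
```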
arXiv Detail & Related papers (2024-05-29T08:53:16Z) - Use Your INSTINCT: INSTruction optimization for LLMs usIng Neural bandits Coupled with Transformers [66.823588073584]
Large language models (LLMs) have shown remarkable instruction-following capabilities and achieved impressive performances in various applications.
Recent work has used the query-efficient Bayesian optimization (BO) algorithm to automatically optimize the instructions given to black-box LLMs.
We propose a neural bandit algorithm that replaces the Gaussian process (GP) in BO with a neural network (NN) surrogate to optimize instructions for black-box LLMs.
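As a hedged stand-in for the idea (using a linear LinUCB-style bandit rather than the paper's NN surrogate, with a synthetic black-box scorer and invented names), the loop below fits a surrogate to observed instruction scores and picks the next query by an optimism bonus:

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical setup: each candidate instruction has a fixed embedding, and
# querying the black-box LLM score for an instruction is expensive.
n_candidates, dim = 200, 16
embeddings = rng.standard_normal((n_candidates, dim))
hidden = rng.standard_normal(dim)  # synthetic ground truth, unknown to the bandit

def black_box_score(i):
    return embeddings[i] @ hidden + 0.1 * rng.standard_normal()

X, y = [], []
chosen = int(rng.integers(n_candidates))  # first query at random
for step in range(20):
    X.append(embeddings[chosen]); y.append(black_box_score(chosen))
    A, b = np.array(X), np.array(y)
    w, *_ = np.linalg.lstsq(A, b, rcond=None)        # fit the surrogate
    design_inv = np.linalg.inv(A.T @ A + np.eye(dim))
    bonus = np.sqrt(np.einsum('nd,dk,nk->n', embeddings, design_inv, embeddings))
    chosen = int(np.argmax(embeddings @ w + bonus))  # exploit + explore
print("best instruction index found:", chosen)
```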
arXiv Detail & Related papers (2023-10-02T02:01:16Z) - Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders-of-magnitude improvements in energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
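snnTorch is publicly available; as a small usage sketch on standard PyTorch (not the IPU-optimized release the paper presents), a leaky integrate-and-fire neuron can be stepped through time like this, with input sizes chosen arbitrarily:

```python
import torch
import snntorch as snn

lif = snn.Leaky(beta=0.9)    # leaky integrate-and-fire neuron, membrane decay beta
mem = lif.init_leaky()       # initial membrane potential

inputs = torch.rand(100, 1)  # 100 time steps of input current (illustrative)
spikes = []
for step in range(inputs.size(0)):
    spk, mem = lif(inputs[step], mem)  # one simulation time step
    spikes.append(spk)

print(torch.stack(spikes).sum().item(), "spikes emitted")
```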
arXiv Detail & Related papers (2022-11-19T15:44:08Z) - MAPLE-X: Latency Prediction with Explicit Microprocessor Prior Knowledge [87.41163540910854]
Deep neural network (DNN) latency characterization is a time-consuming process.
We propose MAPLE-X which extends MAPLE by incorporating explicit prior knowledge of hardware devices and DNN architecture latency.
arXiv Detail & Related papers (2022-05-25T11:08:20Z) - MemSE: Fast MSE Prediction for Noisy Memristor-Based DNN Accelerators [5.553959304125023]
We theoretically analyze the mean squared error of DNNs that use memristors to compute matrix-vector multiplications (MVM).
We take into account both the quantization noise, due to the necessity of reducing the DNN model size, and the programming noise, stemming from the variability during the programming of the memristance value.
The proposed method is almost two orders of magnitude faster than Monte-Carlo simulation, thus making it possible to optimize the implementation parameters to achieve minimal error for a given power constraint.
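As a toy version of that trade-off (our own simplification, with multiplicative Gaussian programming noise only and quantization ignored), the closed-form MSE of a noisy MVM can be checked against a far slower Monte-Carlo estimate:

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_out, sigma = 256, 128, 0.05
W = rng.standard_normal((d_out, d_in)) * 0.1
x = rng.standard_normal(d_in)

# Closed form: for W_noisy = W * (1 + sigma * G) with G ~ iid N(0, 1),
# E||(W_noisy - W) @ x||^2 = sigma^2 * sum_ij W_ij^2 x_j^2.
mse_analytic = sigma**2 * ((W**2) @ (x**2)).sum()

# Monte-Carlo reference: this repeated sampling is the cost that
# closed-form analyses avoid.
errs = [
    np.sum(((W * (1 + sigma * rng.standard_normal(W.shape))) @ x - W @ x) ** 2)
    for _ in range(5000)
]
print(f"analytic: {mse_analytic:.4f}  monte-carlo: {np.mean(errs):.4f}")
```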
arXiv Detail & Related papers (2022-05-03T18:10:43Z) - An Adaptive Device-Edge Co-Inference Framework Based on Soft
Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially on Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete actions (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
With a latency- and accuracy-aware reward design, the computation adapts well to complex environments such as dynamic wireless channels and arbitrary processing, and can support 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z) - NN-LUT: Neural Approximation of Non-Linear Operations for Efficient
Transformer Inference [9.329021390526124]
Non-linear operations such as GELU, layer normalization, and softmax are essential yet costly building blocks of Transformer models.
This paper proposes an accurate and hardware-friendly approximation framework for efficient Transformer inference.
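To see why table-based approximation is hardware-friendly, here is a hedged sketch that replaces GELU with linear interpolation into a small lookup table. NN-LUT learns its breakpoints with a neural network; this sketch simply spaces them uniformly, so it illustrates the mechanism rather than the paper's method.

```python
import numpy as np
from math import erf, sqrt

def gelu(x):
    """Reference (erf-based) GELU, evaluated elementwise."""
    return np.array([0.5 * v * (1.0 + erf(v / sqrt(2.0))) for v in x])

# Piecewise-linear lookup table over a fixed input range.
lo, hi, n_entries = -8.0, 8.0, 64
grid = np.linspace(lo, hi, n_entries)
table = gelu(grid)

def gelu_lut(x):
    """Approximate GELU: clip, then interpolate into the table."""
    return np.interp(np.clip(x, lo, hi), grid, table)

x = np.random.default_rng(3).uniform(-6, 6, 10_000)
print(f"max abs error with {n_entries} entries:",
      f"{np.max(np.abs(gelu_lut(x) - gelu(x))):.4e}")
```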
arXiv Detail & Related papers (2021-12-03T23:06:57Z) - NullaNet Tiny: Ultra-low-latency DNN Inference Through Fixed-function
Combinational Logic [4.119948826527649]
Field-programmable gate array (FPGA)-based accelerators are gaining traction as serious contenders to replace GPU- and CPU-based platforms.
This paper presents NullaNet Tiny, a framework for constructing resource and energy-efficient, ultra-low-latency FPGA-based neural network accelerators.
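As a toy illustration of fixed-function combinational logic (our own simplification; NullaNet Tiny's logic optimization and FPGA flow are far more involved), a binarized neuron over a handful of inputs can be enumerated into a truth table that synthesizes directly to gates:

```python
import numpy as np
from itertools import product

# One binarized neuron: output = 1 iff w . x >= threshold, for binary x.
w = np.array([0.7, -1.2, 0.4, 0.9])  # illustrative weights
threshold = 0.3

# Enumerate all 2^4 binary inputs into a fixed truth table; the table, not
# the weights, is what gets realized as combinational logic.
truth_table = {
    bits: int(w @ np.array(bits) >= threshold)
    for bits in product((0, 1), repeat=len(w))
}
print(truth_table)
```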
arXiv Detail & Related papers (2021-04-07T00:16:39Z) - Joint Deep Reinforcement Learning and Unfolding: Beam Selection and
Precoding for mmWave Multiuser MIMO with Lens Arrays [54.43962058166702]
Millimeter-wave (mmWave) multiuser multiple-input multiple-output (MU-MIMO) systems with discrete lens arrays (DLA) have received great attention.
In this work, we investigate the joint design of a beam precoding matrix for mmWave MU-MIMO systems with DLA.
arXiv Detail & Related papers (2021-01-05T03:55:04Z) - A Learning Framework for n-bit Quantized Neural Networks toward FPGAs [20.83904734716565]
This paper proposes a novel learning framework for n-bit QNNs, whose weights are constrained to the power of two.
We also propose a novel QNN structure named n-BQ-NN, which uses shift operation to replace the multiply operation.
Experiments show that our n-BQ-NN with our shift-based vector processing element (SVPE) executes inference 2.9 times faster than with a conventional vector processing element (VPE).
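As a minimal sketch of the shift idea, assuming a simple nearest-power-of-two rounding scheme (the paper's learning framework and SVPE hardware are out of scope here): once weights are signed powers of two, each multiply in a mat-vec becomes a bit shift.

```python
import numpy as np

def quantize_pow2(w, n_bits=4):
    """Round each weight to the nearest signed power of two.

    With power-of-two weights, x * w reduces to shifting x, which is the
    substitution a shift-based processing element exploits.
    """
    sign = np.sign(w)
    exp = np.clip(np.round(np.log2(np.maximum(np.abs(w), 1e-12))),
                  -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1).astype(int)
    return sign, exp

def shift_matvec(sign, exp, x_int):
    """Integer mat-vec using shifts instead of multiplies (negative
    exponents become right shifts, losing low bits, as in fixed point)."""
    y = np.zeros(sign.shape[0], dtype=np.int64)
    for i in range(sign.shape[0]):
        for j in range(sign.shape[1]):
            if sign[i, j] == 0:
                continue
            e = exp[i, j]
            shifted = x_int[j] << e if e >= 0 else x_int[j] >> -e
            y[i] += int(sign[i, j]) * shifted
    return y

rng = np.random.default_rng(4)
sign, exp = quantize_pow2(rng.standard_normal((4, 8)))
print(shift_matvec(sign, exp, rng.integers(0, 16, size=8)))
```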
arXiv Detail & Related papers (2020-04-06T04:21:24Z) - PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with
Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in design space.
With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to regain and guarantee high hardware efficiency.
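As a hedged sketch of the mechanism (the pattern set below is invented for illustration; PatDNN derives its patterns empirically and relies on compiler code generation for the actual speedup), each 3x3 kernel keeps only the entries of its best-matching pattern:

```python
import numpy as np

# A tiny library of 3x3 patterns, each keeping 4 of 9 weights (illustrative).
PATTERNS = [
    np.array([[0, 1, 0], [1, 1, 1], [0, 0, 0]], dtype=bool),
    np.array([[0, 0, 0], [1, 1, 1], [0, 1, 0]], dtype=bool),
    np.array([[0, 1, 0], [1, 1, 0], [0, 1, 0]], dtype=bool),
    np.array([[0, 1, 0], [0, 1, 1], [0, 1, 0]], dtype=bool),
]

def apply_pattern_pruning(kernels):
    """Keep, per 3x3 kernel, the pattern preserving the most weight magnitude.

    kernels: array of shape (n, 3, 3). Returns the pruned kernels and the
    chosen pattern index per kernel (the regularity the compiler exploits).
    """
    pruned = np.zeros_like(kernels)
    choices = []
    for k, kernel in enumerate(kernels):
        best = int(np.argmax([np.abs(kernel[p]).sum() for p in PATTERNS]))
        choices.append(best)
        pruned[k][PATTERNS[best]] = kernel[PATTERNS[best]]
    return pruned, choices

kernels = np.random.default_rng(5).standard_normal((6, 3, 3))
pruned, choices = apply_pattern_pruning(kernels)
print("patterns chosen per kernel:", choices)
```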
arXiv Detail & Related papers (2020-01-01T04:52:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.