Tiny Neural Models for Seq2Seq
- URL: http://arxiv.org/abs/2108.03340v1
- Date: Sat, 7 Aug 2021 00:39:42 GMT
- Title: Tiny Neural Models for Seq2Seq
- Authors: Arun Kandoor
- Abstract summary: We propose a projection-based encoder-decoder model referred to as pQRNN-MAtt.
The resulting quantized models are less than 3.5MB in size and are well suited for on-device latency-critical applications.
We show that on MTOP, a challenging multilingual semantic parsing dataset, the average model performance surpasses that of an LSTM-based seq2seq model that uses pre-trained embeddings, despite being 85x smaller.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semantic parsing models with applications in task-oriented dialog systems
require efficient sequence-to-sequence (seq2seq) architectures to run
on-device. To this end, we propose a projection-based encoder-decoder model
referred to as pQRNN-MAtt. Prior studies of projection methods were restricted
to encoder-only models; to our knowledge, this is the first study extending
them to seq2seq architectures. The resulting quantized models are less than
3.5MB in size and are well suited for on-device latency-critical applications.
We show that on MTOP, a challenging multilingual semantic parsing dataset, the
average model performance surpasses that of an LSTM-based seq2seq model that
uses pre-trained embeddings, despite being 85x smaller. Furthermore, the model
can be an effective student for distilling large pre-trained models such as
T5/BERT.
Related papers
- Model Stock: All we need is just a few fine-tuned models [34.449901046895185]
This paper introduces an efficient fine-tuning method for large pre-trained models, offering strong in-distribution (ID) and out-of-distribution (OOD) performance.
We employ significantly fewer fine-tuned models to compute the final weights, yet yield superior accuracy (a simplified weight-averaging sketch follows this entry).
We demonstrate the efficacy of Model Stock with fine-tuned models based upon pre-trained CLIP architectures.
arXiv Detail & Related papers (2024-03-28T15:57:20Z)
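As a rough illustration of the weight-merging idea summarized above, the following toy sketch uniformly averages a few fine-tuned checkpoints represented as dicts of NumPy arrays. Model Stock's actual merging rule is more refined; the uniform average and the state-dict format here are stated simplifications.

```python
# Toy sketch: uniform checkpoint averaging, a simplification of Model Stock.
import numpy as np

def average_checkpoints(checkpoints):
    """Uniformly average a list of state dicts (parameter name -> array)."""
    return {name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
            for name in checkpoints[0]}

if __name__ == "__main__":
    ckpt_a = {"linear.weight": np.ones((2, 2)), "linear.bias": np.zeros(2)}
    ckpt_b = {"linear.weight": 3 * np.ones((2, 2)), "linear.bias": np.ones(2)}
    merged = average_checkpoints([ckpt_a, ckpt_b])
    print(merged["linear.weight"])  # all entries 2.0
```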
- MatFormer: Nested Transformer for Elastic Inference [94.1789252941718]
MatFormer is a nested Transformer architecture designed to offer elasticity in a variety of deployment constraints.
We show that a 2.6B decoder-only MatFormer language model (MatLM) allows us to extract smaller models spanning from 1.5B to 2.6B.
We also observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval (the nested slicing idea is sketched below).
arXiv Detail & Related papers (2023-10-11T17:57:14Z)
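A minimal sketch of the nested ("matryoshka") idea behind MatFormer: smaller submodels are obtained by slicing the leading hidden units of a jointly trained feed-forward block. The dimensions and the simple prefix-slicing rule are illustrative assumptions.

```python
# Illustrative sketch of nested FFN slicing, not MatFormer's full recipe.
import numpy as np

def ffn_forward(x, w_in, w_out, hidden_frac=1.0):
    """Run a feed-forward block using only a leading slice of hidden units."""
    h = int(w_in.shape[1] * hidden_frac)
    hidden = np.maximum(x @ w_in[:, :h], 0.0)  # ReLU over the sliced units
    return hidden @ w_out[:h, :]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_in, w_out = rng.normal(size=(16, 64)), rng.normal(size=(64, 16))
    x = rng.normal(size=(1, 16))
    full = ffn_forward(x, w_in, w_out, hidden_frac=1.0)    # the full model
    small = ffn_forward(x, w_in, w_out, hidden_frac=0.25)  # extracted submodel
    print(full.shape, small.shape)  # both (1, 16): same interface, less compute
```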
- Exploiting the Potential of Seq2Seq Models as Robust Few-Shot Learners [8.43854206194162]
We show that seq2seq models can be highly effective few-shot learners for a wide spectrum of applications.
We propose two methods to more effectively elicit in-context learning ability in seq2seq models: objective-aligned prompting and a fusion-based approach.
arXiv Detail & Related papers (2023-07-27T13:37:06Z)
- Recipes for Sequential Pre-training of Multilingual Encoder and Seq2Seq Models [16.49601740473416]
We explore recipes to improve training efficiency by initializing one model from the other.
Using an encoder to warm-start seq2seq training, we show that we can match the task performance of a seq2seq model trained from scratch (a minimal warm-start sketch follows this entry).
arXiv Detail & Related papers (2023-06-14T21:41:52Z)
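A minimal sketch of the encoder warm-start recipe described above: parameters from a pre-trained encoder are copied by name into a freshly initialized seq2seq model, so only the decoder trains from scratch. The flat name-to-array parameter format and the `encoder.` prefix are assumptions for illustration.

```python
# Illustrative sketch of warm-starting a seq2seq model from an encoder.
import numpy as np

def warm_start(seq2seq_params, encoder_params):
    """Copy pre-trained encoder weights into a fresh seq2seq parameter dict."""
    for name, weight in encoder_params.items():
        key = f"encoder.{name}"  # assumed naming convention
        if key in seq2seq_params and seq2seq_params[key].shape == weight.shape:
            seq2seq_params[key] = weight.copy()
    return seq2seq_params

if __name__ == "__main__":
    seq2seq = {"encoder.layer0": np.zeros((4, 4)),
               "decoder.layer0": np.zeros((4, 4))}
    encoder = {"layer0": np.ones((4, 4))}
    warmed = warm_start(seq2seq, encoder)
    # Encoder weights are warm-started; decoder weights stay at init.
    print(warmed["encoder.layer0"][0], warmed["decoder.layer0"][0])
```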
- Speculative Decoding with Big Little Decoder [108.95187338417541]
Big Little Decoder (BiLD) is a framework that can improve inference efficiency and latency for a wide range of text generation applications.
On an NVIDIA T4 GPU, our framework achieves a speedup of up to 2.12x with minimal generation quality degradation.
Our framework is fully plug-and-play and can be applied without any modifications to the training process or model architecture (a toy draft-and-fallback loop is sketched below).
arXiv Detail & Related papers (2023-02-15T18:55:29Z)
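A minimal sketch of the big-little decoding idea: a small model drafts tokens cheaply, and the large model only steps in when the draft looks uncertain. The toy "models" and the fixed confidence threshold are assumptions for illustration; BiLD's actual fallback and rollback policy differs.

```python
# Illustrative sketch of big/little decoding, not BiLD's exact policy.
import random

def small_model(prefix):
    """Toy draft model: returns (next_token, confidence)."""
    return random.choice("abc"), random.random()

def big_model(prefix):
    """Toy large model: returns the token it would emit."""
    return "X"

def big_little_decode(steps=10, threshold=0.5):
    out = []
    for _ in range(steps):
        token, confidence = small_model(out)
        if confidence < threshold:  # low confidence: fall back to the big model
            token = big_model(out)
        out.append(token)
    return "".join(out)

if __name__ == "__main__":
    random.seed(0)
    print(big_little_decode())
```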
- FlexiBERT: Are Current Transformer Architectures too Homogeneous and Rigid? [7.813154720635396]
We propose a suite of heterogeneous and flexible models, namely FlexiBERT, that have varied encoder layers with a diverse set of possible operations.
We also propose a novel NAS policy, called BOSHNAS, that leverages this new scheme, Bayesian modeling, and second-order optimization.
A comprehensive set of experiments shows that the proposed policy, when applied to the FlexiBERT design space, pushes the performance frontier upwards compared to traditional models.
arXiv Detail & Related papers (2022-05-23T22:44:34Z)
- What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization? [50.84738303888189]
We present a large-scale evaluation of modeling choices and their impact on zero-shot generalization.
We train models with over 5 billion parameters for more than 170 billion tokens.
We find that pretrained causal decoder models can be efficiently adapted into non-causal decoder models.
arXiv Detail & Related papers (2022-04-12T14:19:49Z)
- MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models achieve superior performance on most NLP tasks thanks to their large parameter capacity, but that capacity also incurs a huge computation cost.
We explore accelerating large-model inference through conditional computation based on the sparse-activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication (a toy partition-and-route version is sketched below).
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
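A minimal sketch of the MoEfication idea: partition an FFN's hidden units into "experts" and, per input, run only the experts whose units are most likely to activate. The contiguous partitioning and the crude router below are illustrative assumptions (the paper clusters neurons and learns a router).

```python
# Illustrative sketch of FFN-to-MoE conversion, not the paper's method.
import numpy as np

def moe_ffn(x, w_in, w_out, num_experts=4, top_k=1):
    """FFN whose hidden units are partitioned into experts; run only top_k."""
    experts_in = np.split(w_in, num_experts, axis=1)
    experts_out = np.split(w_out, num_experts, axis=0)
    # Cheap routing proxy: mean pre-activation per expert.
    scores = [float(np.mean(x @ w)) for w in experts_in]
    chosen = np.argsort(scores)[-top_k:]
    y = np.zeros(w_out.shape[1])
    for e in chosen:  # the other experts are skipped entirely
        y += np.maximum(x @ experts_in[e], 0.0) @ experts_out[e]
    return y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_in, w_out = rng.normal(size=(8, 32)), rng.normal(size=(32, 8))
    print(moe_ffn(rng.normal(size=8), w_in, w_out).shape)  # (8,)
```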
- When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable.
To achieve better accuracy, we propose two lightweight modules.
DQInit dynamically initializes the decoder's queries from the inputs, enabling the model to achieve accuracy as good as models with multiple decoder layers.
QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one (sketched below).
arXiv Detail & Related papers (2021-05-27T13:51:42Z)
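A minimal sketch of the per-query memory idea described in the QAMem summary: each query attends over the keys as usual but reads from its own private value memory rather than one shared value map. Shapes and the plain dot-product attention are illustrative assumptions.

```python
# Illustrative sketch of query-specific value memories (QAMem-style idea).
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def qamem_attention(queries, keys, values_per_query):
    """queries: (Q, d); keys: (N, d); values_per_query: (Q, N, d)."""
    out = np.empty_like(queries)
    for q in range(queries.shape[0]):
        attn = softmax(keys @ queries[q])    # (N,) attention weights
        out[q] = attn @ values_per_query[q]  # this query's private values
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, n, d = 3, 5, 4
    print(qamem_attention(rng.normal(size=(q, d)),
                          rng.normal(size=(n, d)),
                          rng.normal(size=(q, n, d))).shape)  # (3, 4)
```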
- ANNETTE: Accurate Neural Network Execution Time Estimation with Stacked Models [56.21470608621633]
We propose a time-estimation framework that decouples the architectural search from the target hardware.
The proposed methodology extracts a set of models from micro-kernel and multi-layer benchmarks and generates a stacked model for mapping and network execution time estimation (a toy stacked estimator is sketched below).
For evaluation, we compare the estimation accuracy and fidelity of the generated mixed and statistical models against a roofline model and a refined roofline model.
arXiv Detail & Related papers (2021-05-07T11:39:05Z)
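A minimal sketch in the spirit of stacked execution-time estimation: fit a simple per-layer latency model from benchmark points, then sum per-layer estimates to predict a whole network's execution time. The linear model over (GMACs, GB moved) features is an illustrative assumption, not ANNETTE's actual feature set.

```python
# Illustrative sketch of per-layer latency fitting and summation.
import numpy as np

def fit_layer_model(features, measured_ms):
    """Least-squares fit of per-layer latency: latency ~ features @ coef."""
    coef, *_ = np.linalg.lstsq(features, measured_ms, rcond=None)
    return coef

def estimate_network(layer_features, coef):
    """Predict whole-network time as the sum of per-layer estimates."""
    return sum(float(np.asarray(f) @ coef) for f in layer_features)

if __name__ == "__main__":
    # Benchmark points: columns are (GMACs, GB moved); targets in ms.
    bench = np.array([[1.0, 0.1], [2.0, 0.2], [4.0, 0.3]])
    times = np.array([1.2, 2.3, 4.4])
    coef = fit_layer_model(bench, times)
    print(estimate_network([[0.5, 0.05], [1.5, 0.10]], coef))
```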
- Non-Autoregressive Semantic Parsing for Compositional Task-Oriented Dialog [22.442123799917074]
We propose a non-autoregressive approach to predict semantic parse trees with an efficient seq2seq model architecture.
By combining non-autoregressive prediction with convolutional neural networks, we achieve significant latency gains and parameter size reduction (a toy parallel decoder is sketched below).
arXiv Detail & Related papers (2021-04-11T05:44:35Z)
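A minimal sketch of non-autoregressive decoding as summarized above: predict the output length first, then emit every position in one parallel step instead of one token at a time. The toy length predictor, the position scaling, and the tiny parse-token vocabulary are illustrative assumptions.

```python
# Illustrative sketch of length-then-parallel decoding, not the paper's model.
import numpy as np

def nar_decode(encoded, w_len, w_tok, vocab):
    """Predict target length, then emit every position in one parallel step."""
    length = int(np.argmax(encoded @ w_len)) + 1             # length prediction
    positions = np.outer(np.arange(1, length + 1), encoded)  # (length, d)
    logits = positions @ w_tok                               # (length, |V|)
    return [vocab[i] for i in logits.argmax(axis=1)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, max_len = 8, 6
    vocab = ["[IN:GET_WEATHER", "[SL:LOCATION", "]", "text"]
    w_len = rng.normal(size=(d, max_len))
    w_tok = rng.normal(size=(d, len(vocab)))
    print(nar_decode(rng.normal(size=d), w_len, w_tok, vocab))
```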
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.