pNLP-Mixer: an Efficient all-MLP Architecture for Language
- URL: http://arxiv.org/abs/2202.04350v2
- Date: Thu, 25 May 2023 08:48:32 GMT
- Title: pNLP-Mixer: an Efficient all-MLP Architecture for Language
- Authors: Francesco Fusco, Damian Pascual, Peter Staar, Diego Antognini
- Abstract summary: pNLP-Mixer model for on-device NLP achieves high weight-efficiency thanks to a novel projection layer.
We evaluate a pNLP-Mixer model of only one megabyte in size on two multi-lingual semantic parsing datasets, MTOP and multiATIS.
Our model consistently beats the state-of-the-art of tiny models, which is twice as large, by a margin of up to 7.8% on MTOP.
- Score: 10.634940525287014
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large pre-trained language models based on transformer architecture have
drastically changed the natural language processing (NLP) landscape. However,
deploying those models for on-device applications on constrained devices such
as smart watches is completely impractical due to their size and inference
cost. As an alternative to transformer-based architectures, recent work on
efficient NLP has shown that weight-efficient models can attain competitive
performance for simple tasks, such as slot filling and intent classification,
with model sizes on the order of one megabyte. This work introduces the
pNLP-Mixer architecture, an embedding-free MLP-Mixer model for on-device NLP
that achieves high weight-efficiency thanks to a novel projection layer. We
evaluate a pNLP-Mixer model of only one megabyte in size on two multi-lingual
semantic parsing datasets, MTOP and multiATIS. Our quantized model achieves
99.4% and 97.8% of the performance of mBERT on MTOP and multiATIS, while using
170x fewer parameters. Our model consistently beats the state-of-the-art of
tiny models (pQRNN), which is twice as large, by a margin of up to 7.8% on MTOP.
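For a concrete picture, the sketch below illustrates the kind of pipeline the abstract describes: a non-trainable, hash-based projection replaces the embedding table, a bottleneck linear layer maps the projected features to the mixer width, and standard MLP-Mixer blocks (token mixing followed by channel mixing) feed a per-token classification head. The hashing scheme, module names, and sizes here are illustrative assumptions, not the authors' implementation; the paper's actual projection layer is more elaborate.

```python
# Schematic sketch of an embedding-free "projection + MLP-Mixer" pipeline.
# All components and sizes are illustrative stand-ins, not the paper's code.
import torch
import torch.nn as nn


def hash_projection(tokens, feature_size=256, num_hashes=4):
    """Map each token string to a fixed-size binary feature vector using plain
    hashing, so no trainable embedding table is needed. A real implementation
    would use a deterministic hash (Python's hash() is salted per process)."""
    feats = torch.zeros(len(tokens), feature_size)
    for i, tok in enumerate(tokens):
        for h in range(num_hashes):
            feats[i, hash((tok, h)) % feature_size] = 1.0
    return feats  # (seq_len, feature_size)


class MixerBlock(nn.Module):
    """Standard MLP-Mixer block: token mixing across the sequence,
    then channel mixing within each token."""

    def __init__(self, seq_len, hidden_dim, token_dim=64, channel_dim=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(seq_len, token_dim), nn.GELU(), nn.Linear(token_dim, seq_len)
        )
        self.norm2 = nn.LayerNorm(hidden_dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(hidden_dim, channel_dim), nn.GELU(), nn.Linear(channel_dim, hidden_dim)
        )

    def forward(self, x):                        # x: (seq_len, hidden_dim)
        y = self.norm1(x).transpose(0, 1)        # (hidden_dim, seq_len)
        x = x + self.token_mlp(y).transpose(0, 1)
        x = x + self.channel_mlp(self.norm2(x))
        return x


class TinyPNLPMixerSketch(nn.Module):
    def __init__(self, seq_len=64, feature_size=256, hidden_dim=64, num_labels=10):
        super().__init__()
        self.bottleneck = nn.Linear(feature_size, hidden_dim)
        self.blocks = nn.Sequential(*[MixerBlock(seq_len, hidden_dim) for _ in range(2)])
        self.head = nn.Linear(hidden_dim, num_labels)

    def forward(self, projected):                # (seq_len, feature_size)
        x = self.bottleneck(projected)
        x = self.blocks(x)
        return self.head(x)                      # per-token logits, e.g. for slot filling


# Usage: inputs are padded/truncated to a fixed sequence length.
tokens = ["book", "a", "flight", "to", "zurich"] + ["<pad>"] * 59
logits = TinyPNLPMixerSketch()(hash_projection(tokens))   # (64, num_labels)
```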
Related papers
- MatFormer: Nested Transformer for Elastic Inference [94.1789252941718]
MatFormer is a nested Transformer architecture designed to offer elasticity in a variety of deployment constraints.
We show that a 2.6B decoder-only MatFormer language model (MatLM) allows us to extract smaller models spanning from 1.5B to 2.6B.
We also observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval.
arXiv Detail & Related papers (2023-10-11T17:57:14Z) - TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series
Forecasting [13.410217680999459]
Transformers have gained popularity in time series forecasting for their ability to capture long-sequence interactions.
High memory and computing requirements pose a critical bottleneck for long-term forecasting.
We propose TSMixer, a lightweight neural architecture composed of multi-layer perceptron (MLP) modules.
arXiv Detail & Related papers (2023-06-14T06:26:23Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, augmenting Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided
Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z) - DynaMixer: A Vision MLP Architecture with Dynamic Mixing [38.23027495545522]
This paper presents an efficient MLP-like network architecture, dubbed DynaMixer, resorting to dynamic information fusion.
We propose a procedure, on which the DynaMixer model relies, to dynamically generate mixing matrices by leveraging the contents of all the tokens to be mixed.
Our proposed DynaMixer model (97M parameters) achieves 84.3% top-1 accuracy on the ImageNet-1K, performing favorably against the state-of-the-art vision models.
arXiv Detail & Related papers (2022-01-28T12:43:14Z) - MoEfication: Conditional Computation of Transformer Models for Efficient
Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to large parameter capacity, but also lead to huge computation cost.
We explore accelerating large-model inference through conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z) - Small-Bench NLP: Benchmark for small single GPU trained models in
Natural Language Processing [0.0]
Small-Bench NLP is a benchmark for small efficient neural language models trained on a single GPU.
Our ELECTRA-DeBERTa small model architecture achieves an average score of 81.53, which is comparable to BERT-Base's 82.20 (110M parameters).
arXiv Detail & Related papers (2021-09-22T17:18:55Z) - Sparse MLP for Image Recognition: Is Self-Attention Really Necessary? [65.37917850059017]
We build an attention-free network called sMLPNet.
For 2D image tokens, sMLP applies 1D MLPs along the axial directions, with parameters shared among rows or columns (a rough sketch follows after this list).
When scaling up to 66M parameters, sMLPNet achieves 83.4% top-1 accuracy, which is on par with the state-of-the-art Swin Transformer.
arXiv Detail & Related papers (2021-09-12T04:05:15Z) - Sparse-MLP: A Fully-MLP Architecture with Conditional Computation [7.901786481399378]
Mixture-of-Experts (MoE) with sparse conditional computation has proven to be an effective architecture for scaling attention-based models to more parameters with comparable computation cost.
We propose Sparse-MLP, scaling the recent MLP-Mixer model with MoE, to achieve a more computation-efficient architecture.
arXiv Detail & Related papers (2021-09-05T06:43:08Z) - MLP-Mixer: An all-MLP Architecture for Vision [93.16118698071993]
We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs).
Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models.
arXiv Detail & Related papers (2021-05-04T16:17:21Z) - LiteMuL: A Lightweight On-Device Sequence Tagger using Multi-task
Learning [1.3192560874022086]
LiteMuL is a lightweight on-device sequence tagger that can efficiently process the user conversations using a Multi-Task Learning approach.
Our model is competitive with other MTL approaches on NER and POS tasks while outshining them with a lower memory footprint.
arXiv Detail & Related papers (2020-12-15T19:15:54Z)
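The axial mixing summarized in the sMLPNet entry above can be sketched roughly as follows: one linear map mixes tokens along each row of the 2D token grid and another along each column, with the branches fused by a pointwise projection. This is a hedged illustration under assumed shapes, not the authors' code.

```python
# Rough sketch of axial 1D mixing as summarized in the sMLPNet entry above.
# Names and sizes are illustrative assumptions.
import torch
import torch.nn as nn


class AxialMix(nn.Module):
    def __init__(self, height, width, channels):
        super().__init__()
        self.mix_w = nn.Linear(width, width)     # mix within each row; weights shared across rows and channels
        self.mix_h = nn.Linear(height, height)   # mix within each column; weights shared across columns and channels
        self.proj = nn.Linear(3 * channels, channels)  # fuse identity, row-mixed, and column-mixed branches

    def forward(self, x):                        # x: (channels, height, width)
        row = self.mix_w(x)                                  # mix along the width axis
        col = self.mix_h(x.transpose(1, 2)).transpose(1, 2)  # mix along the height axis
        fused = torch.cat([x, row, col], dim=0)              # (3*channels, height, width)
        return self.proj(fused.permute(1, 2, 0)).permute(2, 0, 1)  # back to (channels, height, width)


# Usage on a hypothetical 14x14 grid of 64-channel image tokens.
feat = torch.randn(64, 14, 14)
out = AxialMix(height=14, width=14, channels=64)(feat)   # same shape: (64, 14, 14)
```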
This list is automatically generated from the titles and abstracts of the papers on this site.