pNLP-Mixer: an Efficient all-MLP Architecture for Language
- URL: http://arxiv.org/abs/2202.04350v2
- Date: Thu, 25 May 2023 08:48:32 GMT
- Title: pNLP-Mixer: an Efficient all-MLP Architecture for Language
- Authors: Francesco Fusco, Damian Pascual, Peter Staar, Diego Antognini
- Abstract summary: pNLP-Mixer model for on-device NLP achieves high weight-efficiency thanks to a novel projection layer.
We evaluate a pNLP-Mixer model of only one megabyte in size on two multi-lingual semantic parsing datasets, MTOP and multiATIS.
Our model consistently beats the state-of-the-art of tiny models, which is twice as large, by a margin of up to 7.8% on MTOP.
- Score: 10.634940525287014
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large pre-trained language models based on transformer architecture have
drastically changed the natural language processing (NLP) landscape. However,
deploying those models for on-device applications on constrained devices such
as smart watches is completely impractical due to their size and inference
cost. As an alternative to transformer-based architectures, recent work on
efficient NLP has shown that weight-efficient models can attain competitive
performance for simple tasks, such as slot filling and intent classification,
with model sizes on the order of one megabyte. This work introduces the
pNLP-Mixer architecture, an embedding-free MLP-Mixer model for on-device NLP
that achieves high weight-efficiency thanks to a novel projection layer. We
evaluate a pNLP-Mixer model of only one megabyte in size on two multi-lingual
semantic parsing datasets, MTOP and multiATIS. Our quantized model achieves
99.4% and 97.8% of the performance of mBERT on MTOP and multiATIS, while using
170x fewer parameters. Our model consistently beats the state-of-the-art of
tiny models (pQRNN), which is twice as large, by a margin of up to 7.8% on MTOP.
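For a concrete picture, the sketch below illustrates the kind of pipeline the abstract describes: a non-trainable, hash-based projection replaces the embedding table, a bottleneck linear layer maps the projected features to the mixer width, and standard MLP-Mixer blocks (token mixing followed by channel mixing) feed a per-token classification head. The hashing scheme, module names, and sizes here are illustrative assumptions, not the authors' implementation; the paper's actual projection layer is more elaborate.

```python
# Schematic sketch of an embedding-free "projection + MLP-Mixer" pipeline.
# All components and sizes are illustrative stand-ins, not the paper's code.
import torch
import torch.nn as nn


def hash_projection(tokens, feature_size=256, num_hashes=4):
    """Map each token string to a fixed-size binary feature vector using plain
    hashing, so no trainable embedding table is needed. A real implementation
    would use a deterministic hash (Python's hash() is salted per process)."""
    feats = torch.zeros(len(tokens), feature_size)
    for i, tok in enumerate(tokens):
        for h in range(num_hashes):
            feats[i, hash((tok, h)) % feature_size] = 1.0
    return feats  # (seq_len, feature_size)


class MixerBlock(nn.Module):
    """Standard MLP-Mixer block: token mixing across the sequence,
    then channel mixing within each token."""

    def __init__(self, seq_len, hidden_dim, token_dim=64, channel_dim=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(seq_len, token_dim), nn.GELU(), nn.Linear(token_dim, seq_len)
        )
        self.norm2 = nn.LayerNorm(hidden_dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(hidden_dim, channel_dim), nn.GELU(), nn.Linear(channel_dim, hidden_dim)
        )

    def forward(self, x):                        # x: (seq_len, hidden_dim)
        y = self.norm1(x).transpose(0, 1)        # (hidden_dim, seq_len)
        x = x + self.token_mlp(y).transpose(0, 1)
        x = x + self.channel_mlp(self.norm2(x))
        return x


class TinyPNLPMixerSketch(nn.Module):
    def __init__(self, seq_len=64, feature_size=256, hidden_dim=64, num_labels=10):
        super().__init__()
        self.bottleneck = nn.Linear(feature_size, hidden_dim)
        self.blocks = nn.Sequential(*[MixerBlock(seq_len, hidden_dim) for _ in range(2)])
        self.head = nn.Linear(hidden_dim, num_labels)

    def forward(self, projected):                # (seq_len, feature_size)
        x = self.bottleneck(projected)
        x = self.blocks(x)
        return self.head(x)                      # per-token logits, e.g. for slot filling


# Usage: inputs are padded/truncated to a fixed sequence length.
tokens = ["book", "a", "flight", "to", "zurich"] + ["<pad>"] * 59
logits = TinyPNLPMixerSketch()(hash_projection(tokens))   # (64, num_labels)
```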
Related papers
- MatFormer: Nested Transformer for Elastic Inference [94.1789252941718]
MatFormer is a nested Transformer architecture designed to offer elasticity in a variety of deployment constraints.
We show that a 2.6B decoder-only MatFormer language model (MatLM) allows us to extract smaller models spanning from 1.5B to 2.6B.
We also observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval.
arXiv Detail & Related papers (2023-10-11T17:57:14Z) - TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series
Forecasting [13.410217680999459]
Transformers have gained popularity in time series forecasting for their ability to capture long-sequence interactions.
High memory and computing requirements pose a critical bottleneck for long-term forecasting.
We propose TSMixer, a lightweight neural architecture composed of multi-layer perceptron (MLP) modules.
arXiv Detail & Related papers (2023-06-14T06:26:23Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, augmenting Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided
Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z) - DynaMixer: A Vision MLP Architecture with Dynamic Mixing [38.23027495545522]
This paper presents an efficient MLP-like network architecture, dubbed DynaMixer, resorting to dynamic information fusion.
We propose a procedure, on which the DynaMixer model relies, to dynamically generate mixing matrices by leveraging the contents of all the tokens to be mixed.
Our proposed DynaMixer model (97M parameters) achieves 84.3% top-1 accuracy on the ImageNet-1K, performing favorably against the state-of-the-art vision models.
arXiv Detail & Related papers (2022-01-28T12:43:14Z) - MoEfication: Conditional Computation of Transformer Models for Efficient
Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to large parameter capacity, but also lead to huge computation cost.
We explore accelerating large-model inference through conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z) - Small-Bench NLP: Benchmark for small single GPU trained models in
Natural Language Processing [0.0]
Small-Bench NLP is a benchmark for small efficient neural language models trained on a single GPU.
Our ELECTRA-DeBERTa small model architecture achieves an average score of 81.53, which is comparable to BERT-Base's 82.20 (110M parameters).
arXiv Detail & Related papers (2021-09-22T17:18:55Z) - Sparse MLP for Image Recognition: Is Self-Attention Really Necessary? [65.37917850059017]
We build an attention-free network called sMLPNet.
For 2D image tokens, sMLP applies 1D MLPs along the axial directions, with parameters shared among rows or columns (a rough sketch follows after this list).
When scaling up to 66M parameters, sMLPNet achieves 83.4% top-1 accuracy, which is on par with the state-of-the-art Swin Transformer.
arXiv Detail & Related papers (2021-09-12T04:05:15Z) - Sparse-MLP: A Fully-MLP Architecture with Conditional Computation [7.901786481399378]
Mixture-of-Experts (MoE) with sparse conditional computation has proven to be an effective architecture for scaling attention-based models to more parameters with comparable computation cost.
We propose Sparse-MLP, scaling the recent MLP-Mixer model with MoE, to achieve a more computation-efficient architecture.
arXiv Detail & Related papers (2021-09-05T06:43:08Z) - MLP-Mixer: An all-MLP Architecture for Vision [93.16118698071993]
We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs).
Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models.
arXiv Detail & Related papers (2021-05-04T16:17:21Z) - LiteMuL: A Lightweight On-Device Sequence Tagger using Multi-task
Learning [1.3192560874022086]
LiteMuL is a lightweight on-device sequence tagger that can efficiently process the user conversations using a Multi-Task Learning approach.
Our model is competitive with other MTL approaches on NER and POS tasks while outshining them with a lower memory footprint.
arXiv Detail & Related papers (2020-12-15T19:15:54Z)
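The axial mixing summarized in the sMLPNet entry above can be sketched roughly as follows: one linear map mixes tokens along each row of the 2D token grid and another along each column, with the branches fused by a pointwise projection. This is a hedged illustration under assumed shapes, not the authors' code.

```python
# Rough sketch of axial 1D mixing as summarized in the sMLPNet entry above.
# Names and sizes are illustrative assumptions.
import torch
import torch.nn as nn


class AxialMix(nn.Module):
    def __init__(self, height, width, channels):
        super().__init__()
        self.mix_w = nn.Linear(width, width)     # mix within each row; weights shared across rows and channels
        self.mix_h = nn.Linear(height, height)   # mix within each column; weights shared across columns and channels
        self.proj = nn.Linear(3 * channels, channels)  # fuse identity, row-mixed, and column-mixed branches

    def forward(self, x):                        # x: (channels, height, width)
        row = self.mix_w(x)                                  # mix along the width axis
        col = self.mix_h(x.transpose(1, 2)).transpose(1, 2)  # mix along the height axis
        fused = torch.cat([x, row, col], dim=0)              # (3*channels, height, width)
        return self.proj(fused.permute(1, 2, 0)).permute(2, 0, 1)  # back to (channels, height, width)


# Usage on a hypothetical 14x14 grid of 64-channel image tokens.
feat = torch.randn(64, 14, 14)
out = AxialMix(height=14, width=14, channels=64)(feat)   # same shape: (64, 14, 14)
```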
This list is automatically generated from the titles and abstracts of the papers on this site.