FlexiBERT: Are Current Transformer Architectures too Homogeneous and
Rigid?
- URL: http://arxiv.org/abs/2205.11656v1
- Date: Mon, 23 May 2022 22:44:34 GMT
- Title: FlexiBERT: Are Current Transformer Architectures too Homogeneous and
Rigid?
- Authors: Shikhar Tuli, Bhishma Dedhia, Shreshth Tuli, and Niraj K. Jha
- Abstract summary: We propose a suite of heterogeneous and flexible models, namely FlexiBERT, that have varied encoder layers with a diverse set of possible operations.
We also propose a novel NAS policy, called BOSHNAS, that leverages this new scheme, Bayesian modeling, and second-order optimization.
A comprehensive set of experiments shows that the proposed policy, when applied to the FlexiBERT design space, pushes the performance frontier upwards compared to traditional models.
- Score: 7.813154720635396
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The existence of a plethora of language models makes the problem of selecting
the best one for a custom task challenging. Most state-of-the-art methods
leverage transformer-based models (e.g., BERT) or their variants. Training such
models and exploring their hyperparameter space, however, is computationally
expensive. Prior work proposes several neural architecture search (NAS) methods
that employ performance predictors (e.g., surrogate models) to address this
issue; however, analysis has been limited to homogeneous models that use fixed
dimensionality throughout the network. This leads to sub-optimal architectures.
To address this limitation, we propose a suite of heterogeneous and flexible
models, namely FlexiBERT, that have varied encoder layers with a diverse set of
possible operations and different hidden dimensions. For better-posed surrogate
modeling in this expanded design space, we propose a new graph-similarity-based
embedding scheme. We also propose a novel NAS policy, called BOSHNAS, that
leverages this new scheme, Bayesian modeling, and second-order optimization, to
quickly train and use a neural surrogate model to converge to the optimal
architecture. A comprehensive set of experiments shows that the proposed
policy, when applied to the FlexiBERT design space, pushes the performance
frontier upwards compared to traditional models. FlexiBERT-Mini, one of our
proposed models, has 3% fewer parameters than BERT-Mini and achieves an 8.9%
higher GLUE score. A FlexiBERT model with performance equivalent to that of the
best homogeneous model is 2.6x smaller. FlexiBERT-Large, another proposed
model, achieves state-of-the-art results, outperforming the baseline models by
at least 5.7% on the GLUE benchmark.
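To make the search procedure concrete, the sketch below (Python) shows a minimal surrogate-driven NAS loop in the spirit of BOSHNAS: each architecture is mapped to an embedding, a surrogate predicts its score together with an uncertainty estimate, and an acquisition rule that trades off the two picks the next architecture to evaluate. Everything here (the toy design space, embed_architecture, evaluate_architecture, and the nearest-neighbour surrogate) is an illustrative stand-in, not the paper's graph-similarity embedding, Bayesian neural surrogate, or second-order optimizer.
```python
# Minimal, illustrative surrogate-driven NAS loop in the spirit of BOSHNAS.
# All names and the design space below are toy stand-ins, not the paper's code.
import math
import random

# Toy heterogeneous design space: (number of layers, hidden size, operation type).
OPS = ["self-attention", "convolution", "lstm"]
DESIGN_SPACE = [
    (layers, hidden, op)
    for layers in (2, 4, 6)
    for hidden in (128, 256, 512)
    for op in OPS
]

def embed_architecture(arch):
    """Map an architecture to a fixed-length vector; stand-in for the
    graph-similarity-based embedding scheme described in the paper."""
    layers, hidden, op = arch
    return [layers / 6.0, hidden / 512.0, OPS.index(op) / (len(OPS) - 1)]

def evaluate_architecture(arch):
    """Stand-in for fine-tuning the model and measuring its GLUE score."""
    layers, hidden, op = arch
    return (0.5 + 0.03 * layers + 0.0002 * hidden
            - 0.05 * OPS.index(op) + random.gauss(0.0, 0.01))

class NearestNeighbourSurrogate:
    """Tiny surrogate: predicts the score of an unseen architecture from its
    nearest evaluated neighbour and uses that distance as a crude uncertainty."""
    def __init__(self):
        self.memory = []  # list of (embedding, observed score)

    def update(self, emb, score):
        self.memory.append((emb, score))

    def predict(self, emb):
        distance, score = min((math.dist(emb, e), s) for e, s in self.memory)
        return score, distance

def search(budget=10, exploration=0.5):
    surrogate = NearestNeighbourSurrogate()

    def acquisition(arch):
        # Upper-confidence-bound style trade-off between predicted score and
        # uncertainty, so unexplored regions of the space also get sampled.
        mean, uncertainty = surrogate.predict(embed_architecture(arch))
        return mean + exploration * uncertainty

    best_arch = random.choice(DESIGN_SPACE)
    best_score = evaluate_architecture(best_arch)
    surrogate.update(embed_architecture(best_arch), best_score)

    for _ in range(budget):
        candidate = max(DESIGN_SPACE, key=acquisition)
        score = evaluate_architecture(candidate)
        surrogate.update(embed_architecture(candidate), score)
        if score > best_score:
            best_arch, best_score = candidate, score
    return best_arch, best_score

if __name__ == "__main__":
    arch, score = search()
    print("best architecture:", arch, "estimated score:", round(score, 3))
```
In the actual BOSHNAS policy, the surrogate is a neural model trained with Bayesian uncertainty estimates and queried using second-order optimization, but the structure of the loop is the same: embed, predict with uncertainty, pick, train, update.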
Related papers
- A Collaborative Ensemble Framework for CTR Prediction [73.59868761656317]
We propose a novel framework, Collaborative Ensemble Training Network (CETNet), to leverage multiple distinct models.
Unlike naive model scaling, our approach emphasizes model diversity and cooperation via collaborative learning.
We validate our framework on three public datasets and a large-scale industrial dataset from Meta.
arXiv Detail & Related papers (2024-11-20T20:38:56Z)
- HM3: Hierarchical Multi-Objective Model Merging for Pretrained Models [28.993221775758702]
Model merging is a technique that combines multiple large pretrained models into a single model with enhanced performance and broader task adaptability.
This paper marks a significant advance toward more flexible and comprehensive model merging techniques.
We train policy and value networks using offline sampling of weight vectors, which are then employed for the online optimization of merging strategies.
arXiv Detail & Related papers (2024-09-27T16:31:31Z)
- Towards Robust and Efficient Cloud-Edge Elastic Model Adaptation via Selective Entropy Distillation [56.79064699832383]
We establish a Cloud-Edge Elastic Model Adaptation (CEMA) paradigm in which the edge models only need to perform forward propagation.
In our CEMA, to reduce the communication burden, we devise two criteria to exclude unnecessary samples from uploading to the cloud.
arXiv Detail & Related papers (2024-02-27T08:47:19Z)
- A Lightweight Feature Fusion Architecture For Resource-Constrained Crowd Counting [3.5066463427087777]
We introduce two lightweight models to enhance the versatility of crowd-counting models.
These models maintain the same downstream architecture while incorporating two distinct backbones: MobileNet and MobileViT.
We leverage Adjacent Feature Fusion to extract diverse scale features from a Pre-Trained Model (PTM) and subsequently combine these features seamlessly.
arXiv Detail & Related papers (2024-01-11T15:13:31Z)
- Fairer and More Accurate Tabular Models Through NAS [14.147928131445852]
We propose using multi-objective Neural Architecture Search (NAS) and Hyperparameter Optimization (HPO) in the first application of these techniques to the very challenging domain of tabular data.
We show that models optimized solely for accuracy with NAS often fail to inherently address fairness concerns.
We produce architectures that consistently dominate state-of-the-art bias mitigation methods in fairness, accuracy, or both.
arXiv Detail & Related papers (2023-10-18T17:56:24Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization that enables maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Slimmable Domain Adaptation [112.19652651687402]
We introduce a simple framework, Slimmable Domain Adaptation, to improve cross-domain generalization with a weight-sharing model bank.
Our framework surpasses competing approaches by a large margin on multiple benchmarks.
arXiv Detail & Related papers (2022-06-14T06:28:04Z)
- Data Summarization via Bilevel Optimization [48.89977988203108]
A simple yet powerful approach is to operate on small subsets of data.
In this work, we propose a generic coreset framework that formulates the coreset selection as a cardinality-constrained bilevel optimization problem.
arXiv Detail & Related papers (2021-09-26T09:08:38Z)
- Tiny Neural Models for Seq2Seq [0.0]
We propose a projection-based encoder-decoder model referred to as pQRNN-MAtt.
The resulting quantized models are less than 3.5MB in size and are well suited for on-device, latency-critical applications.
We show that on MTOP, a challenging multilingual semantic parsing dataset, the average model performance surpasses that of an LSTM-based seq2seq model that uses pre-trained embeddings, despite the models being 85x smaller.
arXiv Detail & Related papers (2021-08-07T00:39:42Z)
- Model Fusion via Optimal Transport [64.13185244219353]
We present a layer-wise model fusion algorithm for neural networks.
We show that this can successfully yield "one-shot" knowledge transfer between neural networks trained on heterogeneous non-i.i.d. data.
arXiv Detail & Related papers (2019-10-12T22:07:15Z)
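As a rough illustration of the layer-wise fusion idea in the entry above, the sketch below aligns the output neurons of two toy layers with a hard assignment (Hungarian matching via SciPy's linear_sum_assignment) and averages the aligned weights. The paper itself solves a soft optimal-transport problem and handles full networks layer by layer; this simplified, hypothetical fuse_layer only conveys the align-then-average structure.
```python
# Simplified, illustrative layer-wise fusion: hard neuron matching instead of
# the paper's soft optimal transport. fuse_layer is a hypothetical helper.
import numpy as np
from scipy.optimize import linear_sum_assignment

def fuse_layer(w_a, w_b):
    """Align the output neurons of layer B to those of layer A, then average.

    w_a, w_b: weight matrices of shape (out_features, in_features) whose input
    dimensions are assumed to be already aligned.
    """
    # Cost of matching neuron i of A with neuron j of B: negative similarity.
    cost = -w_a @ w_b.T
    rows, cols = linear_sum_assignment(cost)
    permutation = np.zeros_like(cost)
    permutation[rows, cols] = 1.0
    w_b_aligned = permutation @ w_b  # reorder B's neurons to match A's
    return 0.5 * (w_a + w_b_aligned), permutation

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_a = rng.normal(size=(4, 3))
    # Model B: a neuron-permuted copy of A plus small noise.
    w_b = w_a[rng.permutation(4)] + 0.01 * rng.normal(size=(4, 3))
    fused, matching = fuse_layer(w_a, w_b)
    print("recovered matching:", matching.argmax(axis=1))
    print("fused layer:\n", fused)
```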
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.