AutoTinyBERT: Automatic Hyper-parameter Optimization for Efficient
Pre-trained Language Models
- URL: http://arxiv.org/abs/2107.13686v1
- Date: Thu, 29 Jul 2021 00:47:30 GMT
- Title: AutoTinyBERT: Automatic Hyper-parameter Optimization for Efficient
Pre-trained Language Models
- Authors: Yichun Yin, Cheng Chen, Lifeng Shang, Xin Jiang, Xiao Chen, Qun Liu
- Abstract summary: We adopt one-shot Neural Architecture Search (NAS) to automatically search architecture hyper-parameters.
Specifically, we design the one-shot learning techniques and the search space to provide an adaptive and efficient way to develop tiny PLMs.
We name our method AutoTinyBERT and evaluate its effectiveness on the GLUE and SQuAD benchmarks.
- Score: 46.69439585453071
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models (PLMs) have achieved great success in natural
language processing. Most PLMs follow the default setting of architecture
hyper-parameters (e.g., the hidden dimension is a quarter of the intermediate
dimension in feed-forward sub-networks) in BERT (Devlin et al., 2019). Few
studies have been conducted to explore the design of architecture
hyper-parameters in BERT, especially for the more efficient PLMs with tiny
sizes, which are essential for practical deployment on resource-constrained
devices. In this paper, we adopt the one-shot Neural Architecture Search (NAS)
to automatically search architecture hyper-parameters. Specifically, we
carefully design the techniques of one-shot learning and the search space to
provide an adaptive and efficient way to develop tiny PLMs for various
latency constraints. We name our method AutoTinyBERT and evaluate its
effectiveness on the GLUE and SQuAD benchmarks. The extensive experiments show
that our method outperforms both the SOTA search-based baseline (NAS-BERT) and
the SOTA distillation-based methods (such as DistilBERT, TinyBERT, MiniLM and
MobileBERT). In addition, based on the obtained architectures, we propose a
more efficient development method that is even faster than the development of a
single PLM.
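The abstract notes that BERT fixes the hidden dimension at a quarter of the feed-forward intermediate dimension and that the paper instead searches over such architecture hyper-parameters. The sketch below illustrates that idea only in the simplest terms. It is not the AutoTinyBERT algorithm: the search space values, the parameter-count budget (used here as a stand-in for the latency constraints the abstract mentions), and all names such as `ArchConfig` and `sample_under_budget` are assumptions made for illustration, and plain random sampling replaces the one-shot NAS procedure.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchConfig:
    num_layers: int
    hidden_dim: int        # width of the token representations
    intermediate_dim: int  # inner width of the feed-forward sub-network


def encoder_params(cfg: ArchConfig) -> int:
    """Approximate weight count of the encoder layers (no embeddings, no biases)."""
    attention = 4 * cfg.hidden_dim * cfg.hidden_dim            # Q, K, V and output projections
    feed_forward = 2 * cfg.hidden_dim * cfg.intermediate_dim   # up- and down-projection
    return cfg.num_layers * (attention + feed_forward)


# Hypothetical search space, chosen for illustration only.
SEARCH_SPACE = {
    "num_layers": [3, 4, 5, 6],
    "hidden_dim": [128, 256, 384, 512],
    "intermediate_dim": [128, 256, 512, 1024, 2048],
}


def sample_config(rng: random.Random) -> ArchConfig:
    """Draw one candidate by picking each hyper-parameter independently."""
    return ArchConfig(**{name: rng.choice(choices) for name, choices in SEARCH_SPACE.items()})


def sample_under_budget(max_params: int, num_trials: int = 1000, seed: int = 0) -> list:
    """Keep only the sampled candidates whose encoder fits the parameter budget."""
    rng = random.Random(seed)
    candidates = [sample_config(rng) for _ in range(num_trials)]
    return [c for c in candidates if encoder_params(c) <= max_params]


# The default BERT-base setting: hidden dimension is a quarter of the intermediate dimension.
bert_base = ArchConfig(num_layers=12, hidden_dim=768, intermediate_dim=3072)

if __name__ == "__main__":
    print(bert_base.hidden_dim / bert_base.intermediate_dim)
    feasible = sample_under_budget(max_params=5_000_000)
    print(len(feasible), "candidates fit the budget")
```

A real search would score each feasible candidate (for example with weights inherited from a shared super-network) rather than stop at the feasibility filter shown here.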
Related papers
- Structural Pruning of Pre-trained Language Models via Neural Architecture Search [7.833790713816726]
Pre-trained language models (PLMs) mark the state of the art for natural language understanding tasks when fine-tuned on labeled data.
This paper explores neural architecture search (NAS) for structural pruning to find sub-parts of the fine-tuned network that optimally trade off efficiency.
arXiv Detail & Related papers (2024-05-03T17:34:57Z)
- Fairer and More Accurate Tabular Models Through NAS [14.147928131445852]
We propose using multi-objective Neural Architecture Search (NAS) and Hyperparameter Optimization (HPO) in the first application to the very challenging domain of tabular data.
We show that models optimized solely for accuracy with NAS often fail to inherently address fairness concerns.
We produce architectures that consistently dominate state-of-the-art bias mitigation methods either in fairness, accuracy or both.
arXiv Detail & Related papers (2023-10-18T17:56:24Z)
- Neural Architecture Search for Parameter-Efficient Fine-tuning of Large Pre-trained Language Models [25.33932250843436]
We propose an efficient NAS method for learning PET architectures via structured and unstructured pruning.
We present experiments on GLUE demonstrating the effectiveness of our algorithm and discuss how PET architectural design choices affect performance in practice.
arXiv Detail & Related papers (2023-05-26T03:01:07Z)
- Parameter-efficient Tuning of Large-scale Multimodal Foundation Model [68.24510810095802]
We propose A graceful prompt framework for cross-modal transfer (Aurora) to overcome these challenges.
Considering the redundancy in existing architectures, we first utilize the mode approximation to generate 0.1M trainable parameters to implement the multimodal prompt tuning.
A thorough evaluation on six cross-modal benchmarks shows that it not only outperforms the state-of-the-art but even outperforms the full fine-tuning approach.
arXiv Detail & Related papers (2023-05-15T06:40:56Z)
- Multi-Agent Reinforcement Learning for Microprocessor Design Space Exploration [71.95914457415624]
Microprocessor architects are increasingly resorting to domain-specific customization in the quest for high-performance and energy-efficiency.
We propose an alternative formulation that leverages Multi-Agent RL (MARL) to tackle this problem.
Our evaluation shows that the MARL formulation consistently outperforms single-agent RL baselines.
arXiv Detail & Related papers (2022-11-29T17:10:24Z)
- Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x speed-up in inference speed while retaining comparable performance.
arXiv Detail & Related papers (2021-09-09T12:32:28Z)
- Deploying a BERT-based Query-Title Relevance Classifier in a Production System: a View from the Trenches [3.1219977244201056]
The Bidirectional Encoder Representations from Transformers (BERT) model has been radically improving the performance of many Natural Language Processing (NLP) tasks.
It is challenging to scale BERT for low-latency and high-throughput industrial use cases due to its enormous size.
We successfully optimize a Query-Title Relevance (QTR) classifier for deployment via a compact model, which we name BERT Bidirectional Long Short-Term Memory (BertBiLSTM).
BertBiLSTM exceeds the off-the-shelf BERT model's performance in terms of accuracy and efficiency for the aforementioned real-world production task.
arXiv Detail & Related papers (2021-08-23T14:28:23Z)
- AutoBERT-Zero: Evolving BERT Backbone from Scratch [94.89102524181986]
We propose an Operation-Priority Neural Architecture Search (OP-NAS) algorithm to automatically search for promising hybrid backbone architectures.
We optimize both the search algorithm and evaluation of candidate models to boost the efficiency of our proposed OP-NAS.
Experiments show that the searched architecture (named AutoBERT-Zero) significantly outperforms BERT and its variants of different model capacities in various downstream tasks.
arXiv Detail & Related papers (2021-07-15T16:46:01Z)
- You Only Compress Once: Towards Effective and Elastic BERT Compression via Exploit-Explore Stochastic Nature Gradient [88.58536093633167]
Existing model compression approaches require re-compression or fine-tuning across diverse constraints to accommodate various hardware deployments.
We propose a novel approach, YOCO-BERT, to achieve compress once and deploy everywhere.
Compared with state-of-the-art algorithms, YOCO-BERT provides more compact models while achieving a 2.1%-4.5% average accuracy improvement on the GLUE benchmark.
arXiv Detail & Related papers (2021-06-04T12:17:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.