Impossible Triangle: What's Next for Pre-trained Language Models?
- URL: http://arxiv.org/abs/2204.06130v1
- Date: Wed, 13 Apr 2022 01:28:18 GMT
- Title: Impossible Triangle: What's Next for Pre-trained Language Models?
- Authors: Chenguang Zhu, Michael Zeng
- Abstract summary: We argue that every existing PLM lacks one or more properties of the Impossible Triangle.
We then offer insights into future research directions for PLMs to achieve the Impossible Triangle.
- Score: 53.99691912972306
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent developments in large-scale pre-trained language models (PLMs) have significantly improved model capabilities on various NLP tasks, both in terms of performance after task-specific fine-tuning and in zero-shot / few-shot learning. However, many such models are so dauntingly large that few institutions can afford to pre-train, fine-tune, or even deploy them, while moderate-sized models usually lack strong generalized few-shot learning capabilities. In this paper, we first lay out the current obstacles to using PLMs in terms of the Impossible Triangle: 1) moderate model size, 2) state-of-the-art few-shot learning capability, and 3) state-of-the-art fine-tuning capability. We argue that every existing PLM lacks one or more properties of the Impossible Triangle. To supply these missing properties, various techniques have been proposed, such as knowledge distillation, data augmentation, and prompt learning, which inevitably add extra work when applying PLMs in real scenarios. We then offer insights into future research directions for PLMs to achieve the Impossible Triangle, and break the task down into several key phases.
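Of the remediation techniques the abstract names, knowledge distillation is the most mechanical, so a minimal sketch may help: the snippet below shows a standard soft-label distillation loss for training a small student against a large teacher. It is an illustrative PyTorch sketch, not the method of this paper; the temperature, mixing weight, and tensor shapes are assumptions.

```python
# Minimal knowledge-distillation sketch (illustrative; not taken from any paper listed here).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-label KL distillation with ordinary cross-entropy supervision."""
    # Soft targets from the (frozen) teacher, softened by temperature T.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    # Hard-label supervision keeps the student grounded in the task.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Hypothetical usage: logits of shape [batch, num_classes], integer labels of shape [batch].
student_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```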
Related papers
- LLAVADI: What Matters For Multimodal Large Language Models Distillation [77.73964744238519]
In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch.
Our studies involve training strategies, model choices, and distillation algorithms in the knowledge distillation process.
With the right training strategy, evaluated across different benchmarks, even a 2.7B small-scale model can perform on par with larger 7B or 13B models.
arXiv Detail & Related papers (2024-07-28T06:10:47Z)
- LLM Augmented LLMs: Expanding Capabilities through Composition [56.40953749310957]
CALM -- Composition to Augment Language Models -- introduces cross-attention between models to compose their representations and enable new capabilities.
We illustrate that augmenting PaLM2-S with a smaller model trained on low-resource languages results in an absolute improvement of up to 13% on tasks like translation into English.
When PaLM2-S is augmented with a code-specific model, we see a relative improvement of 40% over the base model for code generation and explanation tasks.
arXiv Detail & Related papers (2024-01-04T18:53:01Z)
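A rough sketch of the cross-attention composition described in the CALM entry above: a small block projects the augmenting model's hidden states into the anchor model's width and lets the anchor attend to them. This is an illustrative PyTorch sketch under assumed hidden sizes, not CALM's actual architecture.

```python
# Toy cross-attention composition in the spirit of CALM (illustrative only;
# hidden sizes and the single composition layer are assumptions).
import torch.nn as nn

class CrossModelAttention(nn.Module):
    def __init__(self, anchor_dim=4096, aug_dim=1024, n_heads=8):
        super().__init__()
        # Project the augmenting model's states into the anchor model's width.
        self.proj = nn.Linear(aug_dim, anchor_dim)
        self.attn = nn.MultiheadAttention(anchor_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(anchor_dim)

    def forward(self, anchor_h, aug_h):
        # anchor_h: [batch, seq_a, anchor_dim]; aug_h: [batch, seq_b, aug_dim]
        aug = self.proj(aug_h)
        attended, _ = self.attn(query=anchor_h, key=aug, value=aug)
        # Residual composition: the anchor representation is augmented, not replaced.
        return self.norm(anchor_h + attended)
```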
- Make a Donut: Hierarchical EMD-Space Planning for Zero-Shot Deformable Manipulation with Tools [14.069149456110676]
We introduce a demonstration-free hierarchical planning approach capable of tackling intricate long-horizon tasks.
We employ large language models (LLMs) to articulate a high-level, stage-by-stage plan corresponding to a specified task.
We further substantiate our approach with experimental trials on real-world robotic platforms.
arXiv Detail & Related papers (2023-11-05T22:43:29Z)
- Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning [52.29522018586365]
We study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models.
Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains.
arXiv Detail & Related papers (2023-10-10T15:13:30Z)
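Dynamic batch loading, as described in the Sheared LLaMA entry above, reweights how much of each domain is sampled per batch according to how far each domain's loss lags a reference. The sketch below is illustrative; the exponential update rule, step size, and example numbers are assumptions rather than the paper's exact algorithm.

```python
# Illustrative sketch of dynamic batch loading (not the paper's exact rule).
import numpy as np

def update_domain_weights(weights, current_losses, reference_losses, step_size=1.0):
    """Upweight domains whose loss lags furthest behind its reference."""
    gap = np.maximum(current_losses - reference_losses, 0.0)  # excess loss per domain
    new_w = weights * np.exp(step_size * gap)                 # boost lagging domains
    return new_w / new_w.sum()                                # renormalize to a distribution

# Hypothetical example with three pre-training domains.
weights = np.array([0.5, 0.3, 0.2])
current = np.array([2.1, 1.4, 1.9])
reference = np.array([1.8, 1.5, 1.7])
weights = update_domain_weights(weights, current, reference)
# Batches would then be sampled with these updated domain proportions.
```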
- Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models [125.91897197446379]
We find that MoE models benefit more from instruction tuning than dense models.
Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks.
arXiv Detail & Related papers (2023-05-24T04:22:26Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, augmenting Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
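To make the eP-ALM parameter-efficiency claim above concrete: freeze the language model, train only a linear projection from perceptual features into the LM embedding space, and prepend a single trainable token. The sketch below is illustrative; the dimensions, the wiring, and the assumption that the frozen LM accepts precomputed input embeddings are not taken from the released code.

```python
# Rough sketch of eP-ALM-style perceptual augmentation (illustrative only;
# dimensions and wiring are assumptions, and the LM is assumed to accept
# precomputed input embeddings, e.g. a HuggingFace-style `inputs_embeds` argument).
import torch
import torch.nn as nn

class PerceptualPrefix(nn.Module):
    def __init__(self, lm, vision_dim=768, lm_dim=2048):
        super().__init__()
        self.lm = lm
        for p in self.lm.parameters():  # freeze the language model (>99% of parameters)
            p.requires_grad = False
        self.proj = nn.Linear(vision_dim, lm_dim)                   # the only trained layer
        self.soft_token = nn.Parameter(torch.zeros(1, 1, lm_dim))   # one trainable token

    def forward(self, vision_feats, text_embeds):
        # vision_feats: [batch, n_patches, vision_dim]; text_embeds: [batch, seq, lm_dim]
        vis = self.proj(vision_feats)
        tok = self.soft_token.expand(text_embeds.size(0), -1, -1)
        inputs = torch.cat([tok, vis, text_embeds], dim=1)
        # The frozen LM consumes the concatenated embedding sequence.
        return self.lm(inputs_embeds=inputs)
```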
- WeLM: A Well-Read Pre-trained Language Model for Chinese [37.68378062625651]
We present WeLM: a well-read pre-trained language model for Chinese.
We show that WeLM is equipped with broad knowledge on various domains and languages.
arXiv Detail & Related papers (2022-09-21T14:05:30Z)
- Mengzi: Towards Lightweight yet Ingenious Pre-trained Models for Chinese [33.83704598544326]
Mengzi stands for a family of discriminative, generative, domain-specific, and multimodal pre-trained model variants.
Compared with public Chinese PLMs, Mengzi is simple but more powerful.
Our lightweight model has achieved new state-of-the-art results on the widely-used CLUE benchmark.
arXiv Detail & Related papers (2021-10-13T13:14:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the listed information and is not responsible for any consequences of its use.