HyPe: Better Pre-trained Language Model Fine-tuning with Hidden
Representation Perturbation
- URL: http://arxiv.org/abs/2212.08853v2
- Date: Thu, 11 May 2023 07:10:32 GMT
- Title: HyPe: Better Pre-trained Language Model Fine-tuning with Hidden
Representation Perturbation
- Authors: Hongyi Yuan, Zheng Yuan, Chuanqi Tan, Fei Huang, Songfang Huang
- Abstract summary: We propose HyPe, a simple yet effective fine-tuning technique that alleviates fine-tuning problems such as over-fitting and representation collapse by perturbing the hidden representations of Transformer layers.
We conduct extensive experiments and analyses on GLUE and other natural language inference datasets.
Results demonstrate that HyPe outperforms vanilla fine-tuning and enhances the generalization of hidden representations from different layers.
- Score: 50.90457644954857
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language models built on the Transformer architecture have shown great
performance in natural language processing. However, problems such as
over-fitting and representation collapse still arise when fine-tuning
pre-trained language models (PLMs) on downstream tasks. In this work, we propose
HyPe, a simple yet effective fine-tuning technique that alleviates such problems
by perturbing the hidden representations of Transformer layers. Unlike previous
works that only add noise to inputs or parameters, we argue that the hidden
representations of Transformer layers convey more diverse and meaningful
language information. Therefore, making Transformer layers more robust to
hidden-representation perturbations can further benefit the fine-tuning of PLMs
as a whole. We conduct extensive experiments and analyses on GLUE and other
natural language inference datasets. Results demonstrate that HyPe outperforms
vanilla fine-tuning and enhances the generalization of hidden representations
from different layers. In addition, HyPe incurs negligible computational
overhead, and it both outperforms and is compatible with previous
state-of-the-art fine-tuning techniques.
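The abstract describes the technique only at a high level. As a rough illustration of the core idea, adding small noise to the hidden states passed between Transformer layers during fine-tuning, a minimal PyTorch-style sketch could look like the following; the hook-based approach, the bert-base-uncased backbone, the uniform noise, and the eps value are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (assumed, not the authors' released code): perturb the hidden
# states exchanged between Transformer layers during fine-tuning by adding
# small uniform noise through forward hooks.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # assumed backbone for illustration
eps = 1e-5  # assumed noise scale; in practice this would be tuned per task


def perturb_hidden(module, inputs, output):
    # Each BERT encoder layer returns a tuple whose first element is the
    # hidden-state tensor of shape (batch, seq_len, hidden_size).
    if not module.training:
        return output  # perturb only while fine-tuning, never at evaluation
    hidden = output[0]
    noise = torch.empty_like(hidden).uniform_(-eps, eps)  # uniform-noise variant
    return (hidden + noise,) + output[1:]


# Register one hook per encoder layer so every layer-to-layer hidden
# representation is perturbed during the fine-tuning forward pass.
handles = [layer.register_forward_hook(perturb_hidden)
           for layer in model.bert.encoder.layer]

# ... run an ordinary fine-tuning loop here (optimizer step, loss.backward(), ...) ...

# Detach the hooks afterwards so inference runs on clean hidden states.
for h in handles:
    h.remove()
```

A Gaussian-noise variant would simply swap the sampling line; the point of the sketch is only that the noise is injected below the task head, at the hidden representations, rather than at the inputs or the parameters.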
Related papers
- On the Effect of Pre-training for Transformer in Different Modality on
Offline Reinforcement Learning [0.0]
We investigate how pre-training on data of different modalities, such as language and vision, affects fine-tuning of Transformer-based models on MuJoCo offline reinforcement learning tasks.
arXiv Detail & Related papers (2022-11-17T13:34:08Z)
- An Exploration of Prompt Tuning on Generative Spoken Language Model for
Speech Processing Tasks [112.1942546460814]
We report the first exploration of the prompt tuning paradigm for speech processing tasks based on the Generative Spoken Language Model (GSLM).
Experiment results show that the prompt tuning technique achieves competitive performance in speech classification tasks with fewer trainable parameters than fine-tuning specialized downstream models.
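Since the summary mentions prompt tuning only by name, here is a generic sketch of the paradigm: a frozen backbone plus a small set of trainable prompt embeddings prepended to the input. It uses a text model (bert-base-uncased) as a stand-in and an assumed prompt length and head; it is not the GSLM speech pipeline from the paper.

```python
# Generic prompt-tuning sketch (assumptions: text backbone as a stand-in for
# GSLM, prompt length 20, a toy 2-way classification head). Only the prompt
# and the head are trained; the pre-trained backbone stays frozen.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

backbone = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
for p in backbone.parameters():
    p.requires_grad = False  # freeze all pre-trained weights

prompt_len, hidden = 20, backbone.config.hidden_size
prompt = nn.Parameter(torch.randn(prompt_len, hidden) * 0.02)  # trainable prompt
head = nn.Linear(hidden, 2)                                    # trainable task head


def forward(texts):
    enc = tokenizer(texts, return_tensors="pt", padding=True)
    tok_emb = backbone.embeddings.word_embeddings(enc["input_ids"])
    batch = tok_emb.size(0)
    # Prepend the same trainable prompt vectors to every example in the batch.
    inputs = torch.cat([prompt.unsqueeze(0).expand(batch, -1, -1), tok_emb], dim=1)
    mask = torch.cat([torch.ones(batch, prompt_len, dtype=enc["attention_mask"].dtype),
                      enc["attention_mask"]], dim=1)
    out = backbone(inputs_embeds=inputs, attention_mask=mask).last_hidden_state
    return head(out[:, 0])  # classify from the first (prompt) position


logits = forward(["an example utterance transcript"])  # gradients reach only prompt and head
```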
arXiv Detail & Related papers (2022-03-31T03:26:55Z)
- Pretrained Transformers as Universal Computation Engines [105.00539596788127]
We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning.
We study finetuning it on a variety of sequence classification tasks spanning numerical computation, vision, and protein fold prediction.
We find that such pretraining enables the resulting model, the Frozen Pretrained Transformer (FPT), to generalize zero-shot to these modalities, matching the performance of a transformer fully trained on these tasks.
arXiv Detail & Related papers (2021-03-09T06:39:56Z)
- Adapting Pretrained Transformer to Lattices for Spoken Language
Understanding [39.50831917042577]
It is shown that encoding lattices, as opposed to the 1-best results generated by an automatic speech recognizer (ASR), boosts the performance of spoken language understanding (SLU).
This paper aims at adapting pretrained transformers to lattice inputs in order to perform understanding tasks specifically for spoken language.
arXiv Detail & Related papers (2020-11-02T07:14:34Z)
- Rethinking embedding coupling in pre-trained language models [46.11201932668366]
We re-evaluate the standard practice of sharing weights between input and output embeddings in pre-trained language models.
We show that decoupled embeddings provide increased modeling flexibility, allowing us to significantly improve the efficiency of parameter allocation.
We are able to train models that achieve strong performance on the XTREME benchmark without increasing the number of parameters at the fine-tuning stage.
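As a small illustration of what "coupling" means here, the sketch below contrasts tied and decoupled input/output embeddings in PyTorch; the vocabulary size, widths, and projection layer are illustrative assumptions rather than the paper's configuration.

```python
# Toy contrast between tied ("coupled") and decoupled input/output embeddings.
# Vocabulary and width values are illustrative assumptions, not the paper's settings.
import torch.nn as nn

vocab_size, d_model = 30000, 512

# Coupled (standard practice): one matrix both embeds input tokens and, reused
# by the output projection, produces the vocabulary logits.
tied_embed = nn.Embedding(vocab_size, d_model)
tied_out = nn.Linear(d_model, vocab_size, bias=False)
tied_out.weight = tied_embed.weight  # weight tying: a single shared parameter

# Decoupled: input and output embeddings are separate parameters, so their
# widths can be chosen independently (e.g., a narrow input embedding projected
# up to the model width), and the output matrix can be dropped for downstream
# fine-tuning, where no language-modeling head is needed.
d_input = 128  # assumed narrower input-embedding width
input_embed = nn.Embedding(vocab_size, d_input)
input_proj = nn.Linear(d_input, d_model, bias=False)      # lift embeddings to model width
output_embed = nn.Linear(d_model, vocab_size, bias=False)  # independent output matrix
```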
arXiv Detail & Related papers (2020-10-24T07:43:00Z)
- Is Supervised Syntactic Parsing Beneficial for Language Understanding?
An Empirical Investigation [71.70562795158625]
Traditional NLP has long held (supervised) syntactic parsing to be necessary for successful higher-level semantic language understanding (LU).
The recent advent of end-to-end neural models, self-supervised via language modeling (LM), and their success on a wide range of LU tasks call this belief into question.
We empirically investigate the usefulness of supervised parsing for semantic LU in the context of LM-pretrained transformer networks.
arXiv Detail & Related papers (2020-08-15T21:03:36Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
- What Happens To BERT Embeddings During Fine-tuning? [19.016185902256826]
We investigate how fine-tuning affects the representations of the BERT model.
We find that fine-tuning primarily affects the top layers of BERT, but with noteworthy variation across tasks.
In particular, dependency parsing reconfigures most of the model, whereas SQuAD and MNLI appear to involve much shallower processing.
arXiv Detail & Related papers (2020-04-29T19:46:26Z)
- Addressing Some Limitations of Transformers with Feedback Memory [51.94640029417114]
Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks.
We propose the Feedback Transformer architecture that exposes all previous representations to all future representations.
We demonstrate on a variety of benchmarks in language modeling, machine translation, and reinforcement learning that the increased representation capacity can create small, shallow models with much stronger performance than comparable Transformers.
arXiv Detail & Related papers (2020-02-21T16:37:57Z)