Reconsidering the Past: Optimizing Hidden States in Language Models
- URL: http://arxiv.org/abs/2112.08653v1
- Date: Thu, 16 Dec 2021 06:14:37 GMT
- Title: Reconsidering the Past: Optimizing Hidden States in Language Models
- Authors: Davis Yoshida and Kevin Gimpel
- Abstract summary: We present Hidden-State Optimization (HSO), a gradient-based method for improving the performance of transformer language models.
HSO computes the gradient of the log-probability the language model assigns to an evaluation text, but uses it to update the cached hidden states rather than the model parameters.
- Score: 35.7524942657169
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Hidden-State Optimization (HSO), a gradient-based method for
improving the performance of transformer language models at inference time.
Similar to dynamic evaluation (Krause et al., 2018), HSO computes the gradient
of the log-probability the language model assigns to an evaluation text, but
uses it to update the cached hidden states rather than the model parameters. We
test HSO with pretrained Transformer-XL and GPT-2 language models, finding
improvement on the WikiText103 and PG-19 datasets in terms of perplexity,
especially when evaluating a model outside of its training distribution. We
also demonstrate downstream applicability by showing gains in the recently
developed prompt-based few-shot evaluation setting, again with no extra
parameters or training data.
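The core mechanism in the abstract (compute the gradient of the log-probability of observed text, but apply it to cached hidden states rather than model weights) can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single cached state vector `h`, a fixed softmax readout matrix `W`, and an illustrative learning rate, standing in for a full transformer's hidden-state cache.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # shift for numerical stability
    s = sum(exps)
    return [e / s for e in exps]

def hso_step(h, W, target, lr=0.02):
    """One toy Hidden-State Optimization step: nudge the cached hidden
    state h so the softmax readout W assigns the observed token `target`
    higher log-probability. The parameters W stay fixed."""
    logits = [sum(wi * hi for wi, hi in zip(row, h)) for row in W]
    probs = softmax(logits)
    # d/dh log p(target) = W[target] - sum_v p_v * W[v]
    expected = [sum(p * row[j] for p, row in zip(probs, W)) for j in range(len(h))]
    grad = [wt - e for wt, e in zip(W[target], expected)]
    return [hi + lr * gi for hi, gi in zip(h, grad)]

random.seed(0)
V, d = 20, 6  # toy vocabulary size and hidden dimension
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(V)]
h = [random.gauss(0, 1) for _ in range(d)]

def log_prob(h, target=3):
    logits = [sum(wi * hi for wi, hi in zip(row, h)) for row in W]
    return math.log(softmax(logits)[target])

before = log_prob(h)
h = hso_step(h, W, target=3)
after = log_prob(h)
# after > before: the cached state now favors the observed token
```

In the paper's setting the same idea is applied to the key/value cache of Transformer-XL or GPT-2 via backpropagation; the sketch above only shows the gradient-ascent-on-states step in isolation.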
Related papers
- Style Locality for Controllable Generation with kNN Language Models [11.4179290793997]
Nearest neighbor language models retrieve similar contexts to assist in word prediction.
The addition of locality levels allows a model to learn how to weight neighbors based on their relative location to the current text in source documents.
We show that our model is successfully able to control style and provides a better fluency-style trade-off than previous work.
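The retrieval idea in this blurb can be sketched as a standard kNN-LM interpolation: weight the next tokens of the nearest stored contexts and mix them with the base LM's distribution. This is a generic sketch, not the paper's locality-weighted variant; the datastore layout, distance metric, and interpolation weight `lam` are illustrative assumptions.

```python
import math

def knn_lm(query, datastore, p_lm, k=2, lam=0.5, temp=1.0):
    """Toy kNN-LM: interpolate the base LM's next-token distribution
    p_lm with a distribution built from the next tokens of the k
    stored (context vector, next token) pairs closest to `query`."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # k nearest stored contexts, weighted by distance to the query
    nearest = sorted(datastore, key=lambda kv: dist(kv[0], query))[:k]
    weights = [math.exp(-dist(ctx, query) / temp) for ctx, _ in nearest]
    z = sum(weights)
    p_knn = [0.0] * len(p_lm)
    for w, (_, tok) in zip(weights, nearest):
        p_knn[tok] += w / z
    return [lam * pk + (1 - lam) * pl for pk, pl in zip(p_knn, p_lm)]

# Vocabulary of 4 tokens; the two contexts near the query were both
# followed by token 2, so its probability should rise.
datastore = [([0.0, 1.0], 2), ([0.1, 0.9], 2), ([5.0, 5.0], 0)]
p_lm = [0.25, 0.25, 0.25, 0.25]
p = knn_lm([0.0, 1.0], datastore, p_lm, k=2, lam=0.5)
# p[2] = 0.5 * 1.0 + 0.5 * 0.25 = 0.625
```

The paper's contribution on top of this scheme is letting the model learn neighbor weights from each neighbor's location in its source document, rather than using distance alone.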
arXiv Detail & Related papers (2023-11-01T12:21:53Z) - The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z) - Learning Evaluation Models from Large Language Models for Sequence Generation [44.22820310679188]
Large language models achieve state-of-the-art performance on sequence generation evaluation, but typically have a large number of parameters.
We propose ECT, an Evaluation Capability Transfer method, to transfer the evaluation capability from LLMs to relatively lightweight language models.
Based on the proposed ECT, we learn various evaluation models from ChatGPT, and employ them as reward models to improve sequence generation models.
arXiv Detail & Related papers (2023-08-08T16:41:16Z) - Meta-Learning Fast Weight Language Models [105.66999854213724]
We present Fast Weight Layers (FWLs), a neural component that provides the benefits of dynamic evaluation much more efficiently.
FWLs can be applied at training time so the model learns to make good use of gradient updates.
arXiv Detail & Related papers (2022-12-05T18:37:09Z) - A Generative Language Model for Few-shot Aspect-Based Sentiment Analysis [90.24921443175514]
We focus on aspect-based sentiment analysis, which involves extracting aspect terms and categories and predicting their corresponding polarities.
We propose to reformulate the extraction and prediction tasks into the sequence generation task, using a generative language model with unidirectional attention.
Our approach outperforms the previous state-of-the-art (based on BERT) in average performance by a large margin in both few-shot and full-shot settings.
arXiv Detail & Related papers (2022-04-11T18:31:53Z) - Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x speed-up in inference speed while retaining comparable performance.
arXiv Detail & Related papers (2021-09-09T12:32:28Z) - Understanding and Improving Lexical Choice in Non-Autoregressive Translation [98.11249019844281]
We propose to expose the raw data to NAT models to restore the useful information of low-frequency words.
Our approach pushes the SOTA NAT performance on the WMT14 English-German and WMT16 Romanian-English datasets up to 27.8 and 33.8 BLEU points, respectively.
arXiv Detail & Related papers (2020-12-29T03:18:50Z) - Fine-tuning BERT for Low-Resource Natural Language Understanding via Active Learning [30.5853328612593]
In this work, we explore fine-tuning methods of BERT -- a pre-trained Transformer-based language model.
Our experimental results show an advantage in model performance by maximizing the approximate knowledge gain of the model.
We analyze the benefits of freezing layers of the language model during fine-tuning to reduce the number of trainable parameters.
arXiv Detail & Related papers (2020-12-04T08:34:39Z) - Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z) - Learning to Compare for Better Training and Evaluation of Open Domain
Natural Language Generation Models [23.62054164511058]
We propose to evaluate natural language generation models by learning to compare a pair of generated sentences by fine-tuning BERT.
While it can be trained in a fully self-supervised fashion, our model can be further fine-tuned with a small amount of human preference annotation.
arXiv Detail & Related papers (2020-02-12T15:52:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.