Transformer Language Models Handle Word Frequency in Prediction Head
- URL: http://arxiv.org/abs/2305.18294v1
- Date: Mon, 29 May 2023 17:59:15 GMT
- Title: Transformer Language Models Handle Word Frequency in Prediction Head
- Authors: Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, Kentaro Inui
- Abstract summary: This study investigates the inner workings of the prediction head, specifically focusing on bias parameters.
Our experiments with BERT and GPT-2 models reveal that the biases in their word prediction heads play a significant role in the models' ability to reflect word frequency in a corpus.
- Score: 31.145866381881625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prediction head is a crucial component of Transformer language models.
Despite its direct impact on prediction, this component has often been
overlooked in analyzing Transformers. In this study, we investigate the inner
workings of the prediction head, specifically focusing on bias parameters. Our
experiments with BERT and GPT-2 models reveal that the biases in their word
prediction heads play a significant role in the models' ability to reflect word
frequency in a corpus, aligning with the logit adjustment method commonly used
in long-tailed learning. We also quantify the effect of controlling the biases
in practical auto-regressive text generation scenarios; under a particular
setting, more diverse text can be generated without compromising text quality.
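The mechanism described in the abstract can be illustrated with a few lines of code. The sketch below is not the authors' implementation; the vocabulary, frequencies, and dimensions are made up, but it shows how a frequency-shaped bias in the prediction head acts like logit adjustment, and why removing it flattens the frequency prior.
```python
# Minimal sketch (assumed setup, not the paper's code): the word-prediction
# head computes logits = E @ h + b, and the finding is that b tracks log
# unigram frequency, mirroring logit adjustment in long-tailed learning.
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "of", "model", "transformer", "quixotic"]   # toy vocabulary
freq = np.array([6e-2, 3e-2, 1e-3, 5e-4, 1e-6])             # toy corpus frequencies

d = 8                                   # hidden size (illustrative)
E = rng.normal(size=(len(vocab), d))    # output embedding matrix
h = rng.normal(size=d)                  # contextual hidden state from the LM body
b = np.log(freq)                        # bias ~ log frequency (the paper's observation)

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

p_with_bias = softmax(E @ h + b)        # ordinary prediction head
p_no_bias   = softmax(E @ h)            # bias removed: frequency effect controlled

for w, p1, p2 in zip(vocab, p_with_bias, p_no_bias):
    print(f"{w:12s}  with bias {p1:.4f}   without bias {p2:.4f}")
# Removing (or scaling) the bias flattens the frequency prior, which is the
# mechanism behind the more diverse generation reported in the abstract.
```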
Related papers
- Linear Recency Bias During Training Improves Transformers' Fit to Reading Times [16.55240473621401]
This paper evaluates a modification of the Transformer model that uses ALiBi, a recency bias added to attention scores.
ALiBi's mixture of slopes -- which determine the rate of memory decay in each attention head -- may play a role in helping models with ALiBi to track different kinds of linguistic dependencies.
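For reference, here is a minimal sketch of the ALiBi mechanism itself, with illustrative shapes; the reading-time evaluation in this paper builds on exactly this head-specific linear penalty added to attention scores.
```python
# Hedged sketch of the ALiBi idea: a fixed, head-specific linear penalty on
# attention scores that grows with key-query distance (toy sizes).
import numpy as np

def alibi_slopes(n_heads):
    # Standard ALiBi choice: a geometric sequence of slopes, one per head.
    return np.array([2 ** (-8.0 * (k + 1) / n_heads) for k in range(n_heads)])

def alibi_bias(seq_len, n_heads):
    # bias[h, i, j] = -slope[h] * (i - j) for j <= i; -inf above the diagonal.
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    dist = (i - j).astype(float)
    bias = -alibi_slopes(n_heads)[:, None, None] * dist
    bias[:, dist < 0] = -np.inf          # mask future positions (causal)
    return bias

# Added to the raw attention scores before the softmax:
scores = np.zeros((4, 6, 6))             # (heads, queries, keys), illustrative
scores_with_recency = scores + alibi_bias(6, 4)
```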
arXiv Detail & Related papers (2024-09-17T14:57:51Z) - Humans and language models diverge when predicting repeating text [52.03471802608112]
We present a scenario in which the performance of humans and LMs diverges.
Human and GPT-2 LM predictions are strongly aligned in the first presentation of a text span, but their performance quickly diverges when memory begins to play a role.
We hope that this scenario will spur future work in bringing LMs closer to human behavior.
arXiv Detail & Related papers (2023-10-10T08:24:28Z) - Unlocking Bias Detection: Leveraging Transformer-Based Models for Content Analysis [1.8692054990918079]
We present the Contextualized Bi-Directional Dual Transformer (CBDT) classifier.
We have prepared a dataset specifically for training these models to identify and locate biases in texts.
Our evaluations across various datasets demonstrate CBDT's effectiveness in distinguishing biased narratives from neutral ones and identifying specific biased terms.
arXiv Detail & Related papers (2023-09-30T12:06:04Z) - Rationalizing Predictions by Adversarial Information Calibration [65.19407304154177]
We train two models jointly: one is a typical neural model that solves the task at hand in an accurate but black-box manner, and the other is a selector-predictor model that additionally produces a rationale for its prediction.
We use an adversarial technique to calibrate the information extracted by the two models such that the difference between them is an indicator of the missed or over-selected features.
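A toy sketch of the selector-predictor setup, under stated assumptions: the dimensions are arbitrary, pooling is simple averaging, and a KL divergence stands in for the paper's adversarial critic as the calibration signal.
```python
# Assumed sketch, not the authors' architecture: a black-box classifier sees
# the full input, a selector picks a soft rationale mask, a second classifier
# sees only the selected tokens, and the divergence between their outputs
# measures how much predictive information the rationale misses.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d, n_classes, seq_len = 100, 32, 2, 10
emb = nn.Embedding(vocab_size, d)

blackbox  = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, n_classes))
predictor = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, n_classes))
selector  = nn.Linear(d, 1)                       # scores each token for the rationale

tokens = torch.randint(0, vocab_size, (4, seq_len))   # toy batch
x = emb(tokens)                                       # (batch, seq, d)

mask = torch.sigmoid(selector(x))                     # soft token-selection mask
full_logits      = blackbox(x.mean(dim=1))            # black-box model: full input
rationale_logits = predictor((x * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6))

# Calibration term: KL here is a stand-in for the paper's adversarial critic.
calibration = F.kl_div(F.log_softmax(rationale_logits, dim=-1),
                       F.softmax(full_logits, dim=-1), reduction="batchmean")
sparsity = mask.mean()                                # encourage short rationales
loss = calibration + 0.1 * sparsity
loss.backward()
```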
arXiv Detail & Related papers (2023-01-15T03:13:09Z) - Why Does Surprisal From Larger Transformer-Based Language Models Provide
a Poorer Fit to Human Reading Times? [9.909170013118775]
These results suggest that the propensity of larger Transformer-based models to 'memorize' sequences during training makes their surprisal estimates diverge from humanlike expectations.
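Surprisal here is the standard quantity -log p(token | context); a minimal sketch of computing it from GPT-2 with the transformers library follows (the example sentence is arbitrary).
```python
# Hedged sketch of the quantity compared to reading times: per-token surprisal
# from a pretrained causal LM, using the transformers library.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = "The horse raced past the barn fell."
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits                      # (1, seq, vocab)

log_probs = torch.log_softmax(logits, dim=-1)
# Surprisal of token t comes from the prediction made at position t-1.
surprisal = -log_probs[0, :-1].gather(1, ids[0, 1:, None]).squeeze(-1)

for t, s in zip(tok.convert_ids_to_tokens(ids[0, 1:].tolist()), surprisal):
    print(f"{t:>12s}  {s.item():6.2f} nats")
```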
arXiv Detail & Related papers (2022-12-23T03:57:54Z) - Shapley Head Pruning: Identifying and Removing Interference in
Multilingual Transformers [54.4919139401528]
We show that it is possible to reduce interference by identifying and pruning language-specific parameters.
We show that removing identified attention heads from a fixed model improves performance for a target language on both sentence classification and structured prediction.
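A hedged sketch of the pruning step using the standard transformers prune_heads API; the layer/head indices and label count below are placeholders, not the paper's Shapley-selected heads.
```python
# Sketch of the pruning step: the {layer: [head]} choices are hypothetical
# stand-ins for heads that an attribution method (Shapley values in the paper)
# has flagged as interfering for the target language.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=3)   # num_labels is illustrative

heads_to_prune = {2: [0, 5], 7: [3]}   # hypothetical layer -> head indices
model.prune_heads(heads_to_prune)      # removes those heads' parameters entirely

# The pruned model is then evaluated as-is (no retraining) on the target
# language's sentence classification or structured prediction task.
```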
arXiv Detail & Related papers (2022-10-11T18:11:37Z) - BERT, can HE predict contrastive focus? Predicting and controlling
prominence in neural TTS using a language model [29.188684861193092]
We evaluate the accuracy of a BERT model, finetuned to predict quantized acoustic prominence features, on utterances containing contrastive focus.
We also evaluate the controllability of pronoun prominence in a TTS model conditioned on acoustic prominence features.
arXiv Detail & Related papers (2022-07-04T20:43:41Z) - A Generative Language Model for Few-shot Aspect-Based Sentiment Analysis [90.24921443175514]
We focus on aspect-based sentiment analysis, which involves extracting aspect terms and categories and predicting their corresponding polarities.
We propose to reformulate the extraction and prediction tasks into the sequence generation task, using a generative language model with unidirectional attention.
Our approach outperforms the previous state-of-the-art (based on BERT) on average performance by a large margin in few-shot and full-shot settings.
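A minimal illustration of the reformulation, with a made-up serialization format; the paper's exact target template may differ.
```python
# Assumed sketch: aspect-based sentiment tuples are serialized into a target
# string so a unidirectional (GPT-style) LM can be fine-tuned to generate them.
def absa_to_sequence(sentence, triples):
    """triples: list of (aspect_term, category, polarity)."""
    target = " ; ".join(f"{a} | {c} | {p}" for a, c, p in triples)
    return f"Review: {sentence}\nTriples: {target}"

example = absa_to_sequence(
    "The pasta was great but the service was slow.",
    [("pasta", "food", "positive"), ("service", "service", "negative")],
)
print(example)
# Fine-tuning treats the whole string as a language-modeling target, and at
# test time the model generates the triples after "Triples:".
```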
arXiv Detail & Related papers (2022-04-11T18:31:53Z) - Experiments with adversarial attacks on text genres [0.0]
Neural models based on pre-trained transformers, such as BERT or XLM-RoBERTa, demonstrate SOTA results in many NLP tasks.
We show that embedding-based algorithms, which replace some of the "most significant" words with similar words, can influence model predictions in a significant proportion of cases.
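A toy sketch of the substitution idea with fabricated embeddings and importance scores; a real attack would use pretrained word vectors and model-derived saliency.
```python
# Toy illustration: replace the most "significant" words with nearest
# neighbours in embedding space, then re-score the perturbed text with the
# genre classifier to see whether its prediction changes.
import numpy as np

embeddings = {                       # tiny fabricated embedding table
    "excellent": np.array([0.9, 0.1]),
    "superb":    np.array([0.88, 0.12]),
    "novel":     np.array([0.2, 0.8]),
    "story":     np.array([0.25, 0.78]),
}

def nearest_neighbour(word):
    v = embeddings[word]
    others = [(w, u) for w, u in embeddings.items() if w != word]
    return max(others, key=lambda wu: np.dot(v, wu[1]) /
               (np.linalg.norm(v) * np.linalg.norm(wu[1])))[0]

def attack(tokens, importance, k=1):
    # Replace the k most important words with their embedding neighbours.
    order = sorted(range(len(tokens)), key=lambda i: -importance[i])
    out = list(tokens)
    for i in order[:k]:
        if out[i] in embeddings:
            out[i] = nearest_neighbour(out[i])
    return out

print(attack(["an", "excellent", "novel"], importance=[0.0, 0.9, 0.4]))
# -> ['an', 'superb', 'novel']
```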
arXiv Detail & Related papers (2021-07-05T19:37:59Z) - Improving Robustness by Augmenting Training Sentences with
Predicate-Argument Structures [62.562760228942054]
Existing approaches to improve robustness against dataset biases mostly focus on changing the training objective.
We propose to augment the input sentences in the training data with their corresponding predicate-argument structures.
We show that without targeting a specific bias, our sentence augmentation improves the robustness of transformer models against multiple biases.
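A minimal sketch of the augmentation step; the SRL frames below are written by hand, whereas in practice they would come from a semantic role labeler.
```python
# Assumed sketch: the predicate-argument structure is appended to the original
# sentence and both are fed to the transformer.
def augment_with_srl(sentence, frames):
    """frames: list of dicts like {"PRED": ..., "ARG0": ..., "ARG1": ...}."""
    tags = " ".join(
        " ".join(f"[{role}: {span}]" for role, span in frame.items())
        for frame in frames
    )
    return f"{sentence} {tags}"

print(augment_with_srl(
    "The author wrote the book.",
    [{"PRED": "wrote", "ARG0": "The author", "ARG1": "the book"}],
))
# -> "The author wrote the book. [PRED: wrote] [ARG0: The author] [ARG1: the book]"
```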
arXiv Detail & Related papers (2020-10-23T16:22:05Z) - Fixed Encoder Self-Attention Patterns in Transformer-Based Machine
Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
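A minimal sketch of one such fixed, non-learnable pattern (attend to the previous token); the fixed matrix simply replaces the learned softmax attention weights of a head. Other patterns, such as attending to the current or next token, are built the same way.
```python
# Hedged illustration with toy sizes; not the paper's implementation.
import numpy as np

def previous_token_pattern(seq_len):
    attn = np.zeros((seq_len, seq_len))
    attn[0, 0] = 1.0                      # first token can only attend to itself
    for i in range(1, seq_len):
        attn[i, i - 1] = 1.0              # token i attends to token i-1
    return attn

seq_len, d = 5, 8
values = np.random.default_rng(0).normal(size=(seq_len, d))
fixed_attn = previous_token_pattern(seq_len)
head_output = fixed_attn @ values          # same matmul as a learned head
print(fixed_attn)
```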
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.