What Goes Into a LM Acceptability Judgment? Rethinking the Impact of Frequency and Length
- URL: http://arxiv.org/abs/2411.02528v1
- Date: Mon, 04 Nov 2024 19:05:49 GMT
- Title: What Goes Into a LM Acceptability Judgment? Rethinking the Impact of Frequency and Length
- Authors: Lindia Tjuatja, Graham Neubig, Tal Linzen, Sophie Hao
- Abstract summary: We show that MORCELA outperforms a commonly used linking theory for acceptability.
Larger models require a lower relative degree of adjustment for unigram frequency.
Our analysis shows that larger LMs' lower susceptibility to frequency effects can be explained by an ability to better predict rarer words in context.
- Abstract: When comparing the linguistic capabilities of language models (LMs) with humans using LM probabilities, factors such as the length of the sequence and the unigram frequency of lexical items have a significant effect on LM probabilities in ways that humans are largely robust to. Prior works in comparing LM and human acceptability judgments treat these effects uniformly across models, making a strong assumption that models require the same degree of adjustment to control for length and unigram frequency effects. We propose MORCELA, a new linking theory between LM scores and acceptability judgments where the optimal level of adjustment for these effects is estimated from data via learned parameters for length and unigram frequency. We first show that MORCELA outperforms a commonly used linking theory for acceptability--SLOR (Pauls and Klein, 2012; Lau et al. 2017)--across two families of transformer LMs (Pythia and OPT). Furthermore, we demonstrate that the assumed degrees of adjustment in SLOR for length and unigram frequency overcorrect for these confounds, and that larger models require a lower relative degree of adjustment for unigram frequency, though a significant amount of adjustment is still necessary for all models. Finally, our subsequent analysis shows that larger LMs' lower susceptibility to frequency effects can be explained by an ability to better predict rarer words in context.
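For concreteness, the contrast between the two linking theories can be sketched as follows: SLOR applies a fixed correction, subtracting the sentence's summed unigram log-probability from its LM log-probability and dividing by length, whereas a MORCELA-style score scales the unigram and length terms by coefficients estimated from human judgment data. The code below is a minimal sketch under an assumed linear parameterization; the function names, the regression-based fitting step, and the exact functional form are illustrative and may not match the paper's formulation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def slor(logp_lm, logp_unigram, length):
    """SLOR (Pauls and Klein, 2012; Lau et al., 2017): subtract the summed
    unigram log-probability from the LM log-probability and divide by
    sentence length -- a fixed, model-independent correction."""
    return (logp_lm - logp_unigram) / length


def adjusted_score(logp_lm, logp_unigram, length, beta, gamma):
    """MORCELA-style score in which the unigram and length corrections are
    scaled by learned coefficients beta and gamma rather than fixed.
    (Illustrative linear form; the paper's exact parameterization may differ.)"""
    return logp_lm - beta * logp_unigram - gamma * length


def fit_adjustments(logp_lm, logp_unigram, length, human_ratings):
    """Estimate beta and gamma by regressing human acceptability ratings on
    the three predictors, then normalizing so the LM term has weight 1."""
    X = np.column_stack([logp_lm, logp_unigram, length])
    reg = LinearRegression().fit(X, np.asarray(human_ratings))
    w_lm, w_uni, w_len = reg.coef_
    return -w_uni / w_lm, -w_len / w_lm
```

Under this framing, SLOR amounts to fixing the adjustment coefficients in advance, which is precisely the assumption the abstract argues overcorrects for length and unigram frequency.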
Related papers
- Why Larger Language Models Do In-context Learning Differently? [12.554356517949785]
Large language models (LLMs) have emerged as a powerful tool for AI, with in-context learning (ICL) as a key capability.
A recent and still unexplained observation is that models of different scales can exhibit different ICL behaviors.
arXiv Detail & Related papers (2024-05-30T01:11:35Z)
- Quantifying Emergence in Large Language Models [31.608080868988825]
We propose a quantifiable approach to estimating the emergence of LLMs.
Inspired by emergentism in dynamics, we quantify the strength of emergence by comparing the entropy reduction of the macroscopic (semantic) level with that of the microscopic (token) level.
Our method exhibits consistent behavior across a suite of LMs in both in-context learning (ICL) and natural-sentence settings.
arXiv Detail & Related papers (2024-05-21T09:12:20Z)
- Tuning Language Models by Proxy [110.49482736590907]
We introduce proxy-tuning, a lightweight decoding-time algorithm that operates on top of black-box LMs to achieve the same end as direct tuning.
Our method tunes a smaller LM, then applies the difference between the predictions of the small tuned and untuned LMs to shift the original predictions of the larger untuned model in the direction of tuning; a minimal sketch of this logit arithmetic follows this entry.
arXiv Detail & Related papers (2024-01-16T18:49:55Z)
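The per-step shift described in the proxy-tuning entry above can be written in a few lines. The sketch below assumes all three models share the same vocabulary ordering and shows only the decoding-time logit arithmetic; the function names and softmax wrapper are illustrative, not the authors' implementation.

```python
import numpy as np


def proxy_tuned_logits(base_logits, expert_logits, antiexpert_logits):
    """Shift the large untuned model's logits by the difference between the
    small tuned ("expert") and small untuned ("anti-expert") models' logits.
    All three arrays are assumed to share the same vocabulary ordering."""
    return np.asarray(base_logits) + (
        np.asarray(expert_logits) - np.asarray(antiexpert_logits)
    )


def next_token_distribution(base_logits, expert_logits, antiexpert_logits):
    """Softmax over the shifted logits gives the decoding-time distribution."""
    shifted = proxy_tuned_logits(base_logits, expert_logits, antiexpert_logits)
    exp = np.exp(shifted - shifted.max())
    return exp / exp.sum()
```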
- NoMIRACL: Knowing When You Don't Know for Robust Multilingual Retrieval-Augmented Generation [92.5132418788568]
Retrieval-augmented generation (RAG) grounds large language model (LLM) output by leveraging external knowledge sources to reduce factual hallucinations.
NoMIRACL is a human-annotated dataset for evaluating LLM robustness in RAG across 18 typologically diverse languages.
We measure robustness using two metrics: (i) hallucination rate, the model's tendency to hallucinate an answer when no answer is present in the passages of the non-relevant subset, and (ii) error rate, the model's failure to recognize relevant passages in the relevant subset. A minimal sketch of both metrics follows this entry.
arXiv Detail & Related papers (2023-12-18T17:18:04Z)
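As a minimal sketch of the two NoMIRACL metrics, assuming each model response has already been labeled as "answered" or "abstained" (the labels and function names are assumptions for illustration, not NoMIRACL's actual interface):

```python
def hallucination_rate(non_relevant_responses):
    """Fraction of non-relevant-subset queries on which the model answers
    anyway, even though no retrieved passage contains the answer.
    Each response is assumed to be pre-labeled "answered" or "abstained"."""
    return sum(r == "answered" for r in non_relevant_responses) / len(
        non_relevant_responses
    )


def error_rate(relevant_responses):
    """Fraction of relevant-subset queries on which the model fails to
    recognize that the retrieved passages do contain the answer."""
    return sum(r == "abstained" for r in relevant_responses) / len(
        relevant_responses
    )
```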
- Large GPT-like Models are Bad Babies: A Closer Look at the Relationship between Linguistic Competence and Psycholinguistic Measures [25.210837736795565]
We train a series of GPT-like language models of different sizes on the strict version of the BabyLM pretraining corpus.
We find a positive correlation between LM size and performance on all three challenge tasks, with different preferences for model width and depth in each of the tasks.
This suggests that modelling processing effort and linguistic competence may require an approach different from training GPT-like LMs on a developmentally plausible corpus.
arXiv Detail & Related papers (2023-11-08T09:26:27Z)
- Scaling Relationship on Learning Mathematical Reasoning with Large Language Models [75.29595679428105]
We investigate how the pre-training loss, supervised data amount, and augmented data amount influence the reasoning performance of a supervised LLM.
We find that rejection sampling from multiple models pushes LLaMA-7B to an accuracy of 49.3% on GSM8K, significantly outperforming the supervised fine-tuning (SFT) accuracy of 35.9%.
arXiv Detail & Related papers (2023-08-03T15:34:01Z)
- To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We first investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
We then examine the key factors contributing to multi-epoch degradation, finding that dataset size, model parameters, and training objectives all play a significant role.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)
- On Minimum Word Error Rate Training of the Hybrid Autoregressive Transducer [40.63693071222628]
We study minimum word error rate (MWER) training of the Hybrid Autoregressive Transducer (HAT).
From experiments with around 30,000 hours of training data, we show that MWER training can improve the accuracy of HAT models.
arXiv Detail & Related papers (2020-10-23T21:16:30Z)
- Future Vector Enhanced LSTM Language Model for LVCSR [67.03726018635174]
This paper proposes a novel enhanced long short-term memory (LSTM) LM using the future vector.
Experiments show that the proposed LSTM LM achieves better BLEU scores for long-term sequence prediction.
Rescoring with both the new and conventional LSTM LMs yields a very large improvement in word error rate.
arXiv Detail & Related papers (2020-07-31T08:38:56Z)