Lexical Generalization Improves with Larger Models and Longer Training
- URL: http://arxiv.org/abs/2210.12673v2
- Date: Tue, 25 Oct 2022 06:42:50 GMT
- Title: Lexical Generalization Improves with Larger Models and Longer Training
- Authors: Elron Bandel, Yoav Goldberg and Yanai Elazar
- Abstract summary: We analyze the use of lexical overlap heuristics in natural language inference, paraphrase detection, and reading comprehension.
We find that larger models are much less susceptible to adopting lexical overlap heuristics.
- Score: 42.024050065980845
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While fine-tuned language models perform well on many tasks, they were also
shown to rely on superficial surface features such as lexical overlap.
Excessive utilization of such heuristics can lead to failure on challenging
inputs. We analyze the use of lexical overlap heuristics in natural language
inference, paraphrase detection, and reading comprehension (using a novel
contrastive dataset), and find that larger models are much less susceptible to
adopting lexical overlap heuristics. We also find that longer training leads
models to abandon lexical overlap heuristics. Finally, we provide evidence that
the disparity between model sizes has its source in the pre-trained model.
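To make the heuristic concrete, here is a minimal sketch (my own illustration, not code from the paper): it measures lexical overlap between an NLI premise and hypothesis and shows a HANS-style contrastive pair where the heuristic points to the wrong label. The tokenization and overlap measure are simplifying assumptions.

```python
# Minimal sketch (not from the paper): lexical overlap between an NLI premise
# and hypothesis, plus a contrastive pair where the overlap heuristic fails.

def lexical_overlap(premise: str, hypothesis: str) -> float:
    """Fraction of hypothesis tokens that also appear in the premise."""
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    if not hypothesis_tokens:
        return 0.0
    shared = sum(tok in premise_tokens for tok in hypothesis_tokens)
    return shared / len(hypothesis_tokens)

# HANS-style example: every hypothesis word occurs in the premise,
# yet the correct label is non-entailment (the doctor danced, not the actor).
premise = "The doctor near the actor danced"
hypothesis = "The actor danced"

print(lexical_overlap(premise, hypothesis))  # 1.0 -> heuristic predicts "entailment"
# A model relying on lexical overlap answers "entailment"; the gold label is non-entailment.
```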
Related papers
- Frequency Explains the Inverse Correlation of Large Language Models'
Size, Training Data Amount, and Surprisal's Fit to Reading Times [15.738530737312335]
Recent studies have shown that as Transformer-based language models become larger and are trained on very large amounts of data, the fit of their surprisal estimates to naturalistic human reading times degrades.
This paper presents a series of analyses showing that word frequency is a key explanatory factor underlying these two trends.
The results indicate that Transformer-based language models' surprisal estimates diverge from human-like expectations due to the superhumanly complex associations they learn for predicting rare words (surprisal itself is sketched in the short code example after this list).
arXiv Detail & Related papers (2024-02-03T20:22:54Z) - Longer Fixations, More Computation: Gaze-Guided Recurrent Neural
Networks [12.57650361978445]
Humans read texts at a varying pace, while machine learning models treat each token in the same way.
In this paper, we convert this intuition into a set of novel models with fixation-guided parallel RNNs or layers.
We find that, interestingly, the fixation duration predicted by neural networks bears some resemblance to human fixation durations.
arXiv Detail & Related papers (2023-10-31T21:32:11Z) - Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z) - Rarely a problem? Language models exhibit inverse scaling in their
predictions following few-type quantifiers [0.6091702876917281]
We focus on 'few'-type quantifiers, as in 'few children like toys', which might pose a particular challenge for language models.
We present 960 English sentence stimuli from two human neurolinguistic experiments to 22 autoregressive transformer models of differing sizes.
arXiv Detail & Related papers (2022-12-16T20:01:22Z) - Emergent Abilities of Large Language Models [172.08007363384218]
We consider an ability to be emergent if it is not present in smaller models but is present in larger models.
The existence of such emergence implies that additional scaling could further expand the range of capabilities of language models.
arXiv Detail & Related papers (2022-06-15T17:32:01Z) - Multi-timescale Representation Learning in LSTM Language Models [69.98840820213937]
Language models must capture statistical dependencies between words at timescales ranging from very short to very long.
We derived a theory for how the memory gating mechanism in long short-term memory language models can capture power law decay.
Experiments showed that LSTM language models trained on natural English text learn to approximate this theoretical distribution.
arXiv Detail & Related papers (2020-09-27T02:13:38Z) - Limits of Detecting Text Generated by Large-Scale Language Models [65.46403462928319]
Some consider large-scale language models that can generate long and coherent pieces of text dangerous, since they may be used in misinformation campaigns.
Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated.
arXiv Detail & Related papers (2020-02-09T19:53:23Z) - A Simple Joint Model for Improved Contextual Neural Lemmatization [60.802451210656805]
We present a simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages.
Our paper describes the model along with training and decoding procedures.
arXiv Detail & Related papers (2019-04-04T02:03:19Z)