The Larger They Are, the Harder They Fail: Language Models do not
Recognize Identifier Swaps in Python
- URL: http://arxiv.org/abs/2305.15507v1
- Date: Wed, 24 May 2023 18:54:39 GMT
- Title: The Larger They Are, the Harder They Fail: Language Models do not
Recognize Identifier Swaps in Python
- Authors: Antonio Valerio Miceli-Barone, Fazl Barez, Ioannis Konstas, Shay B.
Cohen
- Abstract summary: Large Language Models (LLMs) have successfully been applied to code generation tasks.
We show that LLMs fail to properly generate correct Python code when default function names are swapped.
Some of them even become more confident in their incorrect predictions as the model size increases.
- Score: 34.13276581200455
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have successfully been applied to code
generation tasks, raising the question of how well these models understand
programming. Typical programming languages have invariances and equivariances
in their semantics that human programmers intuitively understand and exploit,
such as the (near) invariance to the renaming of identifiers. We show that LLMs
not only fail to properly generate correct Python code when default function
names are swapped, but some of them even become more confident in their
incorrect predictions as the model size increases, an instance of the recently
discovered phenomenon of Inverse Scaling, which runs contrary to the commonly
observed trend of increasing prediction quality with increasing model size. Our
findings indicate that, despite their astonishing typical-case performance,
LLMs still lack a deep, abstract understanding of the content they manipulate,
making them unsuitable for tasks that statistically deviate from their training
data, and that mere scaling is not enough to achieve such capability.
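To make the failure mode concrete, the snippet below sketches the kind of identifier swap the paper studies; the particular built-in pair and the completion task are illustrative assumptions, not the authors' exact prompts.

```python
# Illustrative identifier swap (an assumed example, not the paper's prompt set):
# after this assignment, `print` computes lengths and `len` writes to stdout.
len, print = print, len

def count_words(sentence: str) -> int:
    # A correct continuation must respect the swap and call `print` to obtain
    # a length; models that ignore the reassignment tend to call `len` here.
    return print(sentence.split())

# `len` is now the original `print`, so this line outputs the word count.
len(count_words("the larger they are the harder they fail"))
```

A model that merely pattern-matches on the usual meanings of `len` and `print` will complete such code incorrectly, which is exactly the behavior the paper measures as model size grows.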
Related papers
- Unforgettable Generalization in Language Models [46.98652406155007]
We study the behavior of language models (LMs) in which tasks have been forgotten via fine-tuning on randomized labels.
Across tasks, LMs exhibit extreme variability in whether their predictions change on examples outside the training set.
arXiv Detail & Related papers (2024-09-03T18:55:54Z)
- Sign of the Times: Evaluating the use of Large Language Models for Idiomaticity Detection [2.2724928083094196]
This work looks at the performance of a range of LLMs on three idiomaticity datasets: SemEval 2022 Task 2a, FLUTE, and MAGPIE.
We find that whilst these models do give competitive performance, they do not match the results of fine-tuned task-specific models, even at the largest scales.
arXiv Detail & Related papers (2024-05-15T11:55:14Z)
- Perplexed: Understanding When Large Language Models are Confused [3.4208414448496027]
This paper introduces perplexed, a library for exploring where a language model is perplexed.
We conducted a case study on Large Language Models (LLMs) for code generation, using codetokenizer, an additional tool we built to help analyze code models.
We found that the code LLMs we studied performed worst on code structures that were not syntactically correct.
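As a rough illustration of this kind of analysis, the sketch below scores per-token surprisal for a small code snippet with a generic causal LM; it is not the API of perplexed or codetokenizer, and the choice of GPT-2 is an assumption made for brevity.

```python
# Minimal per-token surprisal sketch (generic, not the `perplexed` library).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # assumed model choice
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

code = "def add(a, b):\n    return a + b\n"
ids = tok(code, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits

# Surprisal of token t is -log p(token_t | tokens_<t); unusually high values
# flag positions where the model is "perplexed".
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
surprisal = -log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
for token, s in zip(tok.convert_ids_to_tokens(ids[0, 1:]), surprisal[0]):
    print(f"{token!r}: {s.item():.2f}")
```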
arXiv Detail & Related papers (2024-04-09T22:03:39Z)
- ArthModel: Enhance Arithmetic Skills to Large Language Model [0.0]
This work provides different ways of thinking, training and using a language model.
The code and models will be released at https://www.eteced.com/eteced/arithmetic_finetuning_v1.
arXiv Detail & Related papers (2023-11-30T15:06:50Z)
- The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics [74.99898531299148]
This research examines vocabulary trimming (VT), which restricts embedding entries to the language of interest to bolster time and memory efficiency.
We apply two language heuristics, Unicode-based script filtering and corpus-based selection, to trim the full vocabulary across different language families and sizes.
It is found that VT reduces the memory usage of small models by nearly 50% and has an upper bound of 25% improvement in generation speed.
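For intuition, here is a minimal sketch of the Unicode-based script-filtering heuristic named above, applied to a toy vocabulary; the Latin-only criterion and the toy entries are assumptions for illustration, not the paper's implementation.

```python
# Toy Unicode-based script filtering: keep tokens whose alphabetic characters
# are all Latin (an illustrative criterion, not the paper's exact rule).
import unicodedata

def is_latin_or_scriptless(token: str) -> bool:
    for ch in token:
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN"):
            return False
    return True

vocab = {"hello": 0, "##ing": 1, "привет": 2, "世界": 3, "42": 4}  # toy entries
trimmed = {tok: idx for tok, idx in vocab.items() if is_latin_or_scriptless(tok)}
print(trimmed)  # {'hello': 0, '##ing': 1, '42': 4}
```

In a real setting, the embedding rows of the dropped tokens would also be removed, which is where the memory savings reported above come from.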
arXiv Detail & Related papers (2023-11-16T09:35:50Z)
- L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
- The first step is the hardest: Pitfalls of Representing and Tokenizing Temporal Data for Large Language Models [10.414206635385632]
Large Language Models (LLMs) have demonstrated remarkable generalization across diverse tasks.
A notable obstacle emerges when feeding numerical/temporal data into these models, such as data sourced from wearables or electronic health records.
We discuss recent works that employ LLMs for human-centric tasks such as mobile health sensing and present a case study showing that popular LLMs tokenize temporal data incorrectly.
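The snippet below gives a quick way to see this tokenization problem with an off-the-shelf subword tokenizer; GPT-2's tokenizer and the sample values are assumptions for illustration, not the paper's case study.

```python
# Inspect how a subword tokenizer fragments numeric/temporal strings.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # assumed tokenizer choice
for value in ["72.5", "2023-09-12T13:51:29", "120/80 mmHg"]:
    # Digit runs are often split into arbitrary multi-character pieces, so
    # nearby values (e.g. 72.5 vs. 72.6) can get very different token sequences.
    print(value, "->", tok.tokenize(value))
```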
arXiv Detail & Related papers (2023-09-12T13:51:29Z)
- Interpretability at Scale: Identifying Causal Mechanisms in Alpaca [62.65877150123775]
We use Boundless DAS to efficiently search for interpretable causal structure in large language models while they follow instructions.
Our findings mark a first step toward faithfully understanding the inner workings of our ever-growing and most widely deployed language models.
arXiv Detail & Related papers (2023-05-15T17:15:40Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)