Is Next Token Prediction Sufficient for GPT? Exploration on Code Logic Comprehension
- URL: http://arxiv.org/abs/2404.08885v1
- Date: Sat, 13 Apr 2024 03:11:07 GMT
- Title: Is Next Token Prediction Sufficient for GPT? Exploration on Code Logic Comprehension
- Authors: Mengnan Qi, Yufan Huang, Yongqiang Yao, Maoquan Wang, Bin Gu, Neel Sundaresan,
- Abstract summary: We propose an advanced pretraining task, "Next Token Prediction+"
Following this pretraining, both Code Llama and StarCoder, the prevalent code domain pretraining models, display significant improvements on our logically equivalent code selection task and the code completion task.
- Score: 18.919972400933393
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) has experienced exponential growth, they demonstrate remarkable performance across various tasks. Notwithstanding, contemporary research primarily centers on enhancing the size and quality of pretraining data, still utilizing the next token prediction task on autoregressive transformer model structure. The efficacy of this task in truly facilitating the model's comprehension of code logic remains questionable, we speculate that it still interprets code as mere text, while human emphasizes the underlying logical knowledge. In order to prove it, we introduce a new task, "Logically Equivalent Code Selection," which necessitates the selection of logically equivalent code from a candidate set, given a query code. Our experimental findings indicate that current LLMs underperform in this task, since they understand code by unordered bag of keywords. To ameliorate their performance, we propose an advanced pretraining task, "Next Token Prediction+". This task aims to modify the sentence embedding distribution of the LLM without sacrificing its generative capabilities. Our experimental results reveal that following this pretraining, both Code Llama and StarCoder, the prevalent code domain pretraining models, display significant improvements on our logically equivalent code selection task and the code completion task.
Related papers
- CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction [47.17755403213469]
We propose CodeI/O, a novel approach that condenses diverse reasoning patterns embedded in contextually-grounded codes.
By training models to predict inputs/outputs given code and test cases entirely in natural language, we expose them to universal reasoning primitives.
Experimental results demonstrate CodeI/O leads to consistent improvements across symbolic, scientific, logic, math & numerical, and commonsense reasoning tasks.
arXiv Detail & Related papers (2025-02-11T07:26:50Z) - Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data [76.90128359866462]
We introduce an extended concept of memorization, distributional memorization, which measures the correlation between the output probabilities and the pretraining data frequency.
We show that memorization plays a larger role in simpler, knowledge-intensive tasks, while generalization is the key for harder, reasoning-based tasks.
arXiv Detail & Related papers (2024-07-20T21:24:40Z) - Case2Code: Scalable Synthetic Data for Code Generation [105.89741089673575]
Large Language Models (LLMs) have shown outstanding breakthroughs in code generation.
Recent work improves code LLMs by training on synthetic data generated by some powerful LLMs.
We propose a textbfCase2Code task by exploiting the expressiveness and correctness of programs.
arXiv Detail & Related papers (2024-07-17T11:35:00Z) - Does Your Neural Code Completion Model Use My Code? A Membership Inference Approach [66.51005288743153]
We investigate the legal and ethical issues of current neural code completion models.
We tailor a membership inference approach (termed CodeMI) that was originally crafted for classification tasks.
We evaluate the effectiveness of this adapted approach across a diverse array of neural code completion models.
arXiv Detail & Related papers (2024-04-22T15:54:53Z) - Zero-Shot Code Representation Learning via Prompt Tuning [6.40875582886359]
We propose Zecoler, a zero-shot approach for learning code representations.
Zecoler is built upon a pre-trained programming language model.
We evaluate Zecoler in five code intelligence tasks including code clone detection, code search, method name prediction, code summarization, and code generation.
arXiv Detail & Related papers (2024-04-13T09:47:07Z) - Enhancing Semantic Code Search with Multimodal Contrastive Learning and
Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z) - CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for
Code Understanding and Generation [36.47905744758698]
We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed from the developer-assigned identifiers.
Our model employs a unified framework to seamlessly support both code understanding and generation tasks and allows for multi-task learning.
arXiv Detail & Related papers (2021-09-02T12:21:06Z) - CLSEBERT: Contrastive Learning for Syntax Enhanced Code Pre-Trained
Model [23.947178895479464]
We propose CLSEBERT, a Constrastive Learning Framework for Syntax Enhanced Code Pre-Trained Model.
In the pre-training stage, we consider the code syntax and hierarchy contained in the Abstract Syntax Tree (AST)
We also introduce two novel pre-training objectives. One is to predict the edges between nodes in the abstract syntax tree, and the other is to predict the types of code tokens.
arXiv Detail & Related papers (2021-08-10T10:08:21Z) - InferCode: Self-Supervised Learning of Code Representations by
Predicting Subtrees [17.461451218469062]
This paper proposes InferCode to overcome the limitation by adapting the self-language learning mechanism to build source code model.
Subtrees in ASTs are treated with InferCode as the labels for training code representations without any human labeling effort or the overhead of expensive graph construction.
Compared to previous code learning techniques applied to the same downstream tasks, such as Code2Vec, Code2Seq, ASTNN, higher performance results are achieved using our pre-trained InferCode model.
arXiv Detail & Related papers (2020-12-13T10:33:41Z) - GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z) - MC-BERT: Efficient Language Pre-Training via a Meta Controller [96.68140474547602]
Large-scale pre-training is computationally expensive.
ELECTRA, an early attempt to accelerate pre-training, trains a discriminative model that predicts whether each input token was replaced by a generator.
We propose a novel meta-learning framework, MC-BERT, to achieve better efficiency and effectiveness.
arXiv Detail & Related papers (2020-06-10T09:22:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.