SCELMo: Source Code Embeddings from Language Models
- URL: http://arxiv.org/abs/2004.13214v1
- Date: Tue, 28 Apr 2020 00:06:25 GMT
- Title: SCELMo: Source Code Embeddings from Language Models
- Authors: Rafael - Michael Karampatsis and Charles Sutton
- Abstract summary: We introduce a new set of deep contextualized word representations for computer programs based on language models.
We show that even a low-dimensional embedding trained on a relatively small corpus of programs can improve a state-of-the-art machine learning system for bug detection.
- Score: 33.673421734844474
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Continuous embeddings of tokens in computer programs have been used to
support a variety of software development tools, including readability, code
search, and program repair. Contextual embeddings are common in natural
language processing but have not been previously applied in software
engineering. We introduce a new set of deep contextualized word representations
for computer programs based on language models. We train a set of embeddings
using the ELMo (embeddings from language models) framework of Peters et al
(2018). We investigate whether these embeddings are effective when fine-tuned
for the downstream task of bug detection. We show that even a low-dimensional
embedding trained on a relatively small corpus of programs can improve a
state-of-the-art machine learning system for bug detection.
Related papers
- Synthetic Programming Elicitation for Text-to-Code in Very Low-Resource Programming and Formal Languages [21.18996339478024]
We introduce emphsynthetic programming elicitation and compilation (SPEAC)
SPEAC produces syntactically correct programs more frequently and without sacrificing semantic correctness.
We empirically evaluate the performance of SPEAC in a case study for the UCLID5 formal verification language.
arXiv Detail & Related papers (2024-06-05T22:16:19Z) - CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs.
CodeGRAG builds the graphical view of code blocks based on the control flow and data flow of them to fill the gap between programming languages and natural language.
Various experiments and ablations are done on four datasets including both the C++ and python languages to validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for pretrained GNN expert.
arXiv Detail & Related papers (2024-05-03T02:48:55Z) - A Novel Approach for Automatic Program Repair using Round-Trip
Translation with Large Language Models [50.86686630756207]
Research shows that grammatical mistakes in a sentence can be corrected by translating it to another language and back.
Current generative models for Automatic Program Repair (APR) are pre-trained on source code and fine-tuned for repair.
This paper proposes bypassing the fine-tuning step and using Round-Trip Translation (RTT): translation of code from one programming language to another programming or natural language, and back.
arXiv Detail & Related papers (2024-01-15T22:36:31Z) - Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts.
Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness.
Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence.
arXiv Detail & Related papers (2023-09-25T15:42:18Z) - Enhancing Automated Program Repair through Fine-tuning and Prompt
Engineering [2.3826139428423576]
Sequence-to-sequence models have been used to transform erroneous programs into correct ones when trained with a large enough dataset.
Some recent studies demonstrated strong empirical evidence that code review could improve the program repair further.
We investigate if this inherent knowledge of PL and NL can be utilized to improve automated program repair.
arXiv Detail & Related papers (2023-04-16T17:29:51Z) - Beyond the C: Retargetable Decompilation using Neural Machine
Translation [5.734661402742406]
We develop a prototype decompiler that is easily retargetable to new languages.
We examine the impact of parameters such as tokenization and training data selection on the quality of decompilation.
We will release our training data, trained decompilation models, and code to help encourage future research into language-agnostic decompilation.
arXiv Detail & Related papers (2022-12-17T20:45:59Z) - BigIssue: A Realistic Bug Localization Benchmark [89.8240118116093]
BigIssue is a benchmark for realistic bug localization.
We provide a general benchmark with a diversity of real and synthetic Java bugs.
We hope to advance the state of the art in bug localization, in turn improving APR performance and increasing its applicability to the modern development cycle.
arXiv Detail & Related papers (2022-07-21T20:17:53Z) - AVATAR: A Parallel Corpus for Java-Python Program Translation [77.86173793901139]
Program translation refers to migrating source code from one language to another.
We present AVATAR, a collection of 9,515 programming problems and their solutions written in two popular languages, Java and Python.
arXiv Detail & Related papers (2021-08-26T05:44:20Z) - Leveraging Language to Learn Program Abstractions and Search Heuristics [66.28391181268645]
We introduce LAPS (Language for Abstraction and Program Search), a technique for using natural language annotations to guide joint learning of libraries and neurally-guided search models for synthesis.
When integrated into a state-of-the-art library learning system (DreamCoder), LAPS produces higher-quality libraries and improves search efficiency and generalization.
arXiv Detail & Related papers (2021-06-18T15:08:47Z) - Automated Source Code Generation and Auto-completion Using Deep
Learning: Comparing and Discussing Current Language-Model-Related Approaches [0.0]
This paper compares different deep learning architectures to create and use language models based on programming code.
We discuss each approach's different strengths and weaknesses and what gaps we find to evaluate the language models or apply them in a real programming context.
arXiv Detail & Related papers (2020-09-16T15:17:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.