JavaBERT: Training a transformer-based model for the Java programming language
- URL: http://arxiv.org/abs/2110.10404v1
- Date: Wed, 20 Oct 2021 06:49:41 GMT
- Title: JavaBERT: Training a transformer-based model for the Java programming language
- Authors: Nelson Tavares de Sousa, Wilhelm Hasselbring
- Abstract summary: We introduce a data retrieval pipeline for software code and train a model on Java software code.
The resulting model, JavaBERT, shows high accuracy on the masked language modeling task.
- Score: 1.599072005190786
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code quality is and will remain a crucial factor in developing new software,
requiring appropriate tools to ensure functional and reliable code.
Machine learning techniques are still rarely used in software engineering
tools, missing out on the potential benefits of their application. Natural language
processing has shown the potential to process text data for a variety of
tasks. We argue that such models can show similar benefits for software
code processing. In this paper, we investigate how models used for natural
language processing can be trained on software code. We introduce a data
retrieval pipeline for software code and train a model on Java software code.
The resulting model, JavaBERT, shows high accuracy on the masked language
modeling task, demonstrating its potential for software engineering tools.
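To make the masked language modeling task concrete, here is a minimal sketch of a fill-mask query over a Java statement using the Hugging Face transformers library. The `bert-base-cased` checkpoint is only a stand-in; this sketch does not assume the JavaBERT weights are published under any particular model name.

```python
# Minimal sketch of the masked language modeling (MLM) task on Java code.
# NOTE: "bert-base-cased" is only a stand-in checkpoint; the JavaBERT weights
# themselves are not assumed to be available under any particular name.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-cased")

# A Java statement with one token masked out; the MLM is asked to predict
# the missing token ("println" in the original statement).
snippet = 'System.out.[MASK]("Hello, world");'

for prediction in fill_mask(snippet, top_k=5):
    print(f"{prediction['token_str']:>12}  score={prediction['score']:.3f}")
```

Training JavaBERT amounts to optimizing this same objective, predicting masked tokens, over the Java corpus gathered by the paper's data retrieval pipeline.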
Related papers
- CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs.
CodeGRAG builds a graphical view of code blocks from their control flow and data flow to bridge the gap between programming languages and natural language.
Various experiments and ablations on four datasets, covering both C++ and Python, validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for the pretrained GNN expert.
arXiv Detail & Related papers (2024-05-03T02:48:55Z)
- CodeTF: One-stop Transformer Library for State-of-the-art Code LLM [72.1638273937025]
We present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence.
Our library supports a collection of pretrained Code LLM models and popular code benchmarks.
We hope CodeTF is able to bridge the gap between machine learning/generative AI and software engineering.
arXiv Detail & Related papers (2023-05-31T05:24:48Z)
- Enhancing Automated Program Repair through Fine-tuning and Prompt Engineering [2.3826139428423576]
Sequence-to-sequence models have been used to transform erroneous programs into correct ones when trained with a large enough dataset.
Some recent studies have demonstrated strong empirical evidence that code review can further improve program repair.
We investigate whether this inherent knowledge of programming languages (PL) and natural language (NL) can be utilized to improve automated program repair.
arXiv Detail & Related papers (2023-04-16T17:29:51Z)
- JaCoText: A Pretrained Model for Java Code-Text Generation [0.0]
We introduce JaCoText, a model based on the Transformer neural network architecture.
It aims to generate Java source code from natural language text.
Experiments on the CONCODE dataset show that JaCoText achieves new state-of-the-art results.
arXiv Detail & Related papers (2023-03-22T19:01:25Z)
- A Conversational Paradigm for Program Synthesis [110.94409515865867]
We propose a conversational program synthesis approach via large language models.
We train a family of large language models, called CodeGen, on natural language and programming language data.
Our findings show the emergence of conversational capabilities and the effectiveness of the proposed conversational program synthesis paradigm.
arXiv Detail & Related papers (2022-03-25T06:55:15Z)
- Using Document Similarity Methods to create Parallel Datasets for Code Translation [60.36392618065203]
Translating source code from one programming language to another is a critical, time-consuming task.
We propose to use document similarity methods to create noisy parallel datasets of code.
We show that these models perform comparably to models trained on ground truth for reasonable levels of noise.
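As an illustration of that idea, here is a minimal sketch of one standard document similarity method, TF-IDF with cosine similarity, used to pair code files across two languages into a noisy parallel dataset. The toy file contents, the tokenization pattern, and the similarity threshold are illustrative assumptions, not the authors' exact pipeline.

```python
# Illustrative sketch: pair Java and Python solutions by TF-IDF cosine
# similarity to form a *noisy* parallel dataset. This is a generic document
# similarity recipe, not necessarily the exact method used in the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

java_docs = [
    "int add(int a, int b) { return a + b; }",
    "boolean isEven(int n) { return n % 2 == 0; }",
]
python_docs = [
    "def is_even(n): return n % 2 == 0",
    "def add(a, b): return a + b",
]

# Permissive token pattern so identifiers and single operators both survive.
vectorizer = TfidfVectorizer(token_pattern=r"[A-Za-z_]+|\S")
matrix = vectorizer.fit_transform(java_docs + python_docs)
java_vecs, python_vecs = matrix[: len(java_docs)], matrix[len(java_docs):]

# Greedily align each Java document with its most similar Python document.
similarity = cosine_similarity(java_vecs, python_vecs)
for i, row in enumerate(similarity):
    j = row.argmax()
    if row[j] > 0.2:  # the threshold controls how noisy the pairs are
        print(f"java[{i}] <-> python[{j}]  (similarity={row[j]:.2f})")
```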
arXiv Detail & Related papers (2021-10-11T17:07:58Z)
- AVATAR: A Parallel Corpus for Java-Python Program Translation [77.86173793901139]
Program translation refers to migrating source code from one language to another.
We present AVATAR, a collection of 9,515 programming problems and their solutions written in two popular languages, Java and Python.
arXiv Detail & Related papers (2021-08-26T05:44:20Z)
- CodeTrans: Towards Cracking the Language of Silicone's Code Through Self-Supervised Deep Learning and High Performance Computing [4.111243115567736]
This paper describes CodeTrans, an encoder-decoder transformer model for tasks in the software engineering domain.
It explores the effectiveness of encoder-decoder transformer models on six software engineering tasks, comprising thirteen sub-tasks.
CodeTrans outperforms state-of-the-art models on all tasks.
arXiv Detail & Related papers (2021-04-06T11:57:12Z)
- SCELMo: Source Code Embeddings from Language Models [33.673421734844474]
We introduce a new set of deep contextualized word representations for computer programs based on language models.
We show that even a low-dimensional embedding trained on a relatively small corpus of programs can improve a state-of-the-art machine learning system for bug detection.
arXiv Detail & Related papers (2020-04-28T00:06:25Z)
- CodeBERT: A Pre-Trained Model for Programming and Natural Languages [117.34242908773061]
CodeBERT is a pre-trained model for programming language (PL) and natural language (NL).
We develop CodeBERT with a Transformer-based neural architecture.
We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters.
arXiv Detail & Related papers (2020-02-19T13:09:07Z)
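For comparison with the JavaBERT setup above, here is a minimal sketch that uses CodeBERT as a shared encoder for a natural language query and a Java snippet, the kind of NL-PL pairing mentioned in the entry above. The checkpoint name `microsoft/codebert-base` is assumed to be the published one on the Hugging Face Hub, and the cosine similarity scoring is a generic illustration rather than the paper's evaluation protocol.

```python
# Sketch: use CodeBERT as a shared encoder for a natural language query and a
# Java snippet, then compare their embeddings. The checkpoint name below is an
# assumption; the scoring is illustrative, not the paper's evaluation setup.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "microsoft/codebert-base"  # assumed published checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(text: str) -> torch.Tensor:
    """Mean-pool the last hidden states into a single vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)

query = "reverse a string"
code = "public String reverse(String s) { return new StringBuilder(s).reverse().toString(); }"

score = torch.cosine_similarity(embed(query), embed(code), dim=0)
print(f"NL-PL similarity: {score.item():.3f}")
```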