A Language Model of Java Methods with Train/Test Deduplication
- URL: http://arxiv.org/abs/2305.08286v1
- Date: Mon, 15 May 2023 00:22:02 GMT
- Title: A Language Model of Java Methods with Train/Test Deduplication
- Authors: Chia-Yi Su, Aakash Bansal, Vijayanta Jain, Sepideh Ghanavati, Collin McMillan
- Abstract summary: This tool demonstration presents a research toolkit for a language model of Java source code.
The target audience includes researchers studying problems at the granularity level of subroutines, statements, or variables in Java.
- Score: 5.529795221640365
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This tool demonstration presents a research toolkit for a language model of
Java source code. The target audience includes researchers studying problems at
the granularity level of subroutines, statements, or variables in Java. In
contrast to many existing language models, we prioritize features for
researchers including an open and easily-searchable training set, a held out
test set with different levels of deduplication from the training set,
infrastructure for deduplicating new examples, and an implementation platform
suitable for execution on equipment accessible to a relatively modest budget.
Our model is a GPT2-like architecture with 350m parameters. Our training set
includes 52m Java methods (9b tokens) and 13m StackOverflow threads (10.5b
tokens). To improve accessibility of research to more members of the community,
we limit local resource requirements to GPUs with 16GB video memory. We provide
a test set of held out Java methods that include descriptive comments,
including the entire Java projects for those methods. We also provide
deduplication tools using precomputed hash tables at various similarity
thresholds to help researchers ensure that their own test examples are not in
the training set. We make all our tools and data open source and available via
Huggingface and Github.
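The abstract mentions deduplication tools built on precomputed hash tables at various similarity thresholds. As a minimal illustrative sketch (not the toolkit's actual API — the function names, n-gram size, and threshold below are assumptions), near-duplicate Java methods can be detected by hashing token n-grams and comparing fingerprint sets:

```python
import hashlib
import re

def normalize(method_src: str) -> list[str]:
    """Strip comments, lowercase, and tokenize a Java method body."""
    src = re.sub(r"//[^\n]*|/\*.*?\*/", "", method_src, flags=re.DOTALL)
    return re.findall(r"[A-Za-z_]\w*|\S", src.lower())

def fingerprint(tokens: list[str], n: int = 4) -> set[str]:
    """Hash every n-gram of tokens; methods sharing many n-gram hashes
    are likely near-duplicates. This is the kind of fingerprint a
    precomputed hash table could store for each training method."""
    grams = (" ".join(tokens[i:i + n]) for i in range(max(1, len(tokens) - n + 1)))
    return {hashlib.sha1(g.encode()).hexdigest()[:16] for g in grams}

def similarity(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two fingerprint sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

# A new test example would be flagged as "in the training set" if its
# similarity to any stored training fingerprint exceeds a chosen threshold.
train = fingerprint(normalize("int add(int a, int b) { return a + b; }"))
test = fingerprint(normalize("int add(int a, int b) { /* sum */ return a + b; }"))
print(similarity(train, test))
```

Comment-only edits survive normalization, so the two methods above fingerprint identically; lowering or raising the threshold trades recall of near-duplicates against false positives, which is presumably why the released tables cover several thresholds.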
Related papers
- CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification [71.34070740261072]
This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases.
The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
arXiv Detail & Related papers (2025-02-12T21:42:56Z)
- Commit0: Library Generation from Scratch [77.38414688148006]
Commit0 is a benchmark that challenges AI agents to write libraries from scratch.
Agents are provided with a specification document outlining the library's API as well as a suite of interactive unit tests.
Commit0 also offers an interactive environment where models receive static analysis and execution feedback on the code they generate.
arXiv Detail & Related papers (2024-12-02T18:11:30Z)
- Generating executable oracles to check conformance of client code to requirements of JDK Javadocs using LLMs [21.06722050714324]
This paper focuses on automation of test oracles for clients of widely used Java libraries, e.g., java.lang and java.util packages.
We use large language models as an enabling technology to embody our insight into a framework for test oracle automation.
arXiv Detail & Related papers (2024-11-04T04:24:25Z)
- Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning [50.26965628047682]
Adapting pre-trained models to open classes is a challenging problem in machine learning.
In this paper, we consider combining the advantages of both and come up with a test-time prompt tuning approach.
Our proposed method outperforms all comparison methods on average considering both base and new classes.
arXiv Detail & Related papers (2024-08-29T12:34:01Z)
- CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation.
We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks.
We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
arXiv Detail & Related papers (2024-06-20T16:59:52Z)
- Outside the Sandbox: A Study of Input/Output Methods in Java [0.0]
We manually categorized 1435 native methods in a Java Standard Edition distribution into non-I/O and I/O-related methods.
Results showed that 21% of the executed methods directly or indirectly called an I/O native.
We conclude that ignoring I/O is not a viable option for tool designers and suggest the integration of I/O-related metadata with source code.
arXiv Detail & Related papers (2023-06-20T20:54:02Z)
- Test-Time Training on Nearest Neighbors for Large Language Models [25.365366617508663]
We build a large-scale distributed index based on text embeddings of the Pile dataset.
For each test input, our system retrieves its neighbors and fine-tunes the model on their text.
Surprisingly, retrieving and training on as few as 20 neighbors drastically improves performance across more than 20 language modeling tasks.
arXiv Detail & Related papers (2023-05-29T08:03:28Z)
- Scaling Expert Language Models with Unsupervised Domain Discovery [107.08940500543447]
We introduce a simple but effective method to asynchronously train large, sparse language models on arbitrary text corpora.
Our method clusters a corpus into sets of related documents, trains a separate expert language model on each cluster, and combines them in a sparse ensemble for inference.
arXiv Detail & Related papers (2023-03-24T17:38:58Z)
- Code Generation Tools (Almost) for Free? A Study of Few-Shot, Pre-Trained Language Models on Code [13.15617135394116]
Few-shot learning with large-scale, pre-trained language models is a powerful way to answer questions about code.
This paper studies to what extent a state-of-the-art, pre-trained language model of code, Codex, may serve this purpose.
arXiv Detail & Related papers (2022-06-02T23:15:42Z)
- SAT-Based Extraction of Behavioural Models for Java Libraries with Collections [0.087024326813104]
Behavioural models are a valuable tool for software verification, testing, monitoring, publishing, etc.
They are rarely provided by the software developers and have to be extracted either from the source or from the compiled code.
Most of these approaches rely on the analysis of the compiled bytecode.
We are looking to extract behavioural models in the form of Finite State Machines (FSMs) from the Java source code to ensure that the obtained FSMs can be easily understood by the software developers.
arXiv Detail & Related papers (2022-05-30T17:27:13Z)
- AVATAR: A Parallel Corpus for Java-Python Program Translation [77.86173793901139]
Program translation refers to migrating source code from one language to another.
We present AVATAR, a collection of 9,515 programming problems and their solutions written in two popular languages, Java and Python.
arXiv Detail & Related papers (2021-08-26T05:44:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.