A Language Model of Java Methods with Train/Test Deduplication
- URL: http://arxiv.org/abs/2305.08286v1
- Date: Mon, 15 May 2023 00:22:02 GMT
- Title: A Language Model of Java Methods with Train/Test Deduplication
- Authors: Chia-Yi Su, Aakash Bansal, Vijayanta Jain, Sepideh Ghanavati, Collin McMillan
- Abstract summary: This tool demonstration presents a research toolkit for a language model of Java source code.
The target audience includes researchers studying problems at the granularity level of subroutines, statements, or variables in Java.
- Score: 5.529795221640365
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This tool demonstration presents a research toolkit for a language model of
Java source code. The target audience includes researchers studying problems at
the granularity level of subroutines, statements, or variables in Java. In
contrast to many existing language models, we prioritize features for
researchers, including an open and easily searchable training set, a held-out
test set with different levels of deduplication from the training set,
infrastructure for deduplicating new examples, and an implementation platform
suitable for execution on equipment accessible on a relatively modest budget.
Our model is a GPT-2-like architecture with 350M parameters. Our training set
includes 52M Java methods (9B tokens) and 13M StackOverflow threads (10.5B
tokens). To improve accessibility of research to more members of the community,
we limit local resource requirements to GPUs with 16GB video memory. We provide
a test set of held-out Java methods that include descriptive comments, along
with the entire Java projects for those methods. We also provide
deduplication tools using precomputed hash tables at various similarity
thresholds to help researchers ensure that their own test examples are not in
the training set. We make all our tools and data open source and available via
Hugging Face and GitHub.
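The abstract does not show how the similarity-threshold deduplication check works, so here is a minimal sketch. It uses MinHash over token shingles, one standard technique for hash-table-based near-duplicate detection; the function names, signature size, and 0.8 threshold are illustrative assumptions, not the toolkit's actual interface.

```python
# Illustrative sketch only -- NOT the toolkit's actual API. It shows the
# general MinHash technique behind similarity-threshold deduplication:
# fingerprint every training method once, then compare each new test
# example's fingerprint against the precomputed table.
import hashlib
import re
from typing import Iterable

NUM_HASHES = 64   # signature length (assumption; larger = finer estimates)
SHINGLE_LEN = 4   # token n-gram size (assumption)

def tokenize(java_source: str) -> list[str]:
    # Crude identifier/symbol split; a real tool would use a Java lexer.
    return re.findall(r"[A-Za-z_]\w*|\S", java_source)

def minhash(tokens: list[str]) -> tuple[int, ...]:
    # One signature slot per seed: the minimum seeded hash over all shingles.
    shingles = {" ".join(tokens[i:i + SHINGLE_LEN])
                for i in range(max(1, len(tokens) - SHINGLE_LEN + 1))}
    return tuple(
        min(int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big")
            for s in shingles)
        for seed in range(NUM_HASHES))

def similarity(a: tuple[int, ...], b: tuple[int, ...]) -> float:
    # Fraction of matching slots estimates Jaccard similarity of the shingles.
    return sum(x == y for x, y in zip(a, b)) / NUM_HASHES

def appears_in_training(candidate: str,
                        table: Iterable[tuple[int, ...]],
                        threshold: float = 0.8) -> bool:
    # True if any precomputed training signature is near the candidate's.
    sig = minhash(tokenize(candidate))
    return any(similarity(sig, t) >= threshold for t in table)
```

A real lookup over 52M training methods would bucket signatures with locality-sensitive hashing bands rather than scan the whole table; the linear scan above just keeps the sketch short.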
Related papers
- Generating executable oracles to check conformance of client code to requirements of JDK Javadocs using LLMs [21.06722050714324]
This paper focuses on automation of test oracles for clients of widely used Java libraries, e.g., the java.lang and java.util packages.
We use large language models as an enabling technology to embody our insight into a framework for test oracle automation.
arXiv Detail & Related papers (2024-11-04T04:24:25Z) - Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders [115.34050914216665]
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models.
We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features.
We assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models.
arXiv Detail & Related papers (2024-10-27T17:33:49Z) - Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning [50.26965628047682]
Adapting pre-trained models to open classes is a challenging problem in machine learning.
In this paper, we consider combining the advantages of both and come up with a test-time prompt tuning approach.
Our proposed method outperforms all comparison methods on average considering both base and new classes.
arXiv Detail & Related papers (2024-08-29T12:34:01Z) - CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation.
We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks.
We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
arXiv Detail & Related papers (2024-06-20T16:59:52Z) - Outside the Sandbox: A Study of Input/Output Methods in Java [0.0]
We manually categorized 1435 native methods in a Java Standard Edition distribution into non-I/O and I/O-related methods.
Results showed that 21% of the executed methods directly or indirectly called an I/O native method.
We conclude that ignoring I/O is not a viable option for tool designers and suggest the integration of I/O-related metadata with source code.
arXiv Detail & Related papers (2023-06-20T20:54:02Z) - Test-Time Training on Nearest Neighbors for Large Language Models [25.365366617508663]
We build a large-scale distributed index based on text embeddings of the Pile dataset.
For each test input, our system retrieves its neighbors and fine-tunes the model on their text.
Surprisingly, retrieving and training on as few as 20 neighbors drastically improves performance across more than 20 language modeling tasks.
arXiv Detail & Related papers (2023-05-29T08:03:28Z) - Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z) - Scaling Expert Language Models with Unsupervised Domain Discovery [107.08940500543447]
We introduce a simple but effective method to asynchronously train large, sparse language models on arbitrary text corpora.
Our method clusters a corpus into sets of related documents, trains a separate expert language model on each cluster, and combines them in a sparse ensemble for inference.
arXiv Detail & Related papers (2023-03-24T17:38:58Z) - Code Generation Tools (Almost) for Free? A Study of Few-Shot,
Pre-Trained Language Models on Code [13.15617135394116]
Few-shot learning with large-scale, pre-trained language models is a powerful way to answer questions about code.
This paper studies to what extent a state-of-the-art, pre-trained language model of code, Codex, may serve this purpose.
arXiv Detail & Related papers (2022-06-02T23:15:42Z) - SAT-Based Extraction of Behavioural Models for Java Libraries with
Collections [0.087024326813104]
Behavioural models are a valuable tool for software verification, testing, monitoring, publishing, etc.
They are rarely provided by the software developers and have to be extracted either from the source code or from the compiled code.
Most existing extraction approaches rely on the analysis of the compiled bytecode.
We are looking to extract behavioural models in the form of Finite State Machines (FSMs) from the Java source code to ensure that the obtained FSMs can be easily understood by the software developers.
arXiv Detail & Related papers (2022-05-30T17:27:13Z) - AVATAR: A Parallel Corpus for Java-Python Program Translation [77.86173793901139]
Program translation refers to migrating source code from one language to another.
We present AVATAR, a collection of 9,515 programming problems and their solutions written in two popular languages, Java and Python.
arXiv Detail & Related papers (2021-08-26T05:44:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.