A Language Model of Java Methods with Train/Test Deduplication
- URL: http://arxiv.org/abs/2305.08286v1
- Date: Mon, 15 May 2023 00:22:02 GMT
- Title: A Language Model of Java Methods with Train/Test Deduplication
- Authors: Chia-Yi Su, Aakash Bansal, Vijayanta Jain, Sepideh Ghanavati, Collin McMillan
- Abstract summary: This tool demonstration presents a research toolkit for a language model of Java source code.
The target audience includes researchers studying problems at the granularity level of subroutines, statements, or variables in Java.
- Score: 5.529795221640365
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This tool demonstration presents a research toolkit for a language model of
Java source code. The target audience includes researchers studying problems at
the granularity level of subroutines, statements, or variables in Java. In
contrast to many existing language models, we prioritize features for
researchers, including an open and easily searchable training set, a held-out
test set with different levels of deduplication from the training set,
infrastructure for deduplicating new examples, and an implementation platform
suitable for execution on equipment accessible on a relatively modest budget.
Our model is a GPT-2-like architecture with 350M parameters. Our training set
includes 52M Java methods (9B tokens) and 13M StackOverflow threads (10.5B
tokens). To improve accessibility of research to more members of the community,
we limit local resource requirements to GPUs with 16GB video memory. We provide
a test set of held-out Java methods that include descriptive comments, along
with the entire Java projects for those methods. We also provide
deduplication tools using precomputed hash tables at various similarity
thresholds to help researchers ensure that their own test examples are not in
the training set. We make all our tools and data open source and available via
Hugging Face and GitHub.
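The abstract does not show how the similarity-threshold deduplication check works, so here is a minimal sketch. It uses MinHash over token shingles, one standard technique for hash-table-based near-duplicate detection; the function names, signature size, and 0.8 threshold are illustrative assumptions, not the toolkit's actual interface.

```python
# Illustrative sketch only -- NOT the toolkit's actual API. It shows the
# general MinHash technique behind similarity-threshold deduplication:
# fingerprint every training method once, then compare each new test
# example's fingerprint against the precomputed table.
import hashlib
import re
from typing import Iterable

NUM_HASHES = 64   # signature length (assumption; larger = finer estimates)
SHINGLE_LEN = 4   # token n-gram size (assumption)

def tokenize(java_source: str) -> list[str]:
    # Crude identifier/symbol split; a real tool would use a Java lexer.
    return re.findall(r"[A-Za-z_]\w*|\S", java_source)

def minhash(tokens: list[str]) -> tuple[int, ...]:
    # One signature slot per seed: the minimum seeded hash over all shingles.
    shingles = {" ".join(tokens[i:i + SHINGLE_LEN])
                for i in range(max(1, len(tokens) - SHINGLE_LEN + 1))}
    return tuple(
        min(int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big")
            for s in shingles)
        for seed in range(NUM_HASHES))

def similarity(a: tuple[int, ...], b: tuple[int, ...]) -> float:
    # Fraction of matching slots estimates Jaccard similarity of the shingles.
    return sum(x == y for x, y in zip(a, b)) / NUM_HASHES

def appears_in_training(candidate: str,
                        table: Iterable[tuple[int, ...]],
                        threshold: float = 0.8) -> bool:
    # True if any precomputed training signature is near the candidate's.
    sig = minhash(tokenize(candidate))
    return any(similarity(sig, t) >= threshold for t in table)
```

A real lookup over 52M training methods would bucket signatures with locality-sensitive hashing bands rather than scan the whole table; the linear scan above just keeps the sketch short.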
Related papers
- Generating executable oracles to check conformance of client code to requirements of JDK Javadocs using LLMs [21.06722050714324]
This paper focuses on automation of test oracles for clients of widely used Java libraries, e.g., the java.lang and java.util packages.
We use large language models as an enabling technology to embody our insight into a framework for test oracle automation.
arXiv Detail & Related papers (2024-11-04T04:24:25Z) - Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders [115.34050914216665]
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models.
We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features.
We assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models.
arXiv Detail & Related papers (2024-10-27T17:33:49Z) - Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning [50.26965628047682]
Adapting pre-trained models to open classes is a challenging problem in machine learning.
In this paper, we consider combining the advantages of both and come up with a test-time prompt tuning approach.
Our proposed method outperforms all comparison methods on average considering both base and new classes.
arXiv Detail & Related papers (2024-08-29T12:34:01Z) - CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation.
We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks.
We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
arXiv Detail & Related papers (2024-06-20T16:59:52Z) - Outside the Sandbox: A Study of Input/Output Methods in Java [0.0]
We manually categorized 1435 native methods in a Java Standard Edition distribution into non-I/O and I/O-related methods.
Results showed that 21% of the executed methods directly or indirectly called an I/O native method.
We conclude that ignoring I/O is not a viable option for tool designers and suggest the integration of I/O-related metadata with source code.
arXiv Detail & Related papers (2023-06-20T20:54:02Z) - Test-Time Training on Nearest Neighbors for Large Language Models [25.365366617508663]
We build a large-scale distributed index based on text embeddings of the Pile dataset.
For each test input, our system retrieves its neighbors and fine-tunes the model on their text.
Surprisingly, retrieving and training on as few as 20 neighbors drastically improves performance across more than 20 language modeling tasks.
arXiv Detail & Related papers (2023-05-29T08:03:28Z) - Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z) - Scaling Expert Language Models with Unsupervised Domain Discovery [107.08940500543447]
We introduce a simple but effective method to asynchronously train large, sparse language models on arbitrary text corpora.
Our method clusters a corpus into sets of related documents, trains a separate expert language model on each cluster, and combines them in a sparse ensemble for inference.
arXiv Detail & Related papers (2023-03-24T17:38:58Z) - Code Generation Tools (Almost) for Free? A Study of Few-Shot,
Pre-Trained Language Models on Code [13.15617135394116]
Few-shot learning with large-scale, pre-trained language models is a powerful way to answer questions about code.
This paper studies to what extent a state-of-the-art, pre-trained language model of code, Codex, may serve this purpose.
arXiv Detail & Related papers (2022-06-02T23:15:42Z) - SAT-Based Extraction of Behavioural Models for Java Libraries with
Collections [0.087024326813104]
Behavioural models are a valuable tool for software verification, testing, monitoring, publishing, etc.
They are rarely provided by the software developers and have to be extracted either from the source code or from the compiled code.
Most existing extraction approaches rely on the analysis of the compiled bytecode.
We are looking to extract behavioural models in the form of Finite State Machines (FSMs) from the Java source code to ensure that the obtained FSMs can be easily understood by the software developers.
arXiv Detail & Related papers (2022-05-30T17:27:13Z) - AVATAR: A Parallel Corpus for Java-Python Program Translation [77.86173793901139]
Program translation refers to migrating source code from one language to another.
We present AVATAR, a collection of 9,515 programming problems and their solutions written in two popular languages, Java and Python.
arXiv Detail & Related papers (2021-08-26T05:44:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.