JEMMA: An Extensible Java Dataset for ML4Code Applications
- URL: http://arxiv.org/abs/2212.09132v1
- Date: Sun, 18 Dec 2022 17:04:14 GMT
- Title: JEMMA: An Extensible Java Dataset for ML4Code Applications
- Authors: Anjan Karmakar, Miltiadis Allamanis, Romain Robbes
- Abstract summary: We introduce JEMMA, a large-scale, diverse, and high-quality dataset targeted at Machine Learning for Source Code (ML4Code).
Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks.
JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties.
- Score: 34.76698017961728
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine Learning for Source Code (ML4Code) is an active research field in
which extensive experimentation is needed to discover how to best use source
code's richly structured information. With this in mind, we introduce JEMMA, an
Extensible Java Dataset for ML4Code Applications, which is a large-scale,
diverse, and high-quality dataset targeted at ML4Code. Our goal with JEMMA is
to lower the barrier to entry in ML4Code by providing the building blocks to
experiment with source code models and tasks. JEMMA comes with a considerable
amount of pre-processed information such as metadata, representations (e.g.,
code tokens, ASTs, graphs), and several properties (e.g., metrics, static
analysis results) for 50,000 Java projects from the 50KC dataset, with over 1.2
million classes and over 8 million methods. JEMMA is also extensible, allowing
users to add new properties and representations to the dataset, and evaluate
tasks on them. Thus, JEMMA becomes a workbench that researchers can use to
experiment with novel representations and tasks operating on source code. To
demonstrate the utility of the dataset, we also report results from two
empirical studies on our data, ultimately showing that significant work lies
ahead in the design of context-aware source code models that can reason over a
broader network of source code entities in a software project, the very task
that JEMMA is designed to help with.
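Since this abstract does not describe the dataset's concrete access layer, the following is a minimal, hypothetical sketch of how a JEMMA-style workflow might look in Python: querying a shipped property and attaching a new one. The file name and column names (jemma_methods.csv, code_tokens, cyclomatic_complexity) are illustrative assumptions, not the dataset's documented schema.

```python
# Hypothetical sketch of a JEMMA-style workflow in pandas.
# File and column names are illustrative assumptions, not the
# dataset's documented schema.
import pandas as pd

# Load a (hypothetical) method-level table with shipped properties.
methods = pd.read_csv("jemma_methods.csv")

# Query an existing property: e.g., select high-complexity methods.
complex_methods = methods[methods["cyclomatic_complexity"] > 10]
print(f"{len(complex_methods)} methods exceed the complexity threshold")

# Extensibility in spirit: derive a new property and attach it as a
# column so downstream tasks can consume it alongside shipped ones.
methods["token_count"] = methods["code_tokens"].str.split().str.len()
methods.to_csv("jemma_methods_extended.csv", index=False)
```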
Related papers
- SEART Data Hub: Streamlining Large-Scale Source Code Mining and Pre-Processing [13.717170962455526]
We present the SEART Data Hub, a web application that allows researchers to easily build and pre-process large-scale datasets featuring code mined from public GitHub repositories.
Through a simple web interface, researchers can specify a set of mining criteria as well as specific pre-processing steps they want to perform.
After submitting the request, the user receives an email with a download link for the requested dataset within a few hours.
arXiv Detail & Related papers (2024-09-27T11:42:19Z) - LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation [13.800675921118348]
We propose TiCoder, a novel interactive workflow for guided intent clarification.
We present an empirical evaluation of the workflow's effectiveness in improving code generation accuracy.
We observe an average absolute improvement of 45.97% in pass@1 code generation accuracy across both datasets and all LLMs within 5 user interactions.
arXiv Detail & Related papers (2024-04-15T19:16:32Z)
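For context on the metric cited in the entry above: pass@1 is the probability that a single sampled generation passes the unit tests. The snippet below is a minimal sketch of the standard unbiased pass@k estimator from the HumanEval evaluation methodology (Chen et al., 2021); it is shown only to clarify the metric and is not TiCoder's own evaluation code.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k
    samples, drawn from n generations of which c are correct, passes."""
    if n - c < k:
        return 1.0
    # Numerically stable form of 1 - C(n-c, k) / C(n, k).
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 10 generations per problem, 3 pass the tests -> pass@1 = 0.3.
print(pass_at_k(n=10, c=3, k=1))
```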
arXiv Detail & Related papers (2024-04-15T19:16:32Z) - CodeLL: A Lifelong Learning Dataset to Support the Co-Evolution of Data
and Language Models of Code [6.491009626125319]
We introduce CodeLL, a lifelong learning dataset focused on code changes.
Our dataset aims to comprehensively capture code changes across the entire release history of open-source software repositories.
CodeLL enables researchers to study the behaviour of LMs in lifelong fine-tuning settings for learning code changes.
arXiv Detail & Related papers (2023-12-20T01:20:24Z) - Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator [63.762209407570715]
Genixer is a comprehensive data generation pipeline consisting of four key steps.
Training LLaVA1.5 on a synthetic VQA-like dataset enhances its performance on 10 out of 12 multimodal benchmarks.
MLLMs trained with task-specific datasets can surpass GPT-4V in generating complex instruction tuning data.
arXiv Detail & Related papers (2023-12-11T09:44:41Z) - ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z) - Towards the Imagenets of ML4EDA [24.696892205786742]
We describe our experience curating two large-scale, high-quality datasets for Verilog code generation and logic synthesis.
The first, VeriGen, is a dataset of Verilog code collected from GitHub and Verilog textbooks.
The second, OpenABC-D, is a large-scale, labeled dataset designed to aid ML for logic synthesis.
arXiv Detail & Related papers (2023-10-16T16:35:03Z) - MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the sampling ratio of different data types.
For quality, we resort to GPT-4 to generate high-quality data for each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
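As a rough illustration of the idea named in the entry above (not the paper's actual module), the sketch below turns per-type error rates from an evaluation round into sampling weights, so that data generation concentrates on the model's weak spots. The function, its floor parameter, and the proportional weighting scheme are assumptions made for illustration.

```python
def sampling_weights(error_rates: dict[str, float],
                     floor: float = 0.05) -> dict[str, float]:
    """Map per-type error rates to normalized sampling weights,
    keeping a small floor so no data type is starved entirely."""
    raw = {t: max(rate, floor) for t, rate in error_rates.items()}
    total = sum(raw.values())
    return {t: w / total for t, w in raw.items()}

# Example: the model is weakest on counting questions, so that type
# receives the largest share of newly generated training data.
print(sampling_weights({"counting": 0.40, "ocr": 0.20, "grounding": 0.10}))
```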
arXiv Detail & Related papers (2023-08-25T01:41:04Z) - CodeTF: One-stop Transformer Library for State-of-the-art Code LLM [72.1638273937025]
We present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence.
Our library supports a collection of pretrained Code LLMs and popular code benchmarks.
We hope CodeTF is able to bridge the gap between machine learning/generative AI and software engineering.
arXiv Detail & Related papers (2023-05-31T05:24:48Z) - Many or Few Samples? Comparing Transfer, Contrastive and Meta-Learning
in Encrypted Traffic Classification [68.19713459228369]
We compare transfer learning, meta-learning and contrastive learning against reference tree-based Machine Learning (ML) and monolithic DL models.
We show that (i) large datasets yield more general representations and (ii) contrastive learning is the best-performing methodology.
While tree-based ML models cannot handle large tasks but fit small tasks well, DL methods, by reusing learned representations, approach the performance of tree-based models even on small tasks.
arXiv Detail & Related papers (2023-05-21T11:20:49Z) - XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence [9.673614921946932]
This paper introduces XLCoST (Cross-Lingual Code SnippeT dataset), a new benchmark for cross-lingual code intelligence.
Our dataset contains fine-grained parallel data from 8 languages, and supports 10 cross-lingual code tasks.
arXiv Detail & Related papers (2022-06-16T22:49:39Z)