JEMMA: An Extensible Java Dataset for ML4Code Applications
- URL: http://arxiv.org/abs/2212.09132v1
- Date: Sun, 18 Dec 2022 17:04:14 GMT
- Title: JEMMA: An Extensible Java Dataset for ML4Code Applications
- Authors: Anjan Karmakar, Miltiadis Allamanis, Romain Robbes
- Abstract summary: We introduce JEMMA, a large-scale, diverse, and high-quality dataset targeted at Machine Learning for Source Code (ML4Code).
Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks.
JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties.
- Score: 34.76698017961728
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine Learning for Source Code (ML4Code) is an active research field in
which extensive experimentation is needed to discover how to best use source
code's richly structured information. With this in mind, we introduce JEMMA, an
Extensible Java Dataset for ML4Code Applications, which is a large-scale,
diverse, and high-quality dataset targeted at ML4Code. Our goal with JEMMA is
to lower the barrier to entry in ML4Code by providing the building blocks to
experiment with source code models and tasks. JEMMA comes with a considerable
amount of pre-processed information such as metadata, representations (e.g.,
code tokens, ASTs, graphs), and several properties (e.g., metrics, static
analysis results) for 50,000 Java projects from the 50KC dataset, with over 1.2
million classes and over 8 million methods. JEMMA is also extensible, allowing
users to add new properties and representations to the dataset, and evaluate
tasks on them. Thus, JEMMA becomes a workbench that researchers can use to
experiment with novel representations and tasks operating on source code. To
demonstrate the utility of the dataset, we also report results from two
empirical studies on our data, ultimately showing that significant work lies
ahead in the design of context-aware source code models that can reason over a
broader network of source code entities in a software project, the very task
that JEMMA is designed to help with.
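Since this abstract does not describe the dataset's concrete access layer, the following is a minimal, hypothetical sketch of how a JEMMA-style workflow might look in Python: querying a shipped property and attaching a new one. The file name and column names (jemma_methods.csv, code_tokens, cyclomatic_complexity) are illustrative assumptions, not the dataset's documented schema.

```python
# Hypothetical sketch of a JEMMA-style workflow in pandas.
# File and column names are illustrative assumptions, not the
# dataset's documented schema.
import pandas as pd

# Load a (hypothetical) method-level table with shipped properties.
methods = pd.read_csv("jemma_methods.csv")

# Query an existing property: e.g., select high-complexity methods.
complex_methods = methods[methods["cyclomatic_complexity"] > 10]
print(f"{len(complex_methods)} methods exceed the complexity threshold")

# Extensibility in spirit: derive a new property and attach it as a
# column so downstream tasks can consume it alongside shipped ones.
methods["token_count"] = methods["code_tokens"].str.split().str.len()
methods.to_csv("jemma_methods_extended.csv", index=False)
```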
Related papers
- SEART Data Hub: Streamlining Large-Scale Source Code Mining and Pre-Processing [13.717170962455526]
We present the SEART Data Hub, a web application that allows researchers to easily build and pre-process large-scale datasets featuring code mined from public GitHub repositories.
Through a simple web interface, researchers can specify a set of mining criteria as well as specific pre-processing steps they want to perform.
After submitting the request, the user receives an email with a download link for the requested dataset within a few hours.
arXiv Detail & Related papers (2024-09-27T11:42:19Z) - LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation [13.800675921118348]
We propose TiCoder, a novel interactive workflow for guided intent clarification.
We present an empirical evaluation of the workflow's effectiveness in improving code generation accuracy.
We observe an average absolute improvement of 45.97% in pass@1 code generation accuracy across both datasets and all LLMs within 5 user interactions.
arXiv Detail & Related papers (2024-04-15T19:16:32Z)
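For context on the metric cited in the entry above: pass@1 is the probability that a single sampled generation passes the unit tests. The snippet below is a minimal sketch of the standard unbiased pass@k estimator from the HumanEval evaluation methodology (Chen et al., 2021); it is shown only to clarify the metric and is not TiCoder's own evaluation code.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k
    samples, drawn from n generations of which c are correct, passes."""
    if n - c < k:
        return 1.0
    # Numerically stable form of 1 - C(n-c, k) / C(n, k).
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 10 generations per problem, 3 pass the tests -> pass@1 = 0.3.
print(pass_at_k(n=10, c=3, k=1))
```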
arXiv Detail & Related papers (2024-04-15T19:16:32Z) - CodeLL: A Lifelong Learning Dataset to Support the Co-Evolution of Data
and Language Models of Code [6.491009626125319]
We introduce CodeLL, a lifelong learning dataset focused on code changes.
Our dataset aims to comprehensively capture code changes across the entire release history of open-source software repositories.
CodeLL enables researchers to study the behaviour of LMs in lifelong fine-tuning settings for learning code changes.
arXiv Detail & Related papers (2023-12-20T01:20:24Z) - Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator [63.762209407570715]
Genixer is a comprehensive data generation pipeline consisting of four key steps.
Training LLaVA1.5 on a synthetic VQA-like dataset enhances its performance on 10 out of 12 multimodal benchmarks.
MLLMs trained with task-specific datasets can surpass GPT-4V in generating complex instruction tuning data.
arXiv Detail & Related papers (2023-12-11T09:44:41Z) - ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z) - Towards the Imagenets of ML4EDA [24.696892205786742]
We describe our experience curating two large-scale, high-quality datasets for Verilog code generation and logic synthesis.
The first, VeriGen, is a dataset of Verilog code collected from GitHub and Verilog textbooks.
The second, OpenABC-D, is a large-scale, labeled dataset designed to aid ML for logic synthesis.
arXiv Detail & Related papers (2023-10-16T16:35:03Z) - MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the sampling ratio of different data types.
For quality, we resort to GPT-4 to generate high-quality data for each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
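As a rough illustration of the idea named in the entry above (not the paper's actual module), the sketch below turns per-type error rates from an evaluation round into sampling weights, so that data generation concentrates on the model's weak spots. The function, its floor parameter, and the proportional weighting scheme are assumptions made for illustration.

```python
def sampling_weights(error_rates: dict[str, float],
                     floor: float = 0.05) -> dict[str, float]:
    """Map per-type error rates to normalized sampling weights,
    keeping a small floor so no data type is starved entirely."""
    raw = {t: max(rate, floor) for t, rate in error_rates.items()}
    total = sum(raw.values())
    return {t: w / total for t, w in raw.items()}

# Example: the model is weakest on counting questions, so that type
# receives the largest share of newly generated training data.
print(sampling_weights({"counting": 0.40, "ocr": 0.20, "grounding": 0.10}))
```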
arXiv Detail & Related papers (2023-08-25T01:41:04Z) - CodeTF: One-stop Transformer Library for State-of-the-art Code LLM [72.1638273937025]
We present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence.
Our library supports a collection of pretrained Code LLMs and popular code benchmarks.
We hope CodeTF is able to bridge the gap between machine learning/generative AI and software engineering.
arXiv Detail & Related papers (2023-05-31T05:24:48Z) - Many or Few Samples? Comparing Transfer, Contrastive and Meta-Learning
in Encrypted Traffic Classification [68.19713459228369]
We compare transfer learning, meta-learning and contrastive learning against reference tree-based Machine Learning (ML) and monolithic DL models.
We show that (i) large datasets yield more general representations and (ii) contrastive learning is the best-performing methodology.
While tree-based ML models cannot handle large tasks but fit small tasks well, DL methods, by reusing learned representations, approach the performance of tree-based models even on small tasks.
arXiv Detail & Related papers (2023-05-21T11:20:49Z) - XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence [9.673614921946932]
This paper introduces XLCoST (Cross-Lingual Code SnippeT dataset), a new benchmark for cross-lingual code intelligence.
Our dataset contains fine-grained parallel data from 8 languages, and supports 10 cross-lingual code tasks.
arXiv Detail & Related papers (2022-06-16T22:49:39Z)