Related papers: SEART Data Hub: Streamlining Large-Scale Source Code Mining and Pre-Processing

SEART Data Hub: Streamlining Large-Scale Source Code Mining and Pre-Processing

URL: http://arxiv.org/abs/2409.18658v1
Date: Fri, 27 Sep 2024 11:42:19 GMT
Title: SEART Data Hub: Streamlining Large-Scale Source Code Mining and Pre-Processing
Authors: Ozren Dabić, Rosalia Tufano, Gabriele Bavota,
Abstract summary: We present the SEART Data Hub, a web application that allows to easily build and pre-process large-scale datasets featuring code mined from public GitHub repositories. Through a simple web interface, researchers can specify a set of mining criteria as well as specific pre-processing steps they want to perform. After submitting the request, the user will receive an email with a download link for the required dataset within a few hours.
Score: 13.717170962455526
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large-scale code datasets have acquired an increasingly central role in software engineering (SE) research. This is the result of (i) the success of the mining software repositories (MSR) community, that pushed the standards of empirical studies in SE; and (ii) the recent advent of deep learning (DL) in software engineering, with models trained and tested on large source code datasets. While there exist some ready-to-use datasets in the literature, researchers often need to build and pre-process their own dataset to meet specific requirements of the study/technique they are working on. This implies a substantial cost in terms of time and computational resources. In this work we present the SEART Data Hub, a web application that allows to easily build and pre-process large-scale datasets featuring code mined from public GitHub repositories. Through a simple web interface, researchers can specify a set of mining criteria (e.g., only collect code from repositories having more than 100 contributors and more than 1,000 commits) as well as specific pre-processing steps they want to perform (e.g., remove duplicates, test code, instances with syntax errors). After submitting the request, the user will receive an email with a download link for the required dataset within a few hours. A video showcasing the SEART Data Hub is available at https://youtu.be/lCgQaA7CYWA.

Related papers

SWE-smith: Scaling Data for Software Engineering Agents [100.30273957706237]
SWE-smith is a novel pipeline for generating software engineering training data at scale. We create a dataset of 50k instances sourced from 128 GitHub repositories. We train SWE-agent-LM-32B, achieving 40.2% Pass@1 resolve rate on the SWE-bench Verified benchmark.
arXiv Detail & Related papers (2025-04-30T16:56:06Z)
UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance [65.01483640267885]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. We introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to guide and validate the code generation process. Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora.
arXiv Detail & Related papers (2025-02-17T05:37:02Z)
SnipGen: A Mining Repository Framework for Evaluating LLMs for Code [51.07471575337676]
Language Models (LLMs) are trained on extensive datasets that include code repositories. evaluating their effectiveness poses significant challenges due to the potential overlap between the datasets used for training and those employed for evaluation. We introduce SnipGen, a comprehensive repository mining framework designed to leverage prompt engineering across various downstream tasks for code generation.
arXiv Detail & Related papers (2025-02-10T21:28:15Z)
Cuvis.Ai: An Open-Source, Low-Code Software Ecosystem for Hyperspectral Processing and Classification [0.4038539043067986]
cuvis.ai is an open-source and low-code software ecosystem for data acquisition, preprocessing, and model training. The package is written in Python and provides wrappers around common machine learning libraries.
arXiv Detail & Related papers (2024-11-18T06:33:40Z)
Towards a Classification of Open-Source ML Models and Datasets for Software Engineering [52.257764273141184]
Open-source Pre-Trained Models (PTMs) and datasets provide extensive resources for various Machine Learning (ML) tasks. These resources lack a classification tailored to Software Engineering (SE) needs. We apply an SE-oriented classification to PTMs and datasets on a popular open-source ML repository, Hugging Face (HF), and analyze the evolution of PTMs over time.
arXiv Detail & Related papers (2024-11-14T18:52:05Z)
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [70.72097493954067]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems. While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs remain limited. We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z)
Long Code Arena: a Set of Benchmarks for Long-Context Code Models [75.70507534322336]
Long Code Arena is a suite of six benchmarks for code processing tasks that require project-wide context. These tasks cover different aspects of code processing: library-based code generation, CI builds repair, project-level code completion, commit message generation, bug localization, and module summarization. For each task, we provide a manually verified dataset for testing, an evaluation suite, and open-source baseline solutions.
arXiv Detail & Related papers (2024-06-17T14:58:29Z)
DataAgent: Evaluating Large Language Models' Ability to Answer Zero-Shot, Natural Language Queries [0.0]
We evaluate OpenAI's GPT-3.5 as a "Language Data Scientist" (LDS) The model was tested on a diverse set of benchmark datasets to evaluate its performance across multiple standards.
arXiv Detail & Related papers (2024-03-29T22:59:34Z)
Generating QM1B with PySCF$_{\text{IPU}}$ [40.29005019051567]
This paper introduces the data generator PySCF$_textIPU$ using Intelligence Processing Units (IPUs) It allows us to create the dataset QM1B with one billion training examples containing 9-11 heavy atoms. We highlight several limitations of QM1B and emphasise the low-resolution of our DFT options, which also serves as motivation for even larger, more accurate datasets.
arXiv Detail & Related papers (2023-11-02T10:31:20Z)
Fingerprinting and Building Large Reproducible Datasets [3.2873782624127843]
We propose a tool-supported approach facilitating the creation of large tailored datasets while ensuring their provenance. We propose a way to define a unique fingerprint to characterize a dataset which, when provided to the extraction process, ensures that the same dataset will be extracted.
arXiv Detail & Related papers (2023-06-20T08:59:33Z)
Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow [49.724842920942024]
Industries such as finance, meteorology, and energy generate vast amounts of data daily. We propose Data-Copilot, a data analysis agent that autonomously performs querying, processing, and visualization of massive data tailored to diverse human requests.
arXiv Detail & Related papers (2023-06-12T16:12:56Z)
JEMMA: An Extensible Java Dataset for ML4Code Applications [34.76698017961728]
We introduce JEMMA, a large-scale, diverse, and high-quality dataset targeted at Machine Learning for Source Code (ML4Code) Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks. JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties.
arXiv Detail & Related papers (2022-12-18T17:04:14Z)
KILT: a Benchmark for Knowledge Intensive Language Tasks [102.33046195554886]
We present a benchmark for knowledge-intensive language tasks (KILT) All tasks in KILT are grounded in the same snapshot of Wikipedia. We find that a shared dense vector index coupled with a seq2seq model is a strong baseline.
arXiv Detail & Related papers (2020-09-04T15:32:19Z)
MSC: A Dataset for Macro-Management in StarCraft II [52.52008929278214]
We release a new macro-management dataset based on the platform SC2LE. MSC consists of well-designed feature vectors, pre-defined high-level actions and final result of each match. Besides the dataset, we propose a baseline model and present initial baseline results for global state evaluation and build order prediction.
arXiv Detail & Related papers (2017-10-09T14:59:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.