CAM: A Collection of Snapshots of GitHub Java Repositories Together with Metrics
- URL: http://arxiv.org/abs/2403.08488v1
- Date: Wed, 13 Mar 2024 12:52:57 GMT
- Title: CAM: A Collection of Snapshots of GitHub Java Repositories Together with Metrics
- Authors: Yegor Bugayenko
- Abstract summary: The latest archive of 2.2Gb that we published on the 2nd of March, 2024 includes 532K Java classes with 48 metrics for each class.
At least once a year, we execute the entire script, a process which requires a minimum of ten days on a very powerful server, to generate a new dataset.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Even though numerous researchers require stable datasets along with source
code and basic metrics calculated on them, neither GitHub nor any other code
hosting platform provides such a resource. Consequently, each researcher must
download their own data, compute the necessary metrics, and then publish the
dataset somewhere to ensure it remains accessible indefinitely. Our CAM (stands
for "Classes and Metrics") project addresses this need. It is open-source
software capable of cloning Java repositories from GitHub, filtering out
unnecessary files, parsing Java classes, and computing metrics such as
Cyclomatic Complexity, Halstead Effort and Volume, C&K metrics,
Maintainability Metrics, LCOM5 and HND, as well as some Git-based Metrics. At
least once a year, we execute the entire script, a process which requires a
minimum of ten days on a very powerful server, to generate a new dataset.
Subsequently, we publish it on Amazon S3, thereby ensuring its availability as
a reference for researchers. The latest archive of 2.2Gb that we published on
the 2nd of March, 2024 includes 532K Java classes with 48 metrics for each
class.
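To make two of the metrics named in the abstract concrete, here is a minimal sketch of how Cyclomatic Complexity and Halstead Volume could be approximated for a Java snippet. It is not CAM's implementation: the function names, the regular expressions, and the sample method are assumptions made for illustration, and a real tool would use a proper Java parser rather than pattern matching (for example, this sketch also counts the `?` in a generic wildcard such as `List<?>`).

```python
import math
import re

# Comments and literals are blanked out first so that keywords inside
# them are not counted as branches (hypothetical helper pattern).
_NOISE = re.compile(
    r"//[^\n]*"             # line comments
    r"|/\*.*?\*/"           # block comments
    r'|"(?:\\.|[^"\\])*"'   # string literals
    r"|'(?:\\.|[^'\\])*'",  # char literals
    re.DOTALL,
)

# Tokens that add one decision point each in this rough approximation.
_BRANCH = re.compile(r"\b(?:if|for|while|case|catch)\b|&&|\|\||\?")


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: decision points plus one."""
    code = _NOISE.sub(" ", source)
    return 1 + len(_BRANCH.findall(code))


def halstead_volume(operators: list[str], operands: list[str]) -> float:
    """Halstead Volume V = N * log2(n) for already-tokenized code.

    N is the total number of operators and operands, and n is the
    number of distinct operators plus distinct operands.
    """
    vocabulary = len(set(operators)) + len(set(operands))
    length = len(operators) + len(operands)
    return length * math.log2(vocabulary) if vocabulary > 0 else 0.0


# Hypothetical Java method used only to exercise the sketch.
sample = """
int sign(int x) {
    // if (ignored) while (ignored)
    if (x > 0 && x < 100) { return 1; }
    for (int i = 0; i < x; i++) { }
    return x < 0 ? -1 : 0;
}
"""

print(cyclomatic_complexity(sample))  # 5: if, &&, for, ? plus one
print(halstead_volume(["="], ["a"]))  # 2.0: N = 2, n = 2
```

The sketch counts decision points over the whole snippet; per-method or per-class aggregation, as a dataset with one row per class would need, is left out here.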
Related papers
- SEART Data Hub: Streamlining Large-Scale Source Code Mining and Pre-Processing [13.717170962455526]
We present the SEART Data Hub, a web application that allows researchers to easily build and pre-process large-scale datasets featuring code mined from public GitHub repositories.
Through a simple web interface, researchers can specify a set of mining criteria as well as specific pre-processing steps they want to perform.
After submitting the request, the user will receive an email with a download link for the required dataset within a few hours.
arXiv Detail & Related papers (2024-09-27T11:42:19Z)
- Long Code Arena: a Set of Benchmarks for Long-Context Code Models [75.70507534322336]
Long Code Arena is a suite of six benchmarks for code processing tasks that require project-wide context.
These tasks cover different aspects of code processing: library-based code generation, CI builds repair, project-level code completion, commit message generation, bug localization, and module summarization.
For each task, we provide a manually verified dataset for testing, an evaluation suite, and open-source baseline solutions.
arXiv Detail & Related papers (2024-06-17T14:58:29Z)
- VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development.
We propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM).
We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge.
arXiv Detail & Related papers (2024-06-11T16:15:06Z)
- MIML library: a Modular and Flexible Library for Multi-instance Multi-label Learning [0.0]
MIML library is a Java software tool to develop, test, and compare classification algorithms for multi-instance multi-label (MIML) learning.
The library includes 43 algorithms and provides a specific format and facilities for data managing and partitioning, holdout and cross-validation methods.
arXiv Detail & Related papers (2024-02-12T20:46:47Z)
- GitAgent: Facilitating Autonomous Agent with GitHub by Tool Extension [81.44231422624055]
A growing area of research focuses on Large Language Models (LLMs) equipped with external tools capable of performing diverse tasks.
In this paper, we introduce GitAgent, an agent capable of autonomously extending its tools from GitHub.
arXiv Detail & Related papers (2023-12-28T15:47:30Z)
- A Language Model of Java Methods with Train/Test Deduplication [5.529795221640365]
This tool demonstration presents a research toolkit for a language model of Java source code.
The target audience includes researchers studying problems at the granularity level of subroutines, statements, or variables in Java.
arXiv Detail & Related papers (2023-05-15T00:22:02Z)
- SequeL: A Continual Learning Library in PyTorch and JAX [50.33956216274694]
SequeL is a library for Continual Learning that supports both PyTorch and JAX frameworks.
It provides a unified interface for a wide range of Continual Learning algorithms, including regularization-based approaches, replay-based approaches, and hybrid approaches.
We release SequeL as an open-source library, enabling researchers and developers to easily experiment and extend the library for their own purposes.
arXiv Detail & Related papers (2023-04-21T10:00:22Z)
- JEMMA: An Extensible Java Dataset for ML4Code Applications [34.76698017961728]
We introduce JEMMA, a large-scale, diverse, and high-quality dataset targeted at Machine Learning for Source Code (ML4Code).
Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks.
JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties.
arXiv Detail & Related papers (2022-12-18T17:04:14Z)
- Repro: An Open-Source Library for Improving the Reproducibility and Usability of Publicly Available Research Code [74.28810048824519]
Repro is an open-source library which aims at improving the usability of research code.
It provides a lightweight Python API for running software released by researchers within Docker containers.
arXiv Detail & Related papers (2022-04-29T01:54:54Z)
- Datasets: A Community Library for Natural Language Processing [55.48866401721244]
datasets is a community library for contemporary NLP.
The library includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects.
arXiv Detail & Related papers (2021-09-07T03:59:22Z)
- LabelGit: A Dataset for Software Repositories Classification using Attributed Dependency Graphs [11.523471275501857]
We create a new dataset of GitHub projects called LabelGit.
Our dataset uses direct information from the source code, like the dependency graph and source code neural representations from the identifiers.
We hope to aid the development of solutions that do not rely on proxies but use the entire source code to perform classification.
arXiv Detail & Related papers (2021-03-16T07:28:58Z)