LabelGit: A Dataset for Software Repositories Classification using
Attributed Dependency Graphs
- URL: http://arxiv.org/abs/2103.08890v1
- Date: Tue, 16 Mar 2021 07:28:58 GMT
- Title: LabelGit: A Dataset for Software Repositories Classification using
Attributed Dependency Graphs
- Authors: Cezar Sas, Andrea Capiluppi
- Abstract summary: We create a new dataset of GitHub projects called LabelGit.
Our dataset uses direct information from the source code, like the dependency graph and source code neural representations from the identifiers.
We hope to aid the development of solutions that do not rely on proxies but use the entire source code to perform classification.
- Score: 11.523471275501857
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Software repository hosting services contain large amounts of open-source
software, with GitHub hosting more than 100 million repositories, from new to
established ones. Given this vast amount of projects, there is a pressing need
for a search based on the software's content and features. However, even though
GitHub offers various solutions to aid software discovery, most repositories do
not have any labels, reducing the utility of search and topic-based analysis.
Moreover, classifying software modules is also gaining importance given
the increase in Component-Based Software Development. However, previous work
focused on software classification using keyword-based approaches or proxies
for the project (e.g., README), which are not always available. In this work, we
create a new annotated dataset of GitHub Java projects called LabelGit. Our
dataset uses direct information from the source code, like the dependency graph
and source code neural representations from the identifiers. Using this
dataset, we hope to aid the development of solutions that do not rely on
proxies but use the entire source code to perform classification.
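As a rough illustration of what an "attributed dependency graph" could look like, the sketch below builds a tiny graph whose nodes carry feature vectors derived from identifiers. Everything in it is an assumption made for illustration: the class and function names (`AttributedDependencyGraph`, `split_identifier`, `embed_identifiers`), the example node names, and above all the hashed bag-of-subtokens vector, which is only a stand-in for the neural identifier representations mentioned in the abstract. It is not the LabelGit format or the authors' pipeline.

```python
import re
import zlib
from dataclasses import dataclass, field
from typing import Dict, List, Set

DIM = 8  # hypothetical vector size, chosen only to keep the example small


def split_identifier(name: str) -> List[str]:
    """Split a camelCase/PascalCase identifier into lowercase subtokens."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def embed_identifiers(identifiers: List[str]) -> List[float]:
    """Toy stand-in for a neural representation: hashed subtoken counts."""
    vector = [0.0] * DIM
    for identifier in identifiers:
        for token in split_identifier(identifier):
            # crc32 is deterministic across runs, unlike the built-in hash()
            vector[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    return vector


@dataclass
class AttributedDependencyGraph:
    # node name -> feature vector attached to that node
    features: Dict[str, List[float]] = field(default_factory=dict)
    # node name -> names of the nodes it depends on
    edges: Dict[str, Set[str]] = field(default_factory=dict)

    def add_node(self, name: str, vector: List[float]) -> None:
        self.features[name] = vector
        self.edges.setdefault(name, set())

    def add_dependency(self, src: str, dst: str) -> None:
        # Both endpoints must already exist as attributed nodes.
        if src not in self.features or dst not in self.features:
            raise KeyError("both nodes must be added before linking them")
        self.edges[src].add(dst)


# Hypothetical two-class project: HttpClient depends on JsonParser.
graph = AttributedDependencyGraph()
graph.add_node("HttpClient", embed_identifiers(["sendRequest", "parseResponse"]))
graph.add_node("JsonParser", embed_identifiers(["parseObject"]))
graph.add_dependency("HttpClient", "JsonParser")
```

A classifier working on such a structure would see both the dependency edges and the per-node vectors, rather than a proxy such as the README.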
Related papers
- RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph [63.87660059104077]
We present RepoGraph, a plug-in module that manages a repository-level structure for modern AI software engineering solutions.
RepoGraph substantially boosts the performance of all systems, leading to a new state-of-the-art among open-source frameworks.
arXiv Detail & Related papers (2024-10-03T05:45:26Z)
- Long Code Arena: a Set of Benchmarks for Long-Context Code Models [75.70507534322336]
Long Code Arena is a suite of six benchmarks for code processing tasks that require project-wide context.
These tasks cover different aspects of code processing: library-based code generation, CI builds repair, project-level code completion, commit message generation, bug localization, and module summarization.
For each task, we provide a manually verified dataset for testing, an evaluation suite, and open-source baseline solutions.
arXiv Detail & Related papers (2024-06-17T14:58:29Z)
- How to Understand Whole Software Repository? [64.19431011897515]
An excellent understanding of the whole repository will be the critical path to Automatic Software Engineering (ASE).
We develop a novel method named RepoUnderstander that guides agents to comprehensively understand whole repositories.
To better utilize the repository-level knowledge, we guide the agents to summarize, analyze, and plan.
arXiv Detail & Related papers (2024-06-03T15:20:06Z)
- RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation [96.75695811963242]
RepoCoder is a framework to streamline the repository-level code completion process.
It incorporates a similarity-based retriever and a pre-trained code language model.
It consistently outperforms the vanilla retrieval-augmented code completion approach.
arXiv Detail & Related papers (2023-03-22T13:54:46Z)
- Deep learning for table detection and structure recognition: A survey [49.09628624903334]
The goal of this survey is to provide a profound comprehension of the major developments in the field of Table Detection.
We provide an analysis of both classic and new applications in the field.
The datasets and source code of the existing models are organized to provide the reader with a compass on this vast literature.
arXiv Detail & Related papers (2022-11-15T19:42:27Z)
- Automatically Categorising GitHub Repositories by Application Domain [14.265666415804025]
GitHub is the largest host of open source software on the Internet.
It is becoming increasingly hard to navigate the plethora of repositories which span a wide range of domains.
Past work has shown that taking the application domain into account is crucial for tasks such as predicting the popularity of a repository.
arXiv Detail & Related papers (2022-07-30T16:27:16Z)
- Semantically-enhanced Topic Recommendation System for Software Projects [2.0625936401496237]
Tagging software repositories with relevant topics can be exploited for facilitating various downstream tasks.
There have been efforts to recommend topics for software projects; however, the semantic relationships among these topics have not been exploited so far.
We propose two recommender models for tagging software projects that incorporate the semantic relationship among topics.
arXiv Detail & Related papers (2022-05-31T19:54:42Z)
- GitRank: A Framework to Rank GitHub Repositories [0.0]
Open-source repositories provide a wealth of information and are increasingly being used to build artificial intelligence (AI) based systems.
In this hackathon, we utilize known code quality measures and the GrimoireLab toolkit to implement a framework, named GitRank, to rank open-source repositories on three different criteria.
arXiv Detail & Related papers (2022-05-04T23:42:30Z)
- Predicting Issue Types on GitHub [8.791809365994682]
Ticket Tagger is a GitHub app that analyzes the issue title and description through machine learning techniques.
We empirically evaluated the tool's prediction performance on about 30,000 GitHub issues.
arXiv Detail & Related papers (2021-07-21T08:14:48Z)
- Benchmarking Graph Neural Networks [75.42159546060509]
Graph neural networks (GNNs) have become the standard toolkit for analyzing and learning from data on graphs.
For any successful field to become mainstream and reliable, benchmarks must be developed to quantify progress.
The GitHub repository has reached 1,800 stars and 339 forks, which demonstrates the utility of the proposed open-source framework.
arXiv Detail & Related papers (2020-03-02T15:58:46Z)