Automatically Categorising GitHub Repositories by Application Domain
- URL: http://arxiv.org/abs/2208.00269v1
- Date: Sat, 30 Jul 2022 16:27:16 GMT
- Title: Automatically Categorising GitHub Repositories by Application Domain
- Authors: Francisco Zanartu, Christoph Treude, Bruno Cartaxo, Hudson Silva Borges, Pedro Moura, Markus Wagner, and Gustavo Pinto
- Abstract summary: GitHub is the largest host of open source software on the Internet.
It is becoming increasingly hard to navigate the plethora of repositories which span a wide range of domains.
Past work has shown that taking the application domain into account is crucial for tasks such as predicting the popularity of a repository.
- Score: 14.265666415804025
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: GitHub is the largest host of open source software on the Internet. This
large, freely accessible database has attracted the attention of practitioners
and researchers alike. But as GitHub's growth continues, it is becoming
increasingly hard to navigate the plethora of repositories which span a wide
range of domains. Past work has shown that taking the application domain into
account is crucial for tasks such as predicting the popularity of a repository
and reasoning about project quality. In this work, we build on a previously
annotated dataset of 5,000 GitHub repositories to design an automated
classifier for categorising repositories by their application domain. The
classifier uses state-of-the-art natural language processing techniques and
machine learning to learn from multiple data sources and catalogue repositories
according to five application domains. We contribute with (1) an automated
classifier that can assign popular repositories to each application domain with
at least 70% precision, (2) an investigation of the approach's performance on
less popular repositories, and (3) a practical application of this approach to
answer how the adoption of software engineering practices differs across
application domains. Our work aims to help the GitHub community identify
repositories of interest and opens promising avenues for future work
investigating differences between repositories from different application
domains.
Related papers
- RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph [63.87660059104077]
We present RepoGraph, a plug-in module that manages a repository-level structure for modern AI software engineering solutions.
RepoGraph substantially boosts the performance of all systems, leading to a new state-of-the-art among open-source frameworks.
arXiv Detail & Related papers (2024-10-03T05:45:26Z)
- GitHub Marketplace for Automation and Innovation in Software Production [2.0749231618270803]
GitHub Marketplace hosts automation tools to assist developers with the production of their GitHub-hosted projects.
This study explores the platform's characteristics, features, and policies and identifies common themes in production automation.
arXiv Detail & Related papers (2024-07-07T23:55:15Z)
- How to Understand Whole Software Repository? [64.19431011897515]
An excellent understanding of the whole repository will be the critical path to Automatic Software Engineering (ASE).
We develop a novel method named RepoUnderstander by guiding agents to comprehensively understand whole repositories.
To better utilize the repository-level knowledge, we guide the agents to summarize, analyze, and plan.
arXiv Detail & Related papers (2024-06-03T15:20:06Z)
- ReGAL: Refactoring Programs to Discover Generalizable Abstractions [59.05769810380928]
Refactoring for Generalizable Abstraction Learning (ReGAL) is a method for learning a library of reusable functions via code refactorization.
We find that the shared function libraries discovered by ReGAL make programs easier to predict across diverse domains.
For CodeLlama-13B, ReGAL results in absolute accuracy increases of 11.5% on LOGO, 26.1% on date understanding, and 8.1% on TextCraft, outperforming GPT-3.5 in two of three domains.
arXiv Detail & Related papers (2024-01-29T18:45:30Z)
- Query of CC: Unearthing Large Scale Domain-Specific Knowledge from Public Corpora [104.16648246740543]
We propose an efficient data collection method based on large language models.
The method bootstraps seed information through a large language model and retrieves related data from public corpora.
It not only collects knowledge-related data for specific domains but unearths the data with potential reasoning procedures.
arXiv Detail & Related papers (2024-01-26T03:38:23Z)
- GitAgent: Facilitating Autonomous Agent with GitHub by Tool Extension [81.44231422624055]
A growing area of research focuses on Large Language Models (LLMs) equipped with external tools capable of performing diverse tasks.
In this paper, we introduce GitAgent, an agent capable of achieving the autonomous tool extension from GitHub.
arXiv Detail & Related papers (2023-12-28T15:47:30Z)
- Deep learning for table detection and structure recognition: A survey [49.09628624903334]
The goal of this survey is to provide a profound comprehension of the major developments in the field of Table Detection.
We provide an analysis of both classic and new applications in the field.
The datasets and source code of the existing models are organized to provide the reader with a compass on this vast literature.
arXiv Detail & Related papers (2022-11-15T19:42:27Z)
- Topical: Learning Repository Embeddings from Source Code using Attention [3.110769442802435]
This paper presents Topical, a novel deep neural network for repository level embeddings.
The attention mechanism generates repository-level representations from source code, full dependency graphs, and script level textual data.
arXiv Detail & Related papers (2022-08-19T18:13:27Z)
- GitHub Actions: The Impact on the Pull Request Process [7.047566396769727]
This study investigates how projects use GitHub Actions, what the developers discuss about them, and how project activity indicators change after their adoption.
Our results indicate that 1,489 out of 5,000 most popular repositories (almost 30% of our sample) adopt GitHub Actions.
Our findings also suggest that the adoption of GitHub Actions leads to more rejections of pull requests (PRs), more communication in accepted PRs and less communication in rejected PRs.
arXiv Detail & Related papers (2022-06-28T16:24:17Z)
- GitRank: A Framework to Rank GitHub Repositories [0.0]
Open-source repositories provide a wealth of information and are increasingly being used to build artificial intelligence (AI) based systems.
In this hackathon, we utilize known code quality measures and the GrimoireLab toolkit to implement a framework, named GitRank, to rank open-source repositories on three different criteria.
arXiv Detail & Related papers (2022-05-04T23:42:30Z)
- LabelGit: A Dataset for Software Repositories Classification using Attributed Dependency Graphs [11.523471275501857]
We create a new dataset of GitHub projects called LabelGit.
Our dataset uses direct information from the source code, like the dependency graph and source code neural representations from the identifiers.
We hope to aid the development of solutions that do not rely on proxies but use the entire source code to perform classification.
arXiv Detail & Related papers (2021-03-16T07:28:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.