LEGION: Harnessing Pre-trained Language Models for GitHub Topic
Recommendations with Distribution-Balance Loss
- URL: http://arxiv.org/abs/2403.05873v1
- Date: Sat, 9 Mar 2024 10:49:31 GMT
- Title: LEGION: Harnessing Pre-trained Language Models for GitHub Topic
Recommendations with Distribution-Balance Loss
- Authors: Yen-Trang Dang, Thanh-Le Cong, Phuc-Thanh Nguyen, Anh M. T. Bui,
Phuong T. Nguyen, Bach Le, Quyet-Thang Huynh
- Abstract summary: Current methods for automatic topic recommendation rely heavily on TF-IDF for encoding textual data.
This paper proposes Legion, a novel approach that leverages Pre-trained Language Models (PTMs) for recommending topics for GitHub repositories.
Our empirical evaluation on a benchmark dataset of real-world GitHub repositories shows that Legion can improve vanilla PTMs by up to 26% at recommending GitHub topics.
- Score: 3.946772434700026
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Open-source development has revolutionized the software industry by promoting
collaboration, transparency, and community-driven innovation. Today, a vast
amount of open-source software of various kinds, forming networks of
repositories, is often hosted on GitHub - a popular software development
platform. To enhance the discoverability of these repository networks, i.e.,
groups of similar repositories, GitHub introduced repository topics in 2017
that enable users to more easily explore relevant projects by type, technology,
and more. It is thus crucial to accurately assign topics for each GitHub
repository. Current methods for automatic topic recommendation rely heavily on
TF-IDF for encoding textual data, presenting challenges in understanding
semantic nuances. This paper addresses the limitations of existing techniques
by proposing Legion, a novel approach that leverages Pre-trained Language
Models (PTMs) for recommending topics for GitHub repositories. The key novelty
of Legion is three-fold. First, Legion leverages the extensive capabilities of
PTMs in language understanding to capture contextual information and semantic
meaning in GitHub repositories. Second, Legion overcomes the challenge of
long-tailed distribution, which results in a bias toward popular topics in
PTMs, by proposing a Distribution-Balanced Loss (DB Loss) to better train the
PTMs. Third, Legion employs a filter to eliminate vague recommendations,
thereby improving the precision of PTMs. Our empirical evaluation on a
benchmark dataset of real-world GitHub repositories shows that Legion can
improve vanilla PTMs by up to 26% at recommending GitHub topics. Legion also
suggests GitHub topics more precisely and effectively than the
state-of-the-art baseline, with average improvements of 20% and 5% in
Precision and F1-score, respectively.
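To make the abstract's second and third contributions concrete, below is a minimal Python sketch (PyTorch) of the general ideas: re-weighting a multi-label classification loss by inverse topic frequency to counter the long-tailed topic distribution, and thresholding predicted probabilities to filter out vague recommendations. This is a simplified illustration, not the paper's exact method: the names (FrequencyWeightedBCE, filter_topics), the weighting scheme, and the 0.5 threshold are assumptions, and the actual Distribution-Balanced Loss adds further re-balancing and negative-tolerant terms.

```python
# Simplified sketch of two ideas from the Legion abstract (assumptions noted
# in the surrounding text): frequency-based re-weighting of a multi-label
# loss, and a confidence filter over predicted topics.
import torch
import torch.nn as nn

class FrequencyWeightedBCE(nn.Module):
    """Multi-label BCE re-weighted by inverse topic frequency (a stand-in
    for the paper's Distribution-Balanced Loss)."""
    def __init__(self, topic_counts: torch.Tensor):
        super().__init__()
        freq = topic_counts.float() / topic_counts.sum()
        w = 1.0 / (freq + 1e-8)                        # rare topics get larger weights
        self.register_buffer("weights", w / w.mean())  # normalize to mean 1

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        bce = nn.functional.binary_cross_entropy_with_logits(
            logits, targets, reduction="none")
        return (bce * self.weights).mean()

def filter_topics(logits: torch.Tensor, threshold: float = 0.5):
    """Keep only confident topic predictions (the abstract's vagueness filter)."""
    probs = torch.sigmoid(logits)
    return [torch.nonzero(p > threshold).flatten().tolist() for p in probs]

# Toy usage: 4 repositories, 6 candidate topics with a long-tailed distribution.
topic_counts = torch.tensor([900, 400, 120, 40, 10, 3])  # hypothetical counts
logits = torch.randn(4, 6)    # stand-in for a PTM classification head's outputs
targets = torch.randint(0, 2, (4, 6)).float()
print(FrequencyWeightedBCE(topic_counts)(logits, targets).item())
print(filter_topics(logits))
```

In a Legion-like pipeline, the logits would come from a classification head on top of a pre-trained language model encoding the repository's text; here they are random tensors purely to make the sketch runnable.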
Related papers
- Visual Analysis of GitHub Issues to Gain Insights [2.9051263101214566]
This paper presents a prototype web application that generates visualizations to offer insights into issue timelines.
It focuses on the lifecycle of issues and depicts vital information to enhance users' understanding of development patterns.
arXiv Detail & Related papers (2024-07-30T15:17:57Z)
- CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation.
We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks.
We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
arXiv Detail & Related papers (2024-06-20T16:59:52Z)
- How to Understand Whole Software Repository? [64.19431011897515]
An excellent understanding of the whole repository is a critical path to Automatic Software Engineering (ASE).
We develop a novel method named RepoUnderstander by guiding agents to comprehensively understand whole repositories.
To better utilize the repository-level knowledge, we guide the agents to summarize, analyze, and plan.
arXiv Detail & Related papers (2024-06-03T15:20:06Z)
- MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution [47.850418420195304]
Large Language Models (LLMs) have shown promise in code generation but face difficulties in resolving GitHub issues.
We propose a novel Multi-Agent framework for GitHub Issue reSolution, MAGIS, consisting of four agents customized for software evolution.
arXiv Detail & Related papers (2024-03-26T17:57:57Z)
- SoTaNa: The Open-Source Software Development Assistant [81.86136560157266]
SoTaNa is an open-source software development assistant.
It generates high-quality instruction-based data for the domain of software engineering.
It employs a parameter-efficient fine-tuning approach to enhance the open-source foundation model, LLaMA.
arXiv Detail & Related papers (2023-08-25T14:56:21Z)
- From Commit Message Generation to History-Aware Commit Message Completion [49.175498083165884]
We argue that if we could shift the focus from commit message generation to commit message completion, we could significantly improve the quality and the personal nature of the resulting commit messages.
Since the existing datasets lack historical data, we collect and share a novel dataset called CommitChronicle, containing 10.7M commits across 20 programming languages.
Our results show that in some contexts commit message completion yields better results than generation, and that while GPT-3.5-turbo generally performs worse, it shows potential for long and detailed messages.
arXiv Detail & Related papers (2023-08-15T09:10:49Z)
- CommitBART: A Large Pre-trained Model for GitHub Commits [8.783518592487248]
We present CommitBART, a large pre-trained encoder-decoder Transformer model for GitHub commits.
The model is pre-trained with three categories of objectives (denoising, cross-modal generation, and contrastive learning) across six pre-training tasks to learn representations of commit fragments.
Experiments on these tasks demonstrate that CommitBART significantly outperforms previous pre-trained models for code.
arXiv Detail & Related papers (2022-08-17T06:35:57Z)
- Automatically Categorising GitHub Repositories by Application Domain [14.265666415804025]
GitHub is the largest host of open source software on the Internet.
It is becoming increasingly hard to navigate the plethora of repositories which span a wide range of domains.
Past work has shown that taking the application domain into account is crucial for tasks such as predicting the popularity of a repository.
arXiv Detail & Related papers (2022-07-30T16:27:16Z)
- GitRank: A Framework to Rank GitHub Repositories [0.0]
Open-source repositories provide a wealth of information and are increasingly being used to build artificial intelligence (AI) based systems.
In this hackathon, we use known code quality measures and the GrimoireLab toolkit to implement GitRank, a framework to rank open-source repositories on three different criteria.
arXiv Detail & Related papers (2022-05-04T23:42:30Z)
- Contrastive Code Representation Learning [95.86686147053958]
We show that the popular reconstruction-based BERT model is sensitive to source code edits, even when the edits preserve semantics.
We propose ContraCode: a contrastive pre-training task that learns code functionality, not form.
arXiv Detail & Related papers (2020-07-09T17:59:06Z)
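The ContraCode entry above hinges on a contrastive objective: two semantics-preserving views of the same program (e.g., after variable renaming) are pulled together in embedding space while other programs are pushed apart. Below is a minimal InfoNCE-style sketch of that general idea, with hypothetical names and an arbitrary temperature, none of which come from the paper itself.

```python
# Minimal InfoNCE-style contrastive loss: row i of `anchors` and row i of
# `positives` are assumed to embed two semantics-preserving views of the
# same program; all other rows serve as negatives.
import torch
import torch.nn.functional as F

def info_nce(anchors: torch.Tensor, positives: torch.Tensor, tau: float = 0.07):
    a = F.normalize(anchors, dim=1)
    p = F.normalize(positives, dim=1)
    logits = a @ p.t() / tau             # cosine similarities over temperature
    labels = torch.arange(a.size(0))     # matching pairs lie on the diagonal
    return F.cross_entropy(logits, labels)

# Toy usage with random 8-dim embeddings for 5 code snippets.
print(info_nce(torch.randn(5, 8), torch.randn(5, 8)).item())
```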