LLM-based Content Classification Approach for GitHub Repositories by the README Files
- URL: http://arxiv.org/abs/2507.21899v1
- Date: Tue, 29 Jul 2025 15:09:38 GMT
- Title: LLM-based Content Classification Approach for GitHub Repositories by the README Files
- Authors: Malik Uzair Mehmood, Shahid Hussain, Wen Li Wang, Muhammad Usama Malik,
- Abstract summary: Large Language Models (LLMs) have shown great performance in many text-based tasks.<n>In this study, an approach is developed to fine-tune LLMs for automatically classifying different sections of GitHub files.<n>This approach outperforms current state-of-the-art methods and has achieved an overall F1 score of 0.98.
- Score: 2.212685917364911
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: GitHub is the world's most popular platform for storing, sharing, and managing code. Every GitHub repository has a README file associated with it. The README files should contain project-related information as per the recommendations of GitHub to support the usage and improvement of repositories. However, GitHub repository owners sometimes neglected these recommendations. This prevents a GitHub repository from reaching its full potential. This research posits that the comprehensiveness of a GitHub repository's README file significantly influences its adoption and utilization, with a lack of detail potentially hindering its full potential for widespread engagement and impact within the research community. Large Language Models (LLMs) have shown great performance in many text-based tasks including text classification, text generation, text summarization and text translation. In this study, an approach is developed to fine-tune LLMs for automatically classifying different sections of GitHub README files. Three encoder-only LLMs are utilized, including BERT, DistilBERT and RoBERTa. These pre-trained models are then fine-tuned based on a gold-standard dataset consisting of 4226 README file sections. This approach outperforms current state-of-the-art methods and has achieved an overall F1 score of 0.98. Moreover, we have also investigated the use of Parameter-Efficient Fine-Tuning (PEFT) techniques like Low-Rank Adaptation (LoRA) and shown an economical alternative to full fine-tuning without compromising much performance. The results demonstrate the potential of using LLMs in designing an automatic classifier for categorizing the content of GitHub README files. Consequently, this study contributes to the development of automated tools for GitHub repositories to improve their identifications and potential usages.
Related papers
- Can LLMs Write CI? A Study on Automatic Generation of GitHub Actions Configurations [0.0]
Continuous Integration services, such as GitHub Actions, require developers to write YAML-based configurations.<n>Despite the increasing use of Large Language Models (LLMs) to automate software engineering tasks, their ability to generate CI configurations remains underexplored.<n>This paper presents a preliminary study evaluating six LLMs for generating GitHub Actions configurations from natural language descriptions.
arXiv Detail & Related papers (2025-07-23T03:18:04Z) - SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving [90.32201622392137]
We present SwingArena, a competitive evaluation framework for Large Language Models (LLMs)<n>Unlike traditional static benchmarks, SwingArena models the collaborative process of software by pairing LLMs as iterations, who generate patches, and reviewers, who create test cases and verify the patches through continuous integration (CI) pipelines.
arXiv Detail & Related papers (2025-05-29T18:28:02Z) - SnipGen: A Mining Repository Framework for Evaluating LLMs for Code [51.07471575337676]
Language Models (LLMs) are trained on extensive datasets that include code repositories.<n> evaluating their effectiveness poses significant challenges due to the potential overlap between the datasets used for training and those employed for evaluation.<n>We introduce SnipGen, a comprehensive repository mining framework designed to leverage prompt engineering across various downstream tasks for code generation.
arXiv Detail & Related papers (2025-02-10T21:28:15Z) - SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution [56.9361004704428]
Large Language Models (LLMs) have demonstrated remarkable proficiency across a variety of complex tasks.<n>SWE-Fixer is a novel open-source framework designed to effectively and efficiently resolve GitHub issues.<n>We assess our approach on the SWE-Bench Lite and Verified benchmarks, achieving competitive performance among open-source models.
arXiv Detail & Related papers (2025-01-09T07:54:24Z) - ULLME: A Unified Framework for Large Language Model Embeddings with Generation-Augmented Learning [72.90823351726374]
We introduce the Unified framework for Large Language Model Embedding (ULLME), a flexible, plug-and-play implementation that enables bidirectional attention across various LLMs.
We also propose Generation-augmented Representation Learning (GRL), a novel fine-tuning method to boost LLMs for text embedding tasks.
To showcase our framework's flexibility and effectiveness, we release three pre-trained models from ULLME with different backbone architectures.
arXiv Detail & Related papers (2024-08-06T18:53:54Z) - MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution [47.850418420195304]
Large Language Models (LLMs) have shown promise in code generation but face difficulties in resolving GitHub issues.
We propose a novel Multi-Agent framework for GitHub Issue reSolution, MAGIS, consisting of four agents customized for software evolution.
arXiv Detail & Related papers (2024-03-26T17:57:57Z) - LEGION: Harnessing Pre-trained Language Models for GitHub Topic
Recommendations with Distribution-Balance Loss [3.946772434700026]
Current methods for automatic topic recommendation rely heavily on TF-IDF for encoding textual data.
This paper proposes Legion, a novel approach that leverages Pre-trained Language Models (PTMs) for recommending topics for GitHub repositories.
Our empirical evaluation on a benchmark dataset of real-world GitHub repositories shows that Legion can improve vanilla PTMs by up to 26% on recommending GitHubs topics.
arXiv Detail & Related papers (2024-03-09T10:49:31Z) - ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z) - Evaluating Transfer Learning for Simplifying GitHub READMEs [11.219774223416648]
This study explores the potential of text simplification techniques in the domain of software engineering to automatically simplify GitHub files.
We collected software-related pairs of GitHub files consisting of 14,588 entries, aligned difficult sentences with their simplified counterparts, and trained a Transformer-based model to automatically simplify difficult versions.
Using automated BLEU scores and human evaluation, we compared the performance of different transfer learning schemes and the baseline models without transfer learning.
arXiv Detail & Related papers (2023-08-19T08:20:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.