DRAGON: Robust Classification for Very Large Collections of Software Repositories
- URL: http://arxiv.org/abs/2602.09071v1
- Date: Mon, 09 Feb 2026 10:27:24 GMT
- Title: DRAGON: Robust Classification for Very Large Collections of Software Repositories
- Authors: Stefano Balla, Stefano Zacchiroli, Thomas Degueule, Jean-Rémy Falleri, Romain Robbes
- Abstract summary: We present DRAGON, a repository classifier designed for very large and diverse software collections. DRAGON operates entirely on lightweight signals commonly stored in version control systems. As a byproduct of developing DRAGON, we also release the largest open dataset to date for repository classification, consisting of 825 thousand repositories with associated ground-truth topics.
- Score: 7.11989492494202
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ability to automatically classify source code repositories with "topics" that reflect their content and purpose is very useful, especially when navigating or searching through large software collections. However, existing approaches often rely heavily on README files and other metadata, which are frequently missing, limiting their applicability in real-world large-scale settings. We present DRAGON, a repository classifier designed for very large and diverse software collections. It operates entirely on lightweight signals commonly stored in version control systems: file and directory names, and optionally the README when available. In repository classification at scale, DRAGON improves F1@5 from 54.8% to 60.8%, surpassing the state of the art. DRAGON remains effective even when README files are absent, with performance degrading by only 6% compared to when they are present. This robustness makes it practical for real-world settings where documentation is sparse or inconsistent. Furthermore, many of the remaining classification errors are near misses, where predicted labels are semantically close to the correct topics. This property increases the practical value of the predictions in real-world software collections, where suggesting a few related topics can still guide search and discovery. As a byproduct of developing DRAGON, we also release the largest open dataset to date for repository classification, consisting of 825 thousand repositories with associated ground-truth topics, sourced from the Software Heritage archive, providing a foundation for future large-scale and language-agnostic research on software repository understanding.
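The abstract reports results as F1@5 but does not spell out the formula. The sketch below shows one common way to compute F1@k for a single repository, assuming precision is measured over the top-k predicted topics and recall over the ground-truth topics. The paper's exact definition may differ, and the function name `f1_at_k` and the example topics are hypothetical, not taken from DRAGON.

```python
from typing import Sequence, Set


def f1_at_k(ranked_topics: Sequence[str], true_topics: Set[str], k: int = 5) -> float:
    """F1@k for one repository (one common definition; an assumption here).

    ranked_topics: predicted topics, best first.
    true_topics:   ground-truth topics for the repository.
    """
    top_k = list(ranked_topics)[:k]
    if not top_k or not true_topics:
        return 0.0
    # Count predictions among the top k that match a ground-truth topic.
    hits = sum(1 for topic in top_k if topic in true_topics)
    if hits == 0:
        return 0.0
    precision = hits / len(top_k)      # share of returned topics that are correct
    recall = hits / len(true_topics)   # share of true topics that were returned
    return 2 * precision * recall / (precision + recall)


# Hypothetical example data, not taken from the paper.
predicted = ["python", "machine-learning", "deep-learning", "cli", "docker"]
truth = {"python", "deep-learning", "pytorch"}
score = f1_at_k(predicted, truth, k=5)
print(f"F1@5 = {score:.3f}")
```

A dataset-level figure such as the reported 60.8% would then typically be an average of this per-repository score, although the averaging scheme is also an assumption here.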
Related papers
- GREPO: A Benchmark for Graph Neural Networks on Repository-Level Bug Localization [50.009407518866965]
  Repository-level bug localization is a critical software engineering challenge. GNNs offer a promising alternative due to their ability to model complex, repository-wide dependencies. We introduce GREPO, the first GNN benchmark for repository-scale bug localization tasks.
  arXiv Detail & Related papers (2026-02-14T23:22:15Z)
- Improving Code Localization with Repository Memory [33.423769985220005]
  We introduce tools that allow the agent to retrieve from a non-parametric memory encompassing recent historical commits and linked issues. We demonstrate that augmenting such a memory can significantly improve LocAgent, a state-of-the-art localization framework.
  arXiv Detail & Related papers (2025-10-01T15:10:15Z)
- Meta-RAG on Large Codebases Using Code Summarization [11.415083231118142]
  Large Language Model (LLM) systems have been at the forefront of applied Artificial Intelligence (AI) research in a multitude of domains. We propose a multi-agent system to localize bugs in large pre-existing codebases using information retrieval and LLMs. Our system introduces a novel Retrieval Augmented Generation (RAG) approach, Meta-RAG, where we utilize summaries to condense codebases by an average of 79.8% into a compact, structured, natural language representation.
  arXiv Detail & Related papers (2025-08-04T17:01:10Z)
- LLM-based Content Classification Approach for GitHub Repositories by the README Files [2.212685917364911]
  Large Language Models (LLMs) have shown great performance in many text-based tasks. In this study, an approach is developed to fine-tune LLMs for automatically classifying different sections of GitHub README files. This approach outperforms current state-of-the-art methods and has achieved an overall F1 score of 0.98.
  arXiv Detail & Related papers (2025-07-29T15:09:38Z)
- SweRank: Software Issue Localization with Code Ranking [109.3289316191729]
  SweRank is an efficient retrieve-and-rerank framework for software issue localization. We construct SweLoc, a large-scale dataset curated from public GitHub repositories. We show that SweRank achieves state-of-the-art performance, outperforming both prior ranking models and costly agent-based systems.
  arXiv Detail & Related papers (2025-05-07T19:44:09Z)
- Repository-level Code Search with Neural Retrieval Methods [25.222964965449286]
  We define the task of repository-level code search as retrieving the set of files from the current state of a code repository that are most relevant to addressing a user's question or bug. The proposed approach combines BM25-based retrieval over commit messages with neural reranking using CodeBERT to identify the most pertinent files. Experiments on a new dataset created from 7 popular open-source repositories demonstrate substantial improvements of up to 80% in MAP, MRR and P@1 over the BM25 baseline.
  arXiv Detail & Related papers (2025-02-10T21:59:01Z)
- An Empirical Study of Dotfiles Repositories Containing User-Specific Configuration Files [1.7556600627464058]
  Hundreds of thousands choose to publicly host their repositories on GitHub. We collected and analyzed publicly-hosted dotfiles repositories on GitHub. We found that 25.8% of the top 500 most-starred GitHub users maintain some form of publicly accessible dotfiles repository.
  arXiv Detail & Related papers (2025-01-30T18:32:46Z)
- CoIR: A Comprehensive Benchmark for Code Information Retrieval Models [52.61625841028781]
  COIR (Code Information Retrieval Benchmark) is a robust and comprehensive benchmark designed to assess code retrieval capabilities. COIR comprises ten meticulously curated code datasets, spanning eight distinctive retrieval tasks across seven diverse domains. We evaluate nine widely used retrieval models using COIR, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems.
  arXiv Detail & Related papers (2024-07-03T07:58:20Z)
- Alibaba LingmaAgent: Improving Automated Issue Resolution via Comprehensive Repository Exploration [64.19431011897515]
  This paper presents Alibaba LingmaAgent, a novel Automated Software Engineering method designed to comprehensively understand and utilize whole software repositories for issue resolution. Our approach introduces a top-down method to condense critical repository information into a knowledge graph, reducing complexity, and employs a Monte Carlo tree search based strategy. In production deployment and evaluation at Alibaba Cloud, LingmaAgent automatically resolved 16.9% of in-house issues faced by development engineers, and solved 43.3% of problems after manual intervention.
  arXiv Detail & Related papers (2024-06-03T15:20:06Z)
- Deep learning for table detection and structure recognition: A survey [49.09628624903334]
  The goal of this survey is to provide a profound comprehension of the major developments in the field of Table Detection. We provide an analysis of both classic and new applications in the field. The datasets and source code of the existing models are organized to provide the reader with a compass on this vast literature.
  arXiv Detail & Related papers (2022-11-15T19:42:27Z)
- Omni-DETR: Omni-Supervised Object Detection with Transformers [165.4190908259015]
  We consider the problem of omni-supervised object detection, which can use unlabeled, fully labeled and weakly labeled annotations. Under this unified architecture, different types of weak labels can be leveraged to generate accurate pseudo labels. We have found that weak annotations can help to improve detection performance and a mixture of them can achieve a better trade-off between annotation cost and accuracy.
  arXiv Detail & Related papers (2022-03-30T06:36:09Z)
- TAO: A Large-Scale Benchmark for Tracking Any Object [95.87310116010185]
  The Tracking Any Object dataset consists of 2,907 high-resolution videos, captured in diverse environments, which are half a minute long on average. We ask annotators to label objects that move at any point in the video, and give names to them post factum. Our vocabulary is both significantly larger and qualitatively different from existing tracking datasets.
  arXiv Detail & Related papers (2020-05-20T21:07:28Z)