Natural Language Summarization Enables Multi-Repository Bug Localization by LLMs in Microservice Architectures
- URL: http://arxiv.org/abs/2512.05908v1
- Date: Fri, 05 Dec 2025 17:42:09 GMT
- Title: Natural Language Summarization Enables Multi-Repository Bug Localization by LLMs in Microservice Architectures
- Authors: Amirkia Rafiei Oskooei, S. Selcan Yukcu, Mehmet Cevheri Bozoglan, Mehmet S. Aktas
- Abstract summary: This work shows that engineered natural language representations can be more effective than raw source code for scalable bug localization. Evaluated on DNext, an industrial system with 46 repositories and 1.1M lines of code, our method achieves Pass@10 of 0.82 and MRR of 0.50.
- Score: 0.23332469289621782
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Bug localization in multi-repository microservice architectures is challenging due to the semantic gap between natural language bug reports and code, LLM context limitations, and the need to first identify the correct repository. We propose reframing this as a natural language reasoning task by transforming codebases into hierarchical NL summaries and performing NL-to-NL search instead of cross-modal retrieval. Our approach builds context-aware summaries at file, directory, and repository levels, then uses a two-phase search: first routing bug reports to relevant repositories, then performing top-down localization within those repositories. Evaluated on DNext, an industrial system with 46 repositories and 1.1M lines of code, our method achieves Pass@10 of 0.82 and MRR of 0.50, significantly outperforming retrieval baselines and agentic RAG systems like GitHub Copilot and Cursor. This work demonstrates that engineered natural language representations can be more effective than raw source code for scalable bug localization, providing an interpretable repository -> directory -> file search path, which is vital for building trust in enterprise AI tools by providing essential transparency.
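To make the described pipeline concrete, below is a minimal sketch of the two-phase NL-to-NL search and of the Pass@10 / MRR metrics reported above. Everything in it is a hypothetical stand-in: the summary hierarchy, the token-overlap similarity, and the function names are illustrative only and do not reproduce the paper's actual summarizer, retriever, or prompts.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node in the repository -> directory -> file summary hierarchy."""
    path: str
    summary: str                      # engineered natural-language summary
    children: list["Node"] = field(default_factory=list)


def similarity(query: str, summary: str) -> float:
    """Toy NL-to-NL score via token overlap; a real system would embed both texts."""
    q, s = set(query.lower().split()), set(summary.lower().split())
    return len(q & s) / max(len(q), 1)


def route_repositories(bug_report: str, repos: list[Node], top_n: int = 3) -> list[Node]:
    """Phase 1: route the bug report to the repositories whose summaries match best."""
    return sorted(repos, key=lambda r: similarity(bug_report, r.summary), reverse=True)[:top_n]


def localize(bug_report: str, repo: Node, top_k: int = 10) -> list[str]:
    """Phase 2: top-down walk over directory and file summaries, ranking files."""
    scored, stack = [], [repo]
    while stack:
        node = stack.pop()
        if node.children:
            stack.extend(node.children)
        else:                                   # leaf == file-level summary
            scored.append((similarity(bug_report, node.summary), node.path))
    return [path for _, path in sorted(scored, reverse=True)[:top_k]]


def pass_at_k(ranked: list[str], gold: set[str], k: int = 10) -> float:
    """Pass@k: 1.0 if any ground-truth buggy file appears in the top-k results."""
    return float(any(p in gold for p in ranked[:k]))


def mrr(ranked: list[str], gold: set[str]) -> float:
    """Reciprocal rank of the first ground-truth file, 0.0 if none is retrieved."""
    for rank, p in enumerate(ranked, start=1):
        if p in gold:
            return 1.0 / rank
    return 0.0
```

Under this reading, Pass@10 = 0.82 means that for 82% of bug reports at least one ground-truth file appears among the ten returned paths, while MRR averages the reciprocal rank of the first correct file across reports.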
Related papers
- Multi-CoLoR: Context-Aware Localization and Reasoning across Multi-Language Codebases [1.4216413758677147]
We present Multi-CoLoR, a framework for context-aware localization and reasoning across multi-language codebases.
It integrates organizational knowledge retrieval with graph-based reasoning to traverse complex software ecosystems.
arXiv Detail & Related papers (2026-02-23T00:54:59Z)
- GREPO: A Benchmark for Graph Neural Networks on Repository-Level Bug Localization [50.009407518866965]
Repository-level bug localization is a critical software engineering challenge.
GNNs offer a promising alternative due to their ability to model complex, repository-wide dependencies.
We introduce GREPO, the first GNN benchmark for repository-scale bug localization tasks.
arXiv Detail & Related papers (2026-02-14T23:22:15Z)
- SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models [59.90381306452982]
Evaluating large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with real-world development.
We introduce SWE-Compass, a comprehensive benchmark that unifies heterogeneous code-related evaluations into a structured and production-aligned framework.
SWE-Compass spans 8 task types, 8 programming scenarios, and 10 programming languages, with 2000 high-quality instances curated from authentic GitHub pull requests.
arXiv Detail & Related papers (2025-11-07T18:01:32Z)
- SweRank: Software Issue Localization with Code Ranking [109.3289316191729]
SweRank is an efficient retrieve-and-rerank framework for software issue localization.
We construct SweLoc, a large-scale dataset curated from public GitHub repositories.
We show that SweRank achieves state-of-the-art performance, outperforming both prior ranking models and costly agent-based systems.
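As a rough illustration of the retrieve-and-rerank pattern summarized above (not SweRank's actual models or code), the following sketch assumes a cheap first-stage scorer and a more expensive reranker are supplied as plain functions:

```python
from typing import Callable


def retrieve_and_rerank(
    issue: str,
    functions: dict[str, str],                 # candidate id -> code or summary text
    first_stage: Callable[[str, str], float],  # cheap retrieval score (e.g. bi-encoder)
    reranker: Callable[[str, str], float],     # expensive score (e.g. cross-encoder)
    shortlist: int = 50,
    top_k: int = 10,
) -> list[str]:
    # Stage 1: score every candidate cheaply and keep a shortlist.
    candidates = sorted(functions, key=lambda n: first_stage(issue, functions[n]),
                        reverse=True)[:shortlist]
    # Stage 2: rerank only the shortlist with the costlier model.
    return sorted(candidates, key=lambda n: reranker(issue, functions[n]),
                  reverse=True)[:top_k]
```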
arXiv Detail & Related papers (2025-05-07T19:44:09Z)
- Enhancing repository-level software repair via repository-aware knowledge graphs [13.747293341707563]
Repository-level software repair faces challenges in bridging semantic gaps between issue descriptions and code patches.
Existing approaches, which rely on large language models (LLMs), are hindered by semantic ambiguities, limited understanding of structural context, and insufficient reasoning capabilities.
We propose a novel repository-aware knowledge graph (KG) that accurately links repository artifacts (issues and pull requests) and entities (files, classes, and functions), together with a path-guided repair mechanism that traces KG-mined paths to augment the retrieved context with explanations.
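A toy sketch of mining paths through such a knowledge graph, assuming the KG is given as a plain adjacency map from issues and pull requests down to files, classes, and functions; the paper's actual KG schema and path-guided repair mechanism are considerably richer, and all node names below are hypothetical:

```python
from collections import deque


def context_paths(kg: dict[str, list[str]], issue: str, max_depth: int = 3) -> list[list[str]]:
    """Breadth-first collection of paths from an issue node to leaf code entities."""
    paths, queue = [], deque([[issue]])
    while queue:
        path = queue.popleft()
        neighbors = kg.get(path[-1], [])
        if not neighbors or len(path) > max_depth:
            paths.append(path)            # leaf reached or depth budget spent
            continue
        for nxt in neighbors:
            if nxt not in path:           # avoid cycles between linked artifacts
                queue.append(path + [nxt])
    return paths


# Hypothetical example: the mined path connects an issue to the code it touches,
# and that chain is handed to the repair model as context plus explanation.
kg_example = {
    "issue#42": ["PR#7"],
    "PR#7": ["src/auth.py"],
    "src/auth.py": ["AuthService.login"],
}
print(context_paths(kg_example, "issue#42"))
```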
arXiv Detail & Related papers (2025-03-27T17:21:47Z)
- ExecRepoBench: Multi-level Executable Code Completion Evaluation [45.963424627710765]
We introduce a novel framework for enhancing code completion in software development, built around a repository-level benchmark, ExecRepoBench.
We present a multi-level grammar-based completion methodology, conditioned on the abstract syntax tree, that masks code fragments at various logical units.
We then fine-tune an open-source 7B-parameter LLM on Repo-Instruct to produce a strong code completion baseline model, Qwen2.5-Coder-Instruct-C.
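A simplified, Python-only illustration of masking code at an AST-chosen logical unit (here, one function body) to build a completion instance; ExecRepoBench's grammar-based masking covers many more unit types and languages, and the helper below is purely hypothetical:

```python
import ast


def mask_first_function(source: str, mask_token: str = "<MASK>") -> tuple[str, str]:
    """Return (masked_source, ground_truth) for the first function body found.

    For simplicity, assumes the body starts on its own line and uses 4-space indentation.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            lines = source.splitlines()
            start = node.body[0].lineno - 1     # first body line (0-indexed)
            end = node.body[-1].end_lineno      # last body line (exclusive slice end)
            ground_truth = "\n".join(lines[start:end])
            masked = "\n".join(lines[:start] + ["    " + mask_token] + lines[end:])
            return masked, ground_truth
    raise ValueError("no function definition found")
```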
arXiv Detail & Related papers (2024-12-16T17:14:35Z)
- Scalable, Validated Code Translation of Entire Projects using Large Language Models [13.059046327936393]
Large language models (LLMs) show promise in code translation due to their ability to generate idiomatic code.
Existing works have shown a drop in translation success rates for code exceeding around 100 lines.
We develop a modular approach to translation, where we partition the code into small code fragments which can be independently translated.
We show that we can consistently generate reliable Rust for projects up to 6,600 lines of code and 369 functions, with an average of 73% of functions successfully validated for I/O equivalence.
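The fragment-wise workflow can be pictured roughly as below, assuming a per-fragment translate() call (for example, an LLM prompt) and an I/O-equivalence harness supplied by the caller; the paper's actual partitioning strategy and Rust validation pipeline are more involved:

```python
from typing import Callable


def translate_project(
    fragments: dict[str, str],                    # fragment name -> source code
    translate: Callable[[str], str],              # per-fragment translator (e.g. an LLM call)
    io_equivalent: Callable[[str, str], bool],    # runs both versions on shared tests
) -> dict[str, tuple[str, bool]]:
    """Translate each fragment independently and record whether it validated."""
    results = {}
    for name, src in fragments.items():
        target = translate(src)
        results[name] = (target, io_equivalent(src, target))
    return results
```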
arXiv Detail & Related papers (2024-12-11T02:31:46Z)
- RustRepoTrans: Repository-level Code Translation Benchmark Targeting Rust [50.65321080814249]
RustRepoTrans is the first repository-level context code translation benchmark targeting incremental translation.
We evaluate seven representative LLMs, analyzing their errors to assess limitations in complex translation scenarios.
arXiv Detail & Related papers (2024-11-21T10:00:52Z)
- Class-Level Code Generation from Natural Language Using Iterative, Tool-Enhanced Reasoning over Repository [4.767858874370881]
We introduce RepoClassBench, a benchmark designed to rigorously evaluate LLMs in generating class-level code within real-world repositories.
RepoClassBench includes "Natural Language to Class generation" tasks across Java, Python & C# from a selection of repositories.
We introduce Retrieve-Repotools-Reflect (RRR), a novel approach that equips LLMs with static analysis tools to iteratively navigate & reason about repository-level context.
arXiv Detail & Related papers (2024-04-22T03:52:54Z)
- IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators [49.903001442804594]
This work investigates the prospect of leveraging compiler intermediate representations (IR) to improve the multilingual capabilities of Code-LMs.
We first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files.
Next, we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to learn the IR language.
Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics.
arXiv Detail & Related papers (2024-03-06T17:52:08Z)
- ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)