A Benchmark for Localizing Code and Non-Code Issues in Software Projects
- URL: http://arxiv.org/abs/2509.25242v1
- Date: Fri, 26 Sep 2025 06:05:20 GMT
- Title: A Benchmark for Localizing Code and Non-Code Issues in Software Projects
- Authors: Zejun Zhang, Jian Wang, Qingyun Yang, Yifan Pan, Yi Tang, Yi Li, Zhenchang Xing, Tian Zhang, Xuandong Li, Guoan Zhang
- Abstract summary: We introduce MULocBench, a dataset of 1,100 issues from 46 popular GitHub Python projects. Compared with existing benchmarks, MULocBench offers greater diversity in issue types, root causes, location scopes, and file types. Using this benchmark, we assess the performance of state-of-the-art localization methods and five LLM-based prompting strategies.
- Score: 26.511673758202267
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate project localization (e.g., files and functions) for issue resolution is a critical first step in software maintenance. However, existing benchmarks for issue localization, such as SWE-Bench and LocBench, are limited. They focus predominantly on pull-request issues and code locations, ignoring other evidence and non-code files such as commits, comments, configurations, and documentation. To address this gap, we introduce MULocBench, a comprehensive dataset of 1,100 issues from 46 popular GitHub Python projects. Compared with existing benchmarks, MULocBench offers greater diversity in issue types, root causes, location scopes, and file types, providing a more realistic testbed for evaluation. Using this benchmark, we assess the performance of state-of-the-art localization methods and five LLM-based prompting strategies. Our results reveal significant limitations in current techniques: even at the file level, performance metrics (Acc@5, F1) remain below 40%. This underscores the challenge of generalizing to realistic, multi-faceted issue resolution. To enable future research on project localization for issue resolution, we publicly release MULocBench at https://huggingface.co/datasets/somethingone/MULocBench.
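The abstract reports file-level Acc@5 and F1 below 40% without spelling out how those metrics are computed. Below is a minimal sketch assuming the standard definitions (a hit within the top-k predicted files for Acc@k, set-overlap F1 between predicted and gold files); MULocBench's released evaluation script may differ in details, and the file paths in the example are purely illustrative.

```python
# Minimal sketch of the file-level metrics named in the abstract (Acc@k, F1),
# assuming standard definitions; MULocBench's own evaluation code may differ.

def acc_at_k(predicted_files: list[str], gold_files: set[str], k: int = 5) -> bool:
    """Acc@k: does any gold location appear among the top-k predictions?"""
    return any(f in gold_files for f in predicted_files[:k])

def file_level_f1(predicted_files: list[str], gold_files: set[str]) -> float:
    """F1 over the predicted vs. gold file sets."""
    pred = set(predicted_files)
    if not pred or not gold_files:
        return 0.0
    tp = len(pred & gold_files)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold_files)
    return 2 * precision * recall / (precision + recall)

# Illustrative prediction for one issue whose fix touches non-code files.
preds = ["src/cli.py", "docs/usage.md", "setup.cfg"]
gold = {"docs/usage.md", "setup.cfg"}
print(acc_at_k(preds, gold, k=5), round(file_level_f1(preds, gold), 2))  # True 0.8
```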
Related papers
- Does SWE-Bench-Verified Test Agent Ability or Model Memory? [2.937612609787308]
SWE-Bench-Verified is a dataset comprising 500 issues. This benchmark may overlap with model training data. We test two Claude models that frequently appear in top-performing agents submitted to the benchmark.
arXiv Detail & Related papers (2025-12-11T02:11:06Z) - EDIT-Bench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits [72.23150343093447]
We introduce EDIT-Bench, a benchmark for evaluating instructed code editing capabilities grounded in real-world usage. EDIT-Bench comprises 545 problems, multiple natural and programming languages, and a diverse set of real-world use cases. We find that model performance varies across different categories of user instructions.
arXiv Detail & Related papers (2025-11-06T16:05:28Z) - The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason [1.6249398255272318]
We present empirical evidence that performance gains on SWE-Bench-Verified may be partially driven by memorization rather than genuine problem-solving. We show that state-of-the-art models achieve up to 76% accuracy in identifying buggy file paths using only issue descriptions, without access to repository structure. These findings raise concerns about the validity of existing results and underscore the need for more robust, contamination-resistant benchmarks.
arXiv Detail & Related papers (2025-06-14T00:25:26Z) - SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving [90.32201622392137]
We present SwingArena, a competitive evaluation framework for Large Language Models (LLMs). Unlike traditional static benchmarks, SwingArena models the collaborative process of software development by pairing LLMs as submitters, who generate patches, and reviewers, who create test cases and verify the patches through continuous integration (CI) pipelines.
arXiv Detail & Related papers (2025-05-29T18:28:02Z) - SweRank: Software Issue Localization with Code Ranking [109.3289316191729]
SweRank is an efficient retrieve-and-rerank framework for software issue localization. We construct SweLoc, a large-scale dataset curated from public GitHub repositories. We show that SweRank achieves state-of-the-art performance, outperforming both prior ranking models and costly agent-based systems.
arXiv Detail & Related papers (2025-05-07T19:44:09Z) - Information Density Principle for MLLM Benchmarks [59.88484827926759]
We propose a critical principle of Information Density, which examines how much insight a benchmark can provide for the development of MLLMs. Through a comprehensive analysis of more than 10,000 samples, we measured the information density of 19 MLLM benchmarks. Experiments show that using the latest benchmarks in testing can provide more insight compared to previous ones, but there is still room for improvement in their information density.
arXiv Detail & Related papers (2025-03-13T05:58:41Z) - How Should We Build A Benchmark? Revisiting 274 Code-Related Benchmarks For LLMs [60.25940747590386]
We propose How2Bench, which comprises a 55-criteria checklist as a set of guidelines to govern the development of code-related benchmarks comprehensively. We profiled 274 benchmarks released within the past decade and found concerning issues. Nearly 70% of the benchmarks did not take measures for data quality assurance; over 10% did not open-source their artifacts at all, or only partially open-sourced them.
arXiv Detail & Related papers (2025-01-18T09:51:57Z) - Integrating Various Software Artifacts for Better LLM-based Bug Localization and Program Repair [3.617293786745078]
We propose DEVLoRe, which uses issue content (description and message) and stack error traces to localize buggy methods. By incorporating different artifacts, DEVLoRe successfully locates 49.3% and 47.6% of single and non-single buggy methods, respectively. This outperforms current state-of-the-art APR methods.
arXiv Detail & Related papers (2024-12-05T06:21:31Z) - BLAZE: Cross-Language and Cross-Project Bug Localization via Dynamic Chunking and Hard Example Learning [1.9854146581797698]
BLAZE is an approach that employs dynamic chunking and hard example learning. It fine-tunes a GPT-based model on challenging bug cases to enhance cross-project and cross-language bug localization. BLAZE achieves improvements of up to 120% in Top-1 accuracy, 144% in Mean Average Precision (MAP), and 100% in Mean Reciprocal Rank (MRR).
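The BLAZE entry reports relative gains in Top-1 accuracy, MAP, and MRR. For reference, here is a minimal sketch of how MAP and MRR are conventionally computed for ranked bug-localization output; these are the standard information-retrieval definitions, assumed here rather than taken from BLAZE's own evaluation code.

```python
# Minimal sketch of MAP and MRR over ranked file lists, one ranking per bug
# report; standard IR definitions assumed, not BLAZE's exact evaluation script.

def mean_reciprocal_rank(ranked_lists: list[list[str]], gold_sets: list[set[str]]) -> float:
    """MRR: average of 1/rank of the first relevant file for each bug report."""
    reciprocal_ranks = []
    for ranking, gold in zip(ranked_lists, gold_sets):
        rank = next((i + 1 for i, f in enumerate(ranking) if f in gold), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

def mean_average_precision(ranked_lists: list[list[str]], gold_sets: list[set[str]]) -> float:
    """MAP: mean over bug reports of precision averaged at each relevant hit."""
    average_precisions = []
    for ranking, gold in zip(ranked_lists, gold_sets):
        hits, precisions = 0, []
        for i, f in enumerate(ranking, start=1):
            if f in gold:
                hits += 1
                precisions.append(hits / i)
        average_precisions.append(sum(precisions) / len(gold) if gold else 0.0)
    return sum(average_precisions) / len(average_precisions)
```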
arXiv Detail & Related papers (2024-07-24T20:44:36Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present MR-Ben, a process-based benchmark that demands meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - Long Code Arena: a Set of Benchmarks for Long-Context Code Models [75.70507534322336]
Long Code Arena is a suite of six benchmarks for code processing tasks that require project-wide context.
These tasks cover different aspects of code processing: library-based code generation, CI builds repair, project-level code completion, commit message generation, bug localization, and module summarization.
For each task, we provide a manually verified dataset for testing, an evaluation suite, and open-source baseline solutions.
arXiv Detail & Related papers (2024-06-17T14:58:29Z)