Related papers: SWE-QA: Can Language Models Answer Repository-level Code Questions?

URL: http://arxiv.org/abs/2509.14635v1
Date: Thu, 18 Sep 2025 05:25:32 GMT
Title: SWE-QA: Can Language Models Answer Repository-level Code Questions?
Authors: Weihan Peng, Yuling Shi, Yuhang Wang, Xinyun Zhang, Beijun Shen, Xiaodong Gu
Abstract summary: SWE-QA is a repository-level code question answering (QA) benchmark designed to facilitate research on automated QA systems in realistic code environments. SWE-QA involves 576 high-quality question-answer pairs spanning diverse categories, including intention understanding, cross-file reasoning, and multi-hop dependency analysis. We develop SWE-QA-Agent, an agentic framework in which LLM agents reason and act to find answers automatically.
Score: 23.0514975768053
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Understanding and reasoning about entire software repositories is an essential capability for intelligent software engineering tools. While existing benchmarks such as CoSQA and CodeQA have advanced the field, they predominantly focus on small, self-contained code snippets. These setups fail to capture the complexity of real-world repositories, where effective understanding and reasoning often require navigating multiple files, understanding software architecture, and grounding answers in long-range code dependencies. In this paper, we present SWE-QA, a repository-level code question answering (QA) benchmark designed to facilitate research on automated QA systems in realistic code environments. SWE-QA involves 576 high-quality question-answer pairs spanning diverse categories, including intention understanding, cross-file reasoning, and multi-hop dependency analysis. To construct SWE-QA, we first crawled 77,100 GitHub issues from 11 popular repositories. Based on an analysis of naturally occurring developer questions extracted from these issues, we developed a two-level taxonomy of repository-level questions and constructed a set of seed questions for each category. For each category, we manually curated and validated questions and collected their corresponding answers. As a prototype application, we further develop SWE-QA-Agent, an agentic framework in which LLM agents reason and act to find answers automatically. We evaluate six advanced LLMs on SWE-QA under various context augmentation strategies. Experimental results highlight the promise of LLMs, particularly our SWE-QA-Agent framework, in addressing repository-level QA, while also revealing open challenges and pointing to future research directions.

Related papers

Inferential Question Answering [67.54465021408724]
We introduce Inferential QA -- a new task that challenges models to infer answers from answer-supporting passages which provide only clues. To study this problem, we construct the QUIT (QUestions requiring Inference from Texts) dataset, comprising 7,401 questions and 2.4M passages. We show that methods effective on traditional QA tasks struggle in inferential QA: retrievers underperform, rerankers offer limited gains, and fine-tuning provides inconsistent improvements.
arXiv Detail & Related papers (2026-02-01T14:02:43Z)
The benefits of query-based KGQA systems for complex and temporal questions in LLM era [55.20230501807337]
Large language models excel in question-answering (QA) yet still struggle with multi-hop reasoning and temporal questions. Query-based knowledge graph QA (KGQA) offers a modular alternative by generating executable queries instead of direct answers. We explore a multi-stage query-based framework for WikiData QA, proposing a multi-stage approach that enhances performance on challenging multi-hop and temporal benchmarks.
arXiv Detail & Related papers (2025-07-16T06:41:03Z)
CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance [18.886738819470086]
We introduce CodeAssistBench (CAB), the first benchmark framework for evaluating multi-turn programming assistance. Unlike existing programming Q&A benchmarks, CAB automatically generates scalable datasets from question-related GitHub issues. Using this framework, we constructed a test set of 3,286 real-world programming questions across 231 repositories.
arXiv Detail & Related papers (2025-07-14T17:19:00Z)
CoReQA: Uncovering Potentials of Language Models in Code Repository Question Answering [12.431784613373523]
We introduce CoReQA, a benchmark for Code Repository-level question answering. CoReQA was constructed from GitHub issues and comments from 176 popular repositories across four programming languages. We show that state-of-the-art proprietary and long-context models struggle to address repository-level questions effectively.
arXiv Detail & Related papers (2025-01-07T00:24:07Z)
Grounding-IQA: Multimodal Language Grounding Model for Image Quality Assessment [69.07445098168344]
We introduce a new image quality assessment (IQA) task paradigm, grounding-IQA. Grounding-IQA comprises two subtasks: grounding-IQA-description (GIQA-DES) and visual question answering (GIQA-VQA). To realize grounding-IQA, we construct a corresponding dataset, GIQA-160K, through our proposed automated annotation pipeline. Experiments demonstrate that our proposed task paradigm, dataset, and benchmark facilitate more fine-grained IQA applications.
arXiv Detail & Related papers (2024-11-26T09:03:16Z)
DEXTER: A Benchmark for open-domain Complex Question Answering using LLMs [3.24692739098077]
Open-domain complex Question Answering (QA) is a difficult task with challenges in evidence retrieval and reasoning. We evaluate state-of-the-art pre-trained dense and sparse retrieval models in an open-domain setting. We observe that late interaction models and, surprisingly, lexical models like BM25 perform well compared to other pre-trained dense retrieval models.
arXiv Detail & Related papers (2024-06-24T22:09:50Z)
Multi-LLM QA with Embodied Exploration [55.581423861790945]
We investigate the use of Multi-Embodied LLM Explorers (MELE) for question-answering in an unknown environment. Multiple LLM-based agents independently explore and then answer queries about a household environment. We analyze different aggregation methods to generate a single, final answer for each query.
arXiv Detail & Related papers (2024-06-16T12:46:40Z)
In-Context Ability Transfer for Question Decomposition in Complex QA [6.745884231594893]
We propose icat (In-Context Ability Transfer) to solve complex question-answering tasks. We transfer the ability to decompose complex questions to simpler questions or generate step-by-step rationales to LLMs. We conduct large-scale experiments on a variety of complex QA tasks involving numerical reasoning, compositional complex QA, and heterogeneous complex QA.
arXiv Detail & Related papers (2023-10-26T11:11:07Z)
ProQA: Structural Prompt-based Pre-training for Unified Question Answering [84.59636806421204]
ProQA is a unified QA paradigm that solves various tasks through a single model. It concurrently models the knowledge generalization for all QA tasks while keeping the knowledge customization for every specific QA task. ProQA consistently boosts performance across full data fine-tuning, few-shot learning, and zero-shot testing scenarios.
arXiv Detail & Related papers (2022-05-09T04:59:26Z)
Retrieving and Reading: A Comprehensive Survey on Open-domain Question Answering [62.88322725956294]
We review the latest research trends in OpenQA, with particular attention to systems that incorporate neural MRC techniques. We introduce a modern OpenQA architecture named "Retriever-Reader" and analyze the various systems that follow this architecture. We then discuss key challenges to developing OpenQA systems and offer an analysis of benchmarks that are commonly used.
arXiv Detail & Related papers (2021-01-04T04:47:46Z)
Unsupervised Question Decomposition for Question Answering [102.56966847404287]
We propose an algorithm for One-to-N Unsupervised Sequence transduction (ONUS) that learns to map one hard, multi-hop question to many simpler, single-hop sub-questions. We show large QA improvements on HotpotQA over a strong baseline on the original, out-of-domain, and multi-hop dev sets.
arXiv Detail & Related papers (2020-02-22T19:40:35Z)
