SimpleDevQA: Benchmarking Large Language Models on Development Knowledge QA
- URL: http://arxiv.org/abs/2512.08867v1
- Date: Tue, 09 Dec 2025 17:58:36 GMT
- Title: SimpleDevQA: Benchmarking Large Language Models on Development Knowledge QA
- Authors: Jing Zhang, Lianghong Guo, Yanlin Wang, Mingwei Liu, Jiachi Chen, Yuchi Ma, Ensheng Shi, Terry Yue Zhuo, Hongyu Zhang, Zibin Zheng
- Abstract summary: The Dev Knowledge QA task accounts for 39.6% of interactions. Only 27.5% of real Dev Knowledge QA dialogues focus on code understanding. Only 17.1% of real-world Dev Knowledge QA dialogues can be used for constructing a benchmark.
- Score: 58.75982433502236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Development Knowledge Question Answering (Dev Knowledge QA) task aims to provide natural language answers to knowledge-seeking questions during software development. To investigate its importance and to what extent it has been explored, we analyze real user-LLM dialogues from WildChat and find that: (1) The Dev Knowledge QA task accounts for 39.6% of interactions (highest among all tasks), revealing broad knowledge needs beyond code generation (32.3%). (2) Only 27.5% of real Dev Knowledge QA dialogues focus on code understanding, leaving out development knowledge-seeking. (3) Only 17.1% of real-world Dev Knowledge QA dialogues can be used for constructing a benchmark. Existing benchmarks have two primary limitations for evaluating the Dev Knowledge QA capability of LLMs. First, existing benchmarks offer a limited development knowledge scope, mainly focusing on code understanding and neglecting broader knowledge during development. Second, some benchmarks are not built from real user queries. To bridge this gap, we design a three-phase pipeline that transforms real-world dialogue into simple development knowledge-seeking QA pairs. Through this pipeline, we introduce SimpleDevQA, a multilingual benchmark derived from real user dialogues. It contains 2,740 QA pairs in three languages (English, Chinese, and Russian), and focuses on questions with unique, short, and verifiable answers for accurate and simple evaluation. Experiments show that: (1) code LLMs generally outperform general LLMs of similar scale; (2) knowledge injection with the Retrieval-Augmented Generation (RAG) strategy can boost LLM accuracy by 11.3% on average; (3) LLMs show systematic overconfidence in Dev Knowledge QA, and the answering accuracy of LLMs shows a positive correlation with their stated confidence; (4) generally, LLMs with stronger code generation performance also exhibit stronger performance in Dev Knowledge QA.
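The abstract reports accuracy on short, verifiable answers and a gap between stated confidence and accuracy. The following is a minimal sketch of how such quantities could be computed, not the authors' evaluation code: the `Record` type, the exact-match grading after normalization, and the binning scheme are all assumptions made for illustration (the paper may grade answers differently, for example with a model-based judge).

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One graded answer (hypothetical structure, not from the paper)."""
    predicted: str     # model's short answer
    gold: str          # reference answer
    confidence: float  # model's stated confidence in [0, 1]


def normalize(text: str) -> str:
    # Assumption: case- and whitespace-insensitive exact match.
    return " ".join(text.lower().split())


def is_correct(record: Record) -> bool:
    return normalize(record.predicted) == normalize(record.gold)


def accuracy(records: list[Record]) -> float:
    return sum(is_correct(r) for r in records) / len(records)


def overconfidence_gap(records: list[Record]) -> float:
    # Mean stated confidence minus accuracy; positive means overconfident.
    mean_confidence = sum(r.confidence for r in records) / len(records)
    return mean_confidence - accuracy(records)


def accuracy_by_confidence_bin(records: list[Record], n_bins: int = 5) -> dict[int, float]:
    # Group answers into equal-width confidence bins and report accuracy per bin.
    bins: dict[int, list[Record]] = {}
    for r in records:
        index = min(int(r.confidence * n_bins), n_bins - 1)
        bins.setdefault(index, []).append(r)
    return {index: accuracy(group) for index, group in sorted(bins.items())}
```

If accuracy rises from the lowest to the highest confidence bin while `overconfidence_gap` stays positive, that would match the pattern described in the abstract: confidence is informative but systematically too high.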
Related papers
- SPARQL Query Generation with LLMs: Measuring the Impact of Training Data Memorization and Knowledge Injection [81.78173888579941]
Large Language Models (LLMs) are considered a well-suited method to increase the quality of the question-answering functionality. LLMs are trained on web data, where researchers have no control over whether the benchmark or the knowledge graph was already included in the training data. This paper introduces a novel method that evaluates the quality of LLMs by generating a SPARQL query from a natural-language question.
arXiv Detail & Related papers (2025-07-18T12:28:08Z)
- Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering [73.73820209993515]
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs). Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability. Results show significant performance differences between the two domains.
arXiv Detail & Related papers (2025-05-22T12:27:02Z)
- LOVA3: Learning to Visual Question Answering, Asking and Assessment [61.51687164769517]
Question answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge. Current Multimodal Large Language Models (MLLMs) primarily focus on question answering, often neglecting the full potential of questioning and assessment skills. We introduce LOVA3, an innovative framework named "Learning tO Visual question Answering, Asking and Assessment".
arXiv Detail & Related papers (2024-05-23T18:21:59Z)
- FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z)
- KGQuiz: Evaluating the Generalization of Encoded Knowledge in Large Language Models [39.554274096542244]
KGQuiz is a knowledge-intensive benchmark to investigate the knowledge generalization abilities of large language models.
We evaluate 10 open-source and black-box LLMs on the KGQuiz benchmark across the five knowledge-intensive tasks and knowledge domains.
We envision KGQuiz as a testbed to analyze such nuanced variations in performance across domains and task formats.
arXiv Detail & Related papers (2023-10-15T04:00:36Z)
- Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators [78.63553017938911]
Large language models (LLMs) outperform information retrieval techniques for downstream knowledge-intensive tasks.
However, community concerns abound regarding the factuality and potential implications of using this uncensored knowledge.
We introduce CONNER, designed to evaluate generated knowledge from six important perspectives.
arXiv Detail & Related papers (2023-10-11T08:22:37Z)
- Investigating the Factual Knowledge Boundary of Large Language Models with Retrieval Augmentation [109.8527403904657]
We show that large language models (LLMs) possess unwavering confidence in their knowledge and cannot handle the conflict between internal and external knowledge well.
Retrieval augmentation proves to be an effective approach in enhancing LLMs' awareness of knowledge boundaries.
We propose a simple method to dynamically utilize supporting documents with our judgement strategy.
arXiv Detail & Related papers (2023-07-20T16:46:10Z)
- Using an LLM to Help With Code Understanding [13.53616539787915]
Large language models (LLMs) are revolutionizing the process of writing code.
Our plugin queries OpenAI's GPT-3.5-turbo model with four high-level requests without the user having to write explicit prompts.
We evaluate this system in a user study with 32 participants, which confirms that using our plugin can aid task completion more than web search.
arXiv Detail & Related papers (2023-07-17T00:49:06Z)
- Benchmarking Knowledge-Enhanced Commonsense Question Answering via Knowledge-to-Text Transformation [30.38055266965927]
We investigate how far we can get by exploiting external knowledge for Commonsense Question Answering.
We benchmark knowledge-enhanced CQA using a simple and effective knowledge-to-text transformation framework.
Experiments show that our knowledge-to-text framework is effective and achieves state-of-the-art performance on the CommonsenseQA dataset.
arXiv Detail & Related papers (2021-01-04T04:29:03Z)