Fugu-MT paper translation (abstract): SimpleDevQA: Benchmarking Large Language Models on Development Knowledge QA

Paper summary: SimpleDevQA: Benchmarking Large Language Models on Development Knowledge QA

arxiv url: http://arxiv.org/abs/2512.08867v1
Date: Tue, 09 Dec 2025 17:58:36 GMT
Status: translation complete
Last updated in system: 2025-12-10 22:28:08.076961
Title: SimpleDevQA: Benchmarking Large Language Models on Development Knowledge QA
Title (reference translation): SimpleDevQA: Benchmarking Large Language Models with QA on Development Knowledge
Authors: Jing Zhang, Lianghong Guo, Yanlin Wang, Mingwei Liu, Jiachi Chen, Yuchi Ma, Ensheng Shi, Terry Yue Zhuo, Hongyu Zhang, Zibin Zheng
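Abstract summary: The Dev Knowledge QA task accounts for 39.6% of interactions. Only 27.5% of real Dev Knowledge QA dialogues focus on code understanding. Only 17.1% of real-world Dev Knowledge QA dialogues can be used for constructing a benchmark.
Reference score (independently computed attention score): 58.75982433502236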
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The Development Knowledge Question Answering (Dev Knowledge QA) task aims to provide natural language answers to knowledge-seeking questions during software development. To investigate its importance and to what extent it has been explored, we analyze real user-LLM dialogues from WildChat and find that: (1) The Dev Knowledge QA task accounts for 39.6% of interactions (highest among all tasks), revealing broad knowledge needs beyond code generation (32.3%). (2) Only 27.5% of real Dev Knowledge QA dialogues focus on code understanding, leaving out development knowledge-seeking. (3) Only 17.1% of real-world Dev Knowledge QA dialogues can be used for constructing a benchmark. Existing benchmarks have two primary limitations for evaluating the Dev Knowledge QA capability of LLMs. First, existing benchmarks offer a limited development knowledge scope, mainly focusing on code understanding and neglecting broader knowledge during development. Second, some benchmarks are not built from real user queries. To bridge this gap, we design a three-phase pipeline that transforms real-world dialogues into simple development knowledge-seeking QA pairs. Through this pipeline, we introduce SimpleDevQA, a multilingual benchmark derived from real user dialogues. It contains 2,740 QA pairs in three languages (English, Chinese, and Russian), and focuses on questions with unique, short, and verifiable answers for accurate and simple evaluation. Experiments show that: code LLMs generally outperform general LLMs of similar scale; knowledge injection with the Retrieval-Augmented Generation (RAG) strategy can boost LLM accuracy by 11.3% on average; LLMs show systematic overconfidence in Dev Knowledge QA, and the answering accuracy of LLMs shows a positive correlation with their stated confidence; and, generally, LLMs with stronger code generation performance also exhibit stronger performance in Dev Knowledge QA.
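Abstract (reference translation): The Development Knowledge Question Answering (Dev Knowledge QA) task aims to provide natural language answers to knowledge-seeking questions that arise during software development. Analyzing real user-LLM dialogues from WildChat, this study finds that: (1) the Dev Knowledge QA task accounts for 39.6% of interactions (the highest among all tasks), revealing a broad need for knowledge beyond code generation (32.3%); (2) only 27.5% of real Dev Knowledge QA dialogues focus on code understanding, leaving out development knowledge-seeking; (3) only 17.1% of real-world Dev Knowledge QA dialogues can be used for constructing a benchmark. Existing benchmarks have two main limitations for evaluating the Dev Knowledge QA capability of LLMs. First, they cover a limited scope of development knowledge, focusing mainly on code understanding and neglecting the broader knowledge needed during development. Second, some benchmarks are not built from real user queries. To bridge this gap, the authors design a three-phase pipeline that transforms real-world dialogues into simple development knowledge-seeking QA pairs. Through this pipeline, they introduce SimpleDevQA, a multilingual benchmark derived from real user dialogues. It contains 2,740 QA pairs in three languages (English, Chinese, and Russian) and focuses on questions with unique, short, and verifiable answers for accurate and simple evaluation. Experiments show that code LLMs generally outperform general LLMs of similar scale; that knowledge injection with the Retrieval-Augmented Generation (RAG) strategy improves LLM accuracy by 11.3% on average; that LLMs show systematic overconfidence in Dev Knowledge QA, with answering accuracy positively correlated with stated confidence; and that LLMs with stronger code generation performance generally also perform better in Dev Knowledge QA.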
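The two evaluation ideas mentioned in the abstract, accuracy on short verifiable answers and overconfidence relative to stated confidence, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `Prediction` record, the exact-match scoring after whitespace and case normalization (the benchmark may grade answers differently), the 0.5 confidence threshold, and the sample data.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Prediction:
    """Hypothetical record for one QA item (not the paper's data format)."""
    answer: str        # short answer produced by the model
    reference: str     # gold short answer
    confidence: float  # confidence stated by the model, in [0, 1]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def is_correct(p: Prediction) -> bool:
    """Assumed scoring rule: exact match after normalization."""
    return normalize(p.answer) == normalize(p.reference)


def accuracy(preds: List[Prediction]) -> float:
    """Fraction of predictions that match their reference."""
    return sum(is_correct(p) for p in preds) / len(preds)


def overconfidence_gap(preds: List[Prediction]) -> float:
    """Mean stated confidence minus accuracy; positive means overconfident."""
    mean_conf = sum(p.confidence for p in preds) / len(preds)
    return mean_conf - accuracy(preds)


def accuracy_by_confidence(
    preds: List[Prediction], threshold: float = 0.5
) -> Tuple[Optional[float], Optional[float]]:
    """Accuracy below and at-or-above the threshold (None for an empty group)."""
    low = [p for p in preds if p.confidence < threshold]
    high = [p for p in preds if p.confidence >= threshold]
    return (
        accuracy(low) if low else None,
        accuracy(high) if high else None,
    )


if __name__ == "__main__":
    # Made-up sample data, for demonstration only.
    sample = [
        Prediction("git rebase", "Git Rebase", 0.9),
        Prediction("443", "443", 0.8),
        Prediction("O(n)", "O(log n)", 0.9),
        Prediction("tcp", "UDP", 0.3),
    ]
    print("accuracy:", accuracy(sample))
    print("overconfidence gap:", overconfidence_gap(sample))
    print("accuracy (low, high confidence):", accuracy_by_confidence(sample))
```

In this sketch, a positive gap corresponds to overconfidence, and a higher accuracy in the high-confidence group than in the low-confidence group corresponds to a positive relation between stated confidence and correctness.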


Related papers