Fugu-MT Paper Translation (Abstract): Developer-LLM Conversations: An Empirical Study of Interactions and Generated Code Quality

Paper summary: Developer-LLM Conversations: An Empirical Study of Interactions and Generated Code Quality

arxiv url: http://arxiv.org/abs/2509.10402v1
Date: Fri, 12 Sep 2025 16:52:49 GMT
Status: Translation complete
System update date: 2025-09-15 16:03:08.16907
Title: Developer-LLM Conversations: An Empirical Study of Interactions and Generated Code Quality
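Title (reference translation): Developer-LLM Conversations: An Empirical Study of Interactions and Generated Code Quality
Authors: Suzhen Zhong, Ying Zou, Bram Adams
Abstract summary: Large language models (LLMs) are becoming integral to modern software development. We leverage CodeChat, a dataset of real-world developer-LLM conversations. We find that LLM responses are substantially longer than developer prompts.
Reference score (independently computed attention score): 4.05144752916486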
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) are becoming integral to modern software development workflows, assisting developers with code generation, API explanation, and iterative problem-solving through natural language conversations. Despite widespread adoption, there is limited understanding of how developers interact with LLMs in practice and how these conversational dynamics influence task outcomes, code quality, and software engineering workflows. To address this, we leverage CodeChat, a large dataset comprising 82,845 real-world developer-LLM conversations, containing 368,506 code snippets generated across over 20 programming languages, derived from the WildChat dataset. We find that LLM responses are substantially longer than developer prompts, with a median token-length ratio of 14:1. Multi-turn conversations account for 68% of the dataset and often evolve due to shifting requirements, incomplete prompts, or clarification requests. Topic analysis identifies web design (9.6% of conversations) and neural network training (8.7% of conversations) as the most frequent LLM-assisted tasks. Evaluation across five languages (i.e., Python, JavaScript, C++, Java, and C#) reveals prevalent and language-specific issues in LLM-generated code: generated Python and JavaScript code often include undefined variables (83.4% and 75.3% of code snippets, respectively); Java code lacks required comments (75.9%); C++ code frequently omits headers (41.1%) and C# code shows unresolved namespaces (49.2%). During a conversation, syntax and import errors persist across turns; however, documentation quality in Java improves by up to 14.7%, and import handling in Python improves by 3.7% over 5 turns. Prompts that point out mistakes in code generated in prior turns and explicitly request a fix are most effective for resolving errors.
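Abstract (reference translation): Large language models (LLMs) are becoming integral to modern software development workflows, supporting developers with code generation, API explanation, and iterative problem-solving through natural-language conversation. Despite their wide adoption, there is limited understanding of how developers actually interact with LLMs and how these conversational dynamics affect task outcomes, code quality, and software engineering workflows. To address this, we leverage CodeChat, a large dataset derived from the WildChat dataset that comprises 82,845 real-world developer-LLM conversations containing 368,506 code snippets generated in more than 20 programming languages. We find that LLM responses are substantially longer than developer prompts, with a median token-length ratio of 14:1. Multi-turn conversations make up 68% of the dataset and often evolve because of shifting requirements, incomplete prompts, or clarification requests. Topic analysis identifies web design (9.6% of conversations) and neural network training (8.7% of conversations) as the most frequent LLM-assisted tasks. Evaluation across five languages (Python, JavaScript, C++, Java, and C#) reveals prevalent and language-specific issues in LLM-generated code: generated Python and JavaScript code often contains undefined variables (83.4% and 75.3% of code snippets, respectively), Java code lacks required comments (75.9%), C++ code frequently omits headers (41.1%), and C# code shows unresolved namespaces (49.2%). Within a conversation, syntax and import errors persist across turns; however, documentation quality in Java improves by up to 14.7%, and import handling in Python improves by 3.7% over 5 turns. Prompts that point out mistakes in code generated in earlier turns and explicitly request a fix are the most effective at resolving errors.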
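
Illustrative note: the abstract reports two conversation-level statistics, a median response-to-prompt token-length ratio (14:1) and a multi-turn share (68%). The sketch below shows one way such figures might be computed. It is not the authors' pipeline: the whitespace tokenizer, the per-turn definition of the ratio, the function names, and the toy conversations are all assumptions made for illustration.

```python
from statistics import median


def token_count(text: str) -> int:
    # Assumption: whitespace splitting stands in for a real tokenizer.
    return len(text.split())


def median_length_ratio(turns):
    """Median of response/prompt token-count ratios over (prompt, response) pairs.

    Assumption: the ratio is taken per turn; the paper may aggregate differently.
    Turns with an empty prompt are skipped to avoid division by zero.
    """
    ratios = [
        token_count(response) / token_count(prompt)
        for prompt, response in turns
        if token_count(prompt) > 0
    ]
    return median(ratios)


def multi_turn_share(conversations):
    """Fraction of conversations that contain more than one turn."""
    multi = sum(1 for conversation in conversations if len(conversation) > 1)
    return multi / len(conversations)


# Hypothetical toy data: each conversation is a list of (prompt, response) turns.
toy_conversations = [
    [("sort a list", "Use sorted(items) to get a new sorted list")],
    [
        ("fix this bug", "The loop index is off by one here"),
        ("still fails", "Check the upper bound of the range call too"),
    ],
]

if __name__ == "__main__":
    all_turns = [turn for conversation in toy_conversations for turn in conversation]
    print("median response/prompt ratio:", median_length_ratio(all_turns))
    print("multi-turn share:", multi_turn_share(toy_conversations))
```

A subword tokenizer would give different absolute counts than whitespace splitting, so this sketch only conveys the shape of the calculation, not the reported values.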

Related papers list