Fugu-MT Paper Translation (Abstract): RepoDebug: Repository-Level Multi-Task and Multi-Language Debugging Evaluation of Large Language Models

Paper overview: RepoDebug: Repository-Level Multi-Task and Multi-Language Debugging Evaluation of Large Language Models

arxiv url: http://arxiv.org/abs/2509.04078v2
Date: Mon, 08 Sep 2025 08:22:38 GMT
Status: Translation complete
Last updated in system: 2025-09-09 14:07:03.381287
Title: RepoDebug: Repository-Level Multi-Task and Multi-Language Debugging Evaluation of Large Language Models
Title (reference translation): RepoDebug: Repository-Level Multi-Task and Multi-Language Debugging Evaluation of Large Language Models
Authors: Jingjing Liu, Zeming Liu, Zihao Cheng, Mengliang He, Xiaoming Shi, Yuhang Guo, Xiangrong Zhu, Yuanfang Guo, Yunhong Wang, Haifeng Wang
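Abstract summary: Large Language Models (LLMs) are highly proficient at code debugging. This paper introduces RepoDebug, a multi-task and multi-language repository-level code debugging dataset. Claude 3.5 Sonnet, the best-performing model, still does not perform well in repository-level debugging.
Reference score (independently computed attention metric): 49.83481415540291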
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have exhibited significant proficiency in code debugging, especially in automatic program repair, which may substantially reduce the time developers spend on debugging and enhance their efficiency. Significant advancements in debugging datasets have been made to promote the development of code debugging. However, these datasets primarily focus on assessing LLMs' function-level code repair capabilities, neglecting the more complex and realistic repository-level scenarios, which leads to an incomplete understanding of the challenges LLMs face in repository-level debugging. While several repository-level datasets have been proposed, they often suffer from limited diversity of tasks, languages, and error types. To address these limitations, this paper introduces RepoDebug, a multi-task and multi-language repository-level code debugging dataset with 22 subtypes of errors that supports 8 commonly used programming languages and 3 debugging tasks. Furthermore, we conduct evaluation experiments on 10 LLMs, where Claude 3.5 Sonnet, the best-performing model, still cannot perform well in repository-level debugging.
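Abstract (reference translation): Large language models (LLMs) have shown strong proficiency in code debugging, particularly in automatic program repair, and may substantially reduce developers' time and improve their efficiency. Debugging datasets have advanced considerably to promote the development of code debugging. However, these datasets focus mainly on evaluating LLMs' function-level code repair capabilities and neglect the more complex and realistic repository-level scenarios. Several repository-level datasets have been proposed, but they often suffer from limited diversity of tasks, languages, and error types. To mitigate this challenge, the paper introduces RepoDebug, a multi-task and multi-language repository-level code debugging dataset. Furthermore, Claude 3.5 Sonnet, the best-performing model, still does not perform well in repository-level debugging.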
