Fugu-MT Paper Translation (Abstract): An Empirical Evaluation of Locally Deployed LLMs for Bug Detection in Python Code

Paper overview: An Empirical Evaluation of Locally Deployed LLMs for Bug Detection in Python Code

arxiv url: http://arxiv.org/abs/2604.23361v1
Date: Sat, 25 Apr 2026 16:05:30 GMT
Status: Translation complete
Last updated in system: 2026-04-28 17:12:07.294932
Title: An Empirical Evaluation of Locally Deployed LLMs for Bug Detection in Python Code
Authors: Jelena Ilić Vulićević
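Abstract summary: Large language models (LLMs) have demonstrated strong performance on a wide range of software engineering tasks. The paper systematically evaluates two locally deployed LLMs, LLaMA 3.2 and Mistral, for real-world Python bug detection. It evaluates 349 bugs across 17 projects using a zero-shot prompting approach at the function level and an automated keyword-based evaluation framework.
Reference score (independently computed attention score): 0.0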
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have demonstrated strong performance on a wide range of software engineering tasks, including code generation and analysis. However, most prior work relies on cloud-based models or specialized hardware, limiting practical applicability in privacy-sensitive or resource-constrained environments. In this paper, we present a systematic empirical evaluation of two locally deployed LLMs, LLaMA 3.2 and Mistral, for real-world Python bug detection using the BugsInPy benchmark. We evaluate 349 bugs across 17 projects using a zero-shot prompting approach at the function level and an automated keyword-based evaluation framework. Our results show that locally executed models achieve accuracy between 43% and 45%, while producing a large proportion of partially correct responses that identify problematic code regions without pinpointing the exact fix. Performance varies significantly across projects, highlighting the importance of codebase characteristics. The results demonstrate that local models can identify a meaningful share of bugs, though precise localization remains difficult for locally executed LLMs, particularly when handling complex and context-dependent bugs in realistic development scenarios.
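The abstract names two mechanisms, a zero-shot prompt built at the function level and an automated keyword-based evaluation, but gives no details of either. The following is a minimal sketch of how such a pipeline could look, not the paper's implementation. The prompt wording, the `BugCase` structure, the two keyword lists, and the three-way labels ("correct", "partial", "incorrect") are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class BugCase:
    """One benchmark item (hypothetical structure, not taken from the paper)."""
    function_source: str
    fix_keywords: list[str]     # assumed: terms that describe the exact fix
    region_keywords: list[str]  # assumed: terms that name the faulty code region


def build_zero_shot_prompt(function_source: str) -> str:
    """Build a zero-shot prompt: one instruction, one function, no examples."""
    return (
        "The following Python function contains a bug. "
        "Identify the bug and explain how to fix it.\n\n"
        f"{function_source}"
    )


def classify_response(response: str, case: BugCase) -> str:
    """Label a model response by case-insensitive keyword matching."""
    text = response.lower()
    if any(keyword.lower() in text for keyword in case.fix_keywords):
        return "correct"
    if any(keyword.lower() in text for keyword in case.region_keywords):
        return "partial"
    return "incorrect"


def accuracy(labels: list[str]) -> float:
    """Share of responses labeled 'correct'; partial matches do not count."""
    if not labels:
        return 0.0
    return sum(label == "correct" for label in labels) / len(labels)
```

In this sketch a response that mentions only the faulty region is labeled "partial", which mirrors the abstract's distinction between identifying a problematic code region and pinpointing the exact fix. Substring matching is a deliberately simple choice here; it can mislabel responses that mention a keyword while reaching the wrong conclusion.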
