Fugu-MT Paper Translation (Abstract): Delulu: A Verified Multi-Lingual Benchmark for Code Hallucination Detection in Fill-in-the-Middle Tasks

Paper overview: Delulu: A Verified Multi-Lingual Benchmark for Code Hallucination Detection in Fill-in-the-Middle Tasks

arxiv url: http://arxiv.org/abs/2605.07024v1
Date: Thu, 07 May 2026 23:12:00 GMT
Status: Translation complete
Last updated in system: 2026-05-11 19:43:38.666851
Title: Delulu: A Verified Multi-Lingual Benchmark for Code Hallucination Detection in Fill-in-the-Middle Tasks
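Title (reference translation): Delulu: A Multi-Lingual Benchmark for Code Hallucination Detection in Fill-in-the-Middle Tasks
Authors: Mahdi Erfanian, Nelson Daniel Troncoso, Aashna Garg, Amabel Gale, Xiaoyu Liu, Pareesa Ameneh Golnari, Shengyu Fu
Abstract summary: Large language models for code generation frequently produce hallucinations in Fill-in-the-Middle (FIM) tasks. Delulu is a verified multi-lingual benchmark of 1,951 FIM samples across 7 languages and 4 hallucination types. The authors evaluate 11 open-weight FIM models from five families spanning 0.5B-32B parameters.
Reference score (independently computed attention metric): 4.089259624354187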
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models for code generation frequently produce hallucinations in Fill-in-the-Middle (FIM) tasks -- plausible but incorrect completions such as invented API methods, invalid parameters, undefined variables, or non-existent imports. These failures pass superficial review yet introduce runtime errors. We introduce Delulu, a verified multi-lingual benchmark of 1,951 FIM samples across 7 languages and 4 hallucination types. Samples are curated through an adversarial pipeline: a frontier LLM generates plausible hallucinations, four diverse judge models evaluate them, embedding-based clustering mines progressively harder examples, self-contained Docker containers verify that golden completions compile while hallucinated variants produce the expected runtime error, and a final human-expert review removes any remaining biased or trivially decidable samples. We evaluate 11 open-weight FIM models from five families spanning 0.5B-32B parameters: a six-point Qwen2.5-Coder scaling slate, plus a cross-family slate (CodeLlama, DeepSeek-Coder-V2, StarCoder2). The strongest model reaches only 84.5% pass@1, no family exceeds 0.77 Edit Similarity, and every family produces hallucination-aligned completions on a non-trivial share of samples, confirming that the difficulty exposed by Delulu is task-intrinsic rather than family-specific. We release the benchmark, containers, and evaluation framework at https://github.com/microsoft/delulu.
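Abstract (reference translation): Large language models for code generation often produce hallucinations in Fill-in-the-Middle (FIM) tasks: plausible but incorrect completions such as invented API methods, invalid parameters, undefined variables, or non-existent imports. These failures pass superficial review but cause runtime errors. Delulu is a verified multi-lingual benchmark of 1,951 FIM samples across 7 languages and 4 hallucination types. Samples are curated through an adversarial pipeline: a frontier LLM generates plausible hallucinations, four diverse judge models evaluate them, embedding-based clustering mines progressively harder examples, self-contained Docker containers verify that golden completions compile while hallucinated variants produce the expected runtime error, and a final human-expert review removes any remaining biased or trivially decidable samples. The authors evaluate 11 open-weight FIM models from five families spanning 0.5B-32B parameters: a six-point Qwen2.5-Coder scaling slate plus a cross-family slate (CodeLlama, DeepSeek-Coder-V2, StarCoder2). The strongest model reaches only 84.5% pass@1, no family exceeds 0.77 Edit Similarity, and every family produces hallucination-aligned completions on a non-trivial share of samples, confirming that the difficulty exposed by Delulu is task-intrinsic rather than family-specific. The benchmark, containers, and evaluation framework are released at https://github.com/microsoft/delulu.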
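The abstract reports pass@1 and Edit Similarity but this page does not give their exact definitions. The following is a minimal sketch of how such metrics are commonly computed, assuming Edit Similarity is one minus the character-level Levenshtein distance normalized by the longer string, and pass@1 is the fraction of samples whose single completion passes verification. The function names and the `closer_to_hallucination` check are hypothetical illustrations, not the Delulu evaluation framework's API.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(prediction: str, reference: str) -> float:
    """Assumed definition: 1 - distance / length of the longer string."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(prediction, reference) / longest


def pass_at_1(results: list[bool]) -> float:
    """Fraction of samples whose single completion passed verification."""
    return sum(results) / len(results) if results else 0.0


def closer_to_hallucination(prediction: str, golden: str, hallucinated: str) -> bool:
    """Hypothetical check: is the prediction textually nearer the hallucinated variant?"""
    return edit_similarity(prediction, hallucinated) > edit_similarity(prediction, golden)


if __name__ == "__main__":
    # Made-up example strings, not taken from the benchmark.
    golden = "response.json()"
    hallucinated = "response.to_json()"
    prediction = "response.to_json()"
    print(edit_similarity(prediction, golden))
    print(closer_to_hallucination(prediction, golden, hallucinated))
    print(pass_at_1([True, False, True, True]))
```

In the paper's setting the pass/fail signal would presumably come from running each completion inside the verification containers; the boolean list here simply stands in for those outcomes.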

Related papers