Fugu-MT Paper Translation (Abstract): CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging

Paper overview: CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging

arxiv url: http://arxiv.org/abs/2605.08455v1
Date: Fri, 08 May 2026 20:24:32 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:49.661173
Title: CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging
Title (reference translation): CUDABeaver: A Benchmark for LLM-Based Automated CUDA Debugging
Authors: Shiyang Li, Haoyang Chen, Mattia Fazzini, Caiwen Ding,
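Abstract summary: We introduce CUDABEAVER, a benchmark for CUDA debugging built from real failing workspaces produced during LLM-based CUDA generation. Each task provides the broken candidate, native build/test commands, raw error evidence, and a single editable file. We show that protocol-aware evaluation gives a more faithful view of CUDA debugging ability, which depends strongly on the performance-loss tolerance.
Reference score (independently computed attention score): 18.460942231908376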
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Debugging CUDA programs has long been challenging because failures often arise from subtle interactions among hardware behavior, compiler decisions, memory hierarchy, and asynchronous execution. More importantly, with the rapid expansion of GPU usage across scientific computing, machine learning, graphics, and systems workloads, CUDA debugging has become more challenging than ever. Current evaluations of LLM-based CUDA programming largely miss this setting: a model can pass correctness tests with repair by degeneration, simplifying the CUDA code into a safer but slower program that abandons the original optimization structure. We introduce CUDABEAVER, a benchmark for CUDA debugging from real failing workspaces produced during LLM-based CUDA generation. Each task provides the broken candidate, native build/test commands, raw error evidence, and a single editable file. CUDABEAVER evaluates whether a fixer truly repairs the failing CUDA code or merely finds a slower test-passing replacement, reporting results by failure category, debugging trajectory, stagnation mode, and performance preservation. We further propose pass@k(M,C,A), a protocol-conditional CUDA debugging metric that makes the fixer M, corpus C, and protocol axes A explicit. Using this metric across 213 tasks and seven frontier LLMs, we show that protocol-aware evaluation gives a more faithful view of CUDA debugging ability: when performance-loss tolerance is high, fixers appear much stronger, but even a slightly stricter performance requirement can sharply reduce measured success, shifting scores by up to 40 percentage points.
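Abstract (reference translation): Debugging CUDA programs has long been difficult because failures often arise from subtle interactions among hardware behavior, compiler decisions, the memory hierarchy, and asynchronous execution. More importantly, with the rapid expansion of GPU use across scientific computing, machine learning, graphics, and systems workloads, CUDA debugging is harder than ever. Current evaluations of LLM-based CUDA programming largely miss this setting: a model can pass correctness tests through repair by degeneration, simplifying the CUDA code into a safer but slower program that abandons the original optimization structure. We introduce CUDABEAVER, a benchmark for CUDA debugging from real failing workspaces produced during LLM-based CUDA generation. Each task provides the broken candidate, native build/test commands, raw error evidence, and a single editable file. CUDABEAVER evaluates whether a fixer truly repairs the failing CUDA code or merely finds a slower replacement that passes the tests, and reports results by failure category, debugging trajectory, stagnation mode, and performance preservation. We further propose pass@k(M,C,A), a protocol-conditional CUDA debugging metric that makes the fixer M, corpus C, and protocol axes A explicit. Using this metric across 213 tasks and seven frontier LLMs, we show that protocol-aware evaluation gives a more faithful view of CUDA debugging ability.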
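
The abstract names pass@k(M,C,A) but does not give its formula here, so the following is only a minimal sketch of how a protocol-conditional pass@k could be computed. It assumes the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), and models a single protocol axis, a performance-loss tolerance. The `Attempt` record, the `slowdown` field, and the `max_slowdown` parameter are hypothetical names introduced for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from math import comb


@dataclass
class Attempt:
    """One repair attempt by a fixer on one task (hypothetical record layout)."""
    tests_pass: bool   # did the native test command succeed?
    slowdown: float    # repaired runtime / reference runtime (1.0 = no loss)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator for n attempts, c of which succeeded."""
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Every size-k subset contains at least one success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def protocol_pass_at_k(corpus: list[list[Attempt]], k: int,
                       max_slowdown: float) -> float:
    """Mean pass@k over a corpus, counting an attempt as a success only if
    it passes the tests AND stays within the assumed slowdown tolerance."""
    scores = []
    for attempts in corpus:
        successes = sum(
            a.tests_pass and a.slowdown <= max_slowdown for a in attempts
        )
        scores.append(pass_at_k(len(attempts), successes, k))
    return sum(scores) / len(scores)


# Toy corpus: two tasks, two attempts each. The second task is only
# "fixed" by a replacement that runs three times slower.
corpus = [
    [Attempt(True, 1.0), Attempt(False, 1.0)],
    [Attempt(True, 3.0), Attempt(False, 1.0)],
]
lenient = protocol_pass_at_k(corpus, k=1, max_slowdown=5.0)
strict = protocol_pass_at_k(corpus, k=1, max_slowdown=1.1)
print(lenient, strict)
```

On this toy corpus the score drops when the tolerance is tightened, because the slow test-passing replacement no longer counts as a repair. This mirrors the effect the abstract describes, although the actual protocol axes and thresholds used by CUDABEAVER may differ.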

Related papers