Fugu-MT Paper Translation (Summary): HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks

Paper summary: HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks

arxiv url: http://arxiv.org/abs/2604.14709v2
Date: Thu, 23 Apr 2026 13:10:11 GMT
Status: Translation complete
System update date: 2026-04-24 14:40:05.981754
Title: HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks
Authors: Fan Cui, Hongyuan Hou, Zizhang Luo, Chenyun Yin, Yun Liang
Abstract summary: Existing benchmarks mainly evaluate Large Language Models (LLMs) on isolated, component-level tasks. HWE-Bench is the first large-scale, repository-level benchmark for evaluating LLM agents on real-world hardware bug repair tasks.
Reference score (independently computed attention score): 3.958773019872771
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing benchmarks for hardware design primarily evaluate Large Language Models (LLMs) on isolated, component-level tasks such as generating HDL modules from specifications, leaving repository-scale evaluation unaddressed. We introduce HWE-Bench, the first large-scale, repository-level benchmark for evaluating LLM agents on real-world hardware bug repair tasks. HWE-Bench comprises 417 task instances derived from real historical bug-fix pull requests across six major open-source projects spanning both Verilog/SystemVerilog and Chisel, covering RISC-V cores, SoCs, and security roots-of-trust. Each task is grounded in a fully containerized environment where the agent must resolve a real bug report, with correctness validated through the project's native simulation and regression flows. The benchmark is built through a largely automated pipeline that enables efficient expansion to new repositories. We evaluate seven LLMs with four agent frameworks and find that the best agent resolves 70.7% of tasks overall, with performance exceeding 90% on smaller cores but dropping below 65% on complex SoC-level projects. We observe larger performance gaps across models than commonly reported on software benchmarks, and difficulty is driven by project scope and bug-type distribution rather than code size alone. Our failure analysis traces agent failures to three stages of the debugging process: fault localization, hardware-semantic reasoning, and cross-artifact coordination across RTL, configuration, and verification components, providing concrete directions for developing more capable hardware-aware agents.


Related papers