Fugu-MT Paper Translation (Abstract): QCoder Benchmark: Bridging Language Generation and Quantum Hardware through Simulator-Based Feedback

Paper abstract: QCoder Benchmark: Bridging Language Generation and Quantum Hardware through Simulator-Based Feedback

arxiv url: http://arxiv.org/abs/2510.26101v2
Date: Sat, 01 Nov 2025 03:02:22 GMT
Status: Translation complete
System update date: 2025-11-04 14:12:28.013608
Title: QCoder Benchmark: Bridging Language Generation and Quantum Hardware through Simulator-Based Feedback
Authors: Taku Mikuriya, Tatsuya Ishigaki, Masayuki Kawarada, Shunya Minami, Tadashi Kadowaki, Yohichi Suzuki, Soshun Naito, Shunya Takata, Takumi Kato, Tamotsu Basseda, Reo Yamada, Hiroya Takamura
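Abstract summary: This paper introduces QCoder Benchmark, an evaluation framework that assesses large language models (LLMs) on quantum programming. The benchmark supports evaluation in a quantum simulator environment beyond conventional Python execution. Even advanced models such as GPT-4o achieve only 18.97% accuracy, highlighting the difficulty of the benchmark. In contrast, reasoning-based models such as o3 reach 78% accuracy, exceeding the average success rate of human-written code.
Reference score (independently computed attention score): 7.355017519768158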
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have increasingly been applied to automatic programming code generation. This task can be viewed as a language generation task that bridges natural language, human knowledge, and programming logic. However, it remains underexplored in domains that require interaction with hardware devices, such as quantum programming, where human coders write Python code that is executed on a quantum computer. To address this gap, we introduce QCoder Benchmark, an evaluation framework that assesses LLMs on quantum programming with feedback from simulated hardware devices. Our benchmark offers two key features. First, it supports evaluation using a quantum simulator environment beyond conventional Python execution, allowing feedback of domain-specific metrics such as circuit depth, execution time, and error classification, which can be used to guide better generation. Second, it incorporates human-written code submissions collected from real programming contests, enabling both quantitative comparisons and qualitative analyses of LLM outputs against human-written code. Our experiments reveal that even advanced models like GPT-4o achieve only around 18.97% accuracy, highlighting the difficulty of the benchmark. In contrast, reasoning-based models such as o3 reach up to 78% accuracy, outperforming the average success rate of human-written code (39.98%). We release the QCoder Benchmark dataset and public evaluation API to support further research. (Code and datasets are available at https://qcoder-bench.github.io/ )
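As a rough illustration of the kind of simulator-side feedback the abstract mentions (circuit depth, execution time, and error classification), the following is a minimal pure-Python sketch. It is not the QCoder Benchmark API: the gate-list representation, the function names `circuit_depth` and `evaluate`, the `max_depth` limit, and the status labels are all assumptions made up for this example.

```python
import time

# Assumed, hypothetical circuit format: a list of (gate_name, qubit_indices).


def circuit_depth(gates, num_qubits):
    """Return the depth: the longest chain of gates that share qubits."""
    layer = [0] * num_qubits  # depth reached so far on each qubit
    for name, qubits in gates:
        for q in qubits:
            if not 0 <= q < num_qubits:
                raise IndexError(f"gate {name} uses invalid qubit {q}")
        # A gate starts one layer after the latest gate on any of its qubits.
        new_layer = max(layer[q] for q in qubits) + 1
        for q in qubits:
            layer[q] = new_layer
    return max(layer, default=0)


def evaluate(build_circuit, num_qubits, max_depth):
    """Run a submission and return feedback (status labels are illustrative)."""
    start = time.perf_counter()
    try:
        gates = build_circuit()
        depth = circuit_depth(gates, num_qubits)
    except Exception as exc:
        # Error classification: report the exception type instead of crashing.
        return {
            "status": "runtime_error",
            "detail": type(exc).__name__,
            "elapsed_s": time.perf_counter() - start,
        }
    elapsed = time.perf_counter() - start
    status = "ok" if depth <= max_depth else "depth_limit_exceeded"
    return {"status": status, "depth": depth, "elapsed_s": elapsed}


def bell_pair():
    # Hadamard on qubit 0, then a CNOT from qubit 0 to qubit 1.
    return [("h", (0,)), ("cx", (0, 1))]


feedback = evaluate(bell_pair, num_qubits=2, max_depth=5)
print(feedback["status"], feedback["depth"])
```

A feedback dictionary of this shape could in principle be passed back to a model as a hint for a revised attempt. How the actual benchmark structures its feedback is described in the paper and at the project URL, not here.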

Related papers