Fugu-MT Paper Translation (Abstract): QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies

Paper summary: QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies

arxiv url: http://arxiv.org/abs/2604.15151v1
Date: Thu, 16 Apr 2026 15:31:42 GMT
Status: Translation complete
Last updated in system: 2026-04-17 21:29:31.984659
Title: QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies
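Title (reference translation): QuantCode-Bench: A benchmark for evaluating the ability of large language models to generate executable algorithmic trading strategies
Authors: Alexey Khoroshilov, Alexey Chernysh, Orkhan Ekhtibarov, Nini Kamkia, Dmitry Zmitrovich
Abstract summary: We propose QuantCode-Bench, a benchmark for the systematic evaluation of modern LLMs in generating strategies for the Backtrader framework. The main limitations of current models lie not in syntax but in the correct operationalization of trading logic, proper API usage, and adherence to task semantics.
Reference score (independently computed attention score): 0.04660328753262074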
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models have demonstrated strong performance on general-purpose programming tasks, yet their ability to generate executable algorithmic trading strategies remains underexplored. Unlike standard code benchmarks, trading-strategy generation requires simultaneous mastery of domain-specific financial logic, knowledge of a specialized API, and the ability to produce code that is not only syntactically correct but also leads to actual trades on historical data. In this work, we present QuantCode-Bench, a benchmark for the systematic evaluation of modern LLMs in generating strategies for the Backtrader framework from textual descriptions in English. The benchmark contains 400 tasks of varying difficulty collected from Reddit, TradingView, StackExchange, GitHub, and synthetic sources. Evaluation is conducted through a multi-stage pipeline that checks syntactic correctness, successful backtest execution, the presence of trades, and semantic alignment with the task description using an LLM judge. We compare state-of-the-art models in two settings: single-turn, where the strategy must be generated correctly on the first attempt, and agentic multi-turn, where the model receives iterative feedback and may repair its errors. We analyze the failure modes across different stages of the pipeline and show that the main limitations of current models are not related to syntax, but rather to the correct operationalization of trading logic, proper API usage, and adherence to task semantics. These findings suggest that trading strategy generation constitutes a distinct class of domain-specific code generation tasks in which success requires not only technical correctness, but also alignment between natural-language descriptions, financial logic, and the observable behavior of the strategy on data.
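Abstract (reference translation): Large language models have shown strong performance on general-purpose programming tasks, but their ability to generate executable algorithmic trading strategies remains largely unexplored. Unlike standard code benchmarks, trading-strategy generation requires simultaneous mastery of domain-specific financial logic, knowledge of a specialized API, and the ability to produce code that is not only syntactically correct but also leads to actual trades on historical data. In this paper, we introduce QuantCode-Bench, a benchmark for the systematic evaluation of modern LLMs. The benchmark contains 400 tasks of varying difficulty collected from Reddit, TradingView, StackExchange, GitHub, and synthetic sources. Evaluation is carried out through a multi-stage pipeline that checks syntactic correctness, successful backtest execution, the presence of trades, and semantic alignment with the task description using an LLM judge. We compare state-of-the-art models in two settings: single-turn, where the strategy must be generated correctly on the first attempt, and agentic multi-turn, where the model receives iterative feedback and repairs its errors. We analyze failure modes across the stages of the pipeline and show that the main limitations of current models lie not in syntax but in the correct operationalization of trading logic, proper API usage, and adherence to task semantics. These results suggest that trading-strategy generation constitutes a distinct class of domain-specific code generation tasks, one that requires not only technical correctness but also alignment between natural-language descriptions, financial logic, and the observable behavior of the strategy on data.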
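
The staged evaluation and the two settings described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `StageResult`, `evaluate`, `solve`, `run_backtest`, `judge`, and `generate` are all hypothetical, the backtest and the LLM judge are passed in as plain callables rather than wired to Backtrader or to a real model, and only the syntax stage is concrete (it uses Python's standard `ast.parse`). The sketch only shows the ordering of the checks and how the first failing stage could be turned into feedback for a repair loop.

```python
import ast
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class StageResult:
    passed: bool
    failed_stage: Optional[str] = None  # first stage that failed, if any
    feedback: str = ""                  # message that could be fed back to the model


def evaluate(code: str,
             task: str,
             run_backtest: Callable[[str], int],
             judge: Callable[[str, str], bool]) -> StageResult:
    """Run the four checks in order and stop at the first failure.

    Assumptions (hypothetical interfaces, not from the paper):
    run_backtest(code) returns the number of trades and raises on failure;
    judge(task, code) returns True when the code matches the description.
    """
    try:
        ast.parse(code)  # stage 1: syntactic correctness
    except SyntaxError as exc:
        return StageResult(False, "syntax", f"syntax error: {exc.msg}")

    try:
        n_trades = run_backtest(code)  # stage 2: backtest runs to completion
    except Exception as exc:
        return StageResult(False, "backtest", f"backtest failed: {exc}")

    if n_trades == 0:  # stage 3: at least one trade on historical data
        return StageResult(False, "trades", "strategy placed no trades")

    if not judge(task, code):  # stage 4: semantic alignment with the task
        return StageResult(False, "semantics", "code does not match the task")

    return StageResult(True)


def solve(task: str,
          generate: Callable[[str, str], str],
          run_backtest: Callable[[str], int],
          judge: Callable[[str, str], bool],
          max_turns: int = 1) -> Tuple[StageResult, int]:
    """max_turns=1 mimics single-turn; larger values mimic multi-turn repair."""
    feedback = ""
    result = StageResult(False)
    for turn in range(1, max_turns + 1):
        code = generate(task, feedback)  # the model sees the previous failure message
        result = evaluate(code, task, run_backtest, judge)
        if result.passed:
            return result, turn
        feedback = result.feedback
    return result, max_turns
```

In this sketch, calling `solve(..., max_turns=1)` corresponds to the single-turn setting, while a larger `max_turns` lets the hypothetical `generate` callable see the previous failure message and try again. Recording `failed_stage` per task is one simple way to tabulate failure modes by pipeline stage.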


Related papers