Fugu-MT Paper Translation (Abstract): WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

Paper overview: WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

arxiv url: http://arxiv.org/abs/2605.22664v1
Date: Thu, 21 May 2026 16:06:34 GMT
Status: Translation complete
Last updated in system: 2026-05-22 20:14:18.602977
Title: WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance
Authors: Thomson Yen, Julian Poeltl, Harshith Srinivas Gear, Yilin Meng, Joshua Fan, Adam Shen, Yili Liu, Ali Bauyrzhan, Siri Du, Haoyang Liu, Daniel Guetta, Hongseok Namkoong
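Abstract summary: Frontier AI labs have developed agents that can construct entire spreadsheets from scratch. This is especially relevant in finance, where core workflows such as financial modeling, forecasting, and scenario analysis are commonly conducted through spreadsheets. Existing spreadsheet benchmarks do not measure this advanced capability, focusing instead on question-answering or single-formula edits.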
Reference score (attention metric computed independently by this site): 4.787072076364137
License: http://creativecommons.org/licenses/by/4.0/
Abstract: LLM agents are increasingly expected to carry out end-to-end workflows, producing complete artifacts from high-level user instructions. To meet enterprise needs, frontier AI labs have developed agents that can construct entire spreadsheets from scratch. This is especially relevant in finance, where core workflows such as financial modeling, forecasting, and scenario analysis are commonly conducted through spreadsheets. Yet, existing spreadsheet benchmarks do not measure this advanced capability, focusing instead on question-answering or single-formula edits. To address this gap, we provide one of the first evaluations of agents on end-to-end spreadsheet tasks, focusing on economically critical financial workflows such as modeling and scenario analysis. Since deliverables therein are routinely reviewed and revised by multiple stakeholders, judging their quality necessarily involves high-level criteria such as readability or ease of modification. To reflect the multidimensional nature of solution quality, we develop an evaluation taxonomy comprising three dimensions: Accuracy, Formula, and Format, each comprising fine-grained criteria that reflect professional standards. The Claude family leads the benchmark and produces the most professional-looking outputs in our qualitative review, but even the strongest agents frequently fall short of professional finance standards and degrade sharply as the difficulty increases beyond a few chained calculations. This suggests that current agents are not yet able to reliably produce professional-quality spreadsheets at the level of complexity real-world workflows demand.

Related papers