Fugu-MT Paper Translation (Abstract): Jackal: A Real-World Execution-Based Benchmark Evaluating Large Language Models on Text-to-JQL Tasks

Paper overview: Jackal: A Real-World Execution-Based Benchmark Evaluating Large Language Models on Text-to-JQL Tasks

arxiv url: http://arxiv.org/abs/2509.23579v1
Date: Sun, 28 Sep 2025 02:23:22 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.301522
Title: Jackal: A Real-World Execution-Based Benchmark Evaluating Large Language Models on Text-to-JQL Tasks
Title (reference translation): Jackal: A real-world, execution-based benchmark for evaluating large language models on text-to-JQL tasks
Authors: Kevin Frank, Anmol Gulati, Elias Lumer, Sindy Campagna, Vamse Kumar Subbiah,
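Abstract summary: There is no open, real-world, execution-based benchmark for mapping natural language queries to Jira JQL. We introduce Jackal, a novel large-scale text-to-JQL benchmark comprising 100,000 natural language (NL) requests paired with validated JQL queries and execution-based results on a live Jira instance with over 200,000 issues. We report text-to-JQL results for 23 Large Language Models (LLMs) spanning parameter sizes and open and closed source models, across execution accuracy, exact match, and canonical exact match.
Reference score (independently computed attention score): 0.9374059084973779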
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Enterprise teams rely on the Jira Query Language (JQL) to retrieve and filter issues from Jira. Yet, to our knowledge, there is no open, real-world, execution-based benchmark for mapping natural language queries to JQL. We introduce Jackal, a novel, large-scale text-to-JQL benchmark comprising 100,000 natural language (NL) requests paired with validated JQL queries and execution-based results on a live Jira instance with over 200,000 issues. To reflect real-world usage, each JQL query is associated with four types of user requests: (i) Long NL, (ii) Short NL, (iii) Semantically Similar, and (iv) Semantically Exact. We release Jackal, a corpus of 100,000 text-to-JQL pairs, together with an execution-based scoring toolkit, and a static snapshot of the evaluated Jira instance for reproducibility. We report text-to-JQL results on 23 Large Language Models (LLMs) spanning parameter sizes, open and closed source models, across execution accuracy, exact match, and canonical exact match. In this paper, we report results on Jackal-5K, a 5,000-pair subset of Jackal. On Jackal-5K, the best overall model (Gemini 2.5 Pro) achieves only 60.3% execution accuracy averaged equally across four user request types. Performance varies significantly across user request types: (i) Long NL (86.0%), (ii) Short NL (35.7%), (iii) Semantically Similar (22.7%), and (iv) Semantically Exact (99.3%). By benchmarking LLMs on their ability to produce correct and executable JQL queries, Jackal exposes the limitations of current state-of-the-art LLMs and sets a new, execution-based challenge for future research in Jira enterprise data.
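Abstract (reference translation): Enterprise teams use the Jira Query Language (JQL) to search for and filter issues. However, to our knowledge, there is no open, real-world, execution-based benchmark for mapping natural language queries to JQL. We introduce Jackal, a novel large-scale text-to-JQL benchmark comprising 100,000 natural language (NL) requests paired with validated JQL queries and execution-based results on a live Jira instance with over 200,000 issues. To reflect real-world usage, each JQL query is associated with four types of user requests: (i) Long NL, (ii) Short NL, (iii) Semantically Similar, and (iv) Semantically Exact. We release Jackal, a corpus of 100,000 text-to-JQL pairs, together with an execution-based scoring toolkit and a static snapshot of the evaluated Jira instance for reproducibility. We report text-to-JQL results for 23 Large Language Models (LLMs) spanning parameter sizes and open and closed source models, across execution accuracy, exact match, and canonical exact match. In this paper, we report results on Jackal-5K, a 5,000-pair subset of Jackal. On Jackal-5K, the best overall model (Gemini 2.5 Pro) achieves only 60.3% execution accuracy, averaged equally across the four user request types. Performance varies significantly by user request type: (i) Long NL (86.0%), (ii) Short NL (35.7%), (iii) Semantically Similar (22.7%), and (iv) Semantically Exact (99.3%). By benchmarking LLMs on their ability to produce correct, executable JQL queries, Jackal exposes the limitations of current state-of-the-art LLMs and sets a new execution-based challenge for future research on Jira enterprise data.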
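
The abstract names three metrics (execution accuracy, exact match, canonical exact match) without defining them. The Python sketch below shows one plausible reading of how they differ. It is not the paper's scoring toolkit: the normalization in `canonicalize`, the `toy_run_query` executor, and the issue keys in `TOY_RESULTS` are all hypothetical stand-ins for illustration, and a real evaluation would run queries against a Jira instance.

```python
import re
from typing import Callable, Dict, Iterable, Set


def exact_match(pred: str, gold: str) -> bool:
    """String equality after trimming surrounding whitespace."""
    return pred.strip() == gold.strip()


def canonicalize(jql: str) -> str:
    """Assumed normalization (not the paper's): outside double-quoted
    literals, collapse whitespace runs and lowercase the text; quoted
    literals are kept verbatim."""
    parts = re.split(r'("[^"]*")', jql.strip())
    normalized = [
        part if part.startswith('"') else re.sub(r"\s+", " ", part).lower()
        for part in parts
    ]
    return "".join(normalized).strip()


def canonical_exact_match(pred: str, gold: str) -> bool:
    """String equality after the assumed canonicalization."""
    return canonicalize(pred) == canonicalize(gold)


def execution_match(
    pred: str, gold: str, run_query: Callable[[str], Iterable[str]]
) -> bool:
    """True when both queries return the same set of issue keys.
    A query the executor rejects counts as a miss."""
    try:
        return set(run_query(pred)) == set(run_query(gold))
    except ValueError:
        return False


# Hypothetical in-memory executor standing in for a Jira instance.
TOY_RESULTS: Dict[str, Set[str]] = {
    'project = demo and status = "In Progress"': {"DEMO-1", "DEMO-7"},
    'status = "In Progress" and project = demo': {"DEMO-1", "DEMO-7"},
    "project = demo": {"DEMO-1", "DEMO-2", "DEMO-7"},
}


def toy_run_query(jql: str) -> Set[str]:
    key = canonicalize(jql)
    if key not in TOY_RESULTS:
        raise ValueError(f"query not understood: {jql}")
    return TOY_RESULTS[key]


def score(pred: str, gold: str) -> Dict[str, bool]:
    """Evaluate one prediction under all three metrics."""
    return {
        "exact": exact_match(pred, gold),
        "canonical": canonical_exact_match(pred, gold),
        "execution": execution_match(pred, gold, toy_run_query),
    }


gold = 'project = DEMO AND status = "In Progress"'
spacing_variant = 'project = DEMO  and status = "In Progress"'
reordered = 'status = "In Progress" AND project = DEMO'
broader = "project = DEMO"

for candidate in (spacing_variant, reordered, broader):
    print(candidate, score(candidate, gold))
```

Under these assumptions, the spacing variant fails exact match but passes the other two metrics, the reordered query passes only the execution check, and the broader query fails all three. This illustrates why an execution-based metric can credit queries that string comparison would reject.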


Related papers