Fugu-MT Paper Translation (Abstract): T2S-Metrics: Unified Library for Evaluating SPARQL Queries Generated From Natural Language

Paper overview: T2S-Metrics: Unified Library for Evaluating SPARQL Queries Generated From Natural Language

arxiv url: http://arxiv.org/abs/2604.26971v1
Date: Wed, 22 Apr 2026 08:13:41 GMT
Status: Translation complete
Last updated in system: 2026-05-01 16:31:53.682606
Title: T2S-Metrics: Unified Library for Evaluating SPARQL Queries Generated From Natural Language
Authors: Yousouf Taghzouti, Tao Jiang, Camille Juigné, Benjamin Navet, Fabien Gandon, Franck Michel, Louis-Felix Nothias
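Abstract summary: We present t2s-metrics, an open-source, unified evaluation library designed specifically for SPARQL-based evaluation. t2s-metrics provides a broad set of over 20 evaluation metrics, collected from the literature and practical evaluation needs. We argue that t2s-metrics is a necessary step toward systematic, standardized evaluation in question answering over knowledge graphs.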
Reference score (independently computed attention score): 3.0216239385077572
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The evaluation of Question Answering (QA) systems over Knowledge Graphs has historically suffered from fragmentation, inconsistency, and limited reproducibility. While significant progress has been made in semantic parsing and SPARQL query generation, evaluation methodologies remain diverse, ad hoc, and often incomparable across studies. Existing benchmarks typically focus on a small subset of metrics, such as query exact match or answer-level F1, neglecting syntactic validity, semantic faithfulness, execution correctness, results ranking quality, and computational efficiency. In this paper, we present t2s-metrics, an open-source, extensible, and unified evaluation library designed specifically for SPARQL query comparison and execution-based assessment. t2s-metrics provides a broad and extensible set of over 20 evaluation metrics, collected from the literature and practical evaluation needs, spanning lexical, syntactic, semantic, structural, execution-based and ranking-based dimensions. These include query-based metrics such as token-level Precision, Recall, and F1; BLEU, ROUGE, METEOR, and CodeBLEU variants; variable-normalized metrics (SP-BLEU, SP-F1); graph- and URI-based exact match metrics; as well as answer set-based metrics such as F1-QALD and Jaccard similarity; ranking metrics including MRR, NDCG, P@k, and Hit@k; and LLM-as-a-Judge metrics. Taking inspiration from the ir-metrics library for Information Retrieval, t2s-metrics provides a modular abstraction layer that decouples metric specification from implementation, enabling consistent, transparent, and reproducible evaluation of SPARQL-based QA systems. We argue that t2s-metrics constitutes a necessary step toward systematic, standardized evaluation in question answering over knowledge graphs and facilitates deeper diagnostic insights into system behavior beyond answer correctness.
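To make a few of the metric families named in the abstract concrete, the following is a minimal Python sketch of their textbook definitions: token-level Precision/Recall/F1 over query strings, Jaccard similarity over answer sets, and reciprocal rank, P@k, and Hit@k over ranked results. This is not the t2s-metrics API. Every function name here is hypothetical, whitespace tokenization is an assumption, and the library may tokenize, normalize variables, or handle empty inputs differently.

```python
from __future__ import annotations

from collections import Counter


def token_prf(predicted: str, reference: str) -> tuple[float, float, float]:
    """Token-level precision, recall, and F1 (assumes whitespace tokenization)."""
    pred_tokens = Counter(predicted.split())
    ref_tokens = Counter(reference.split())
    # Multiset intersection: a token repeated n times can match at most n times.
    overlap = sum((pred_tokens & ref_tokens).values())
    precision = overlap / sum(pred_tokens.values()) if pred_tokens else 0.0
    recall = overlap / sum(ref_tokens.values()) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


def jaccard(predicted_answers: set, gold_answers: set) -> float:
    """Jaccard similarity between two answer sets."""
    union = predicted_answers | gold_answers
    if not union:
        return 1.0  # assumed convention: two empty answer sets count as identical
    return len(predicted_answers & gold_answers) / len(union)


def reciprocal_rank(ranked: list, relevant: set) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list, set]]) -> float:
    """MRR: the mean of reciprocal_rank over (ranked, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


def precision_at_k(ranked: list, relevant: set, k: int) -> float:
    """Fraction of the top-k items that are relevant (denominator is k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for item in ranked[:k] if item in relevant) / k


def hit_at_k(ranked: list, relevant: set, k: int) -> float:
    """1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


if __name__ == "__main__":
    # Illustrative queries; the dbo: prefix and class names are made up for the demo.
    generated = "SELECT ?x WHERE { ?x a dbo:City }"
    gold = "SELECT ?x WHERE { ?x a dbo:Town }"
    print(token_prf(generated, gold))
    print(jaccard({"a", "b"}, {"b", "c"}))
    print(mean_reciprocal_rank([(["x", "y", "z"], {"y"}), (["x"], {"x"})]))
```

The query-level and answer-level functions measure different things: two queries that differ in a single token can return disjoint answer sets, which is one reason an evaluation that reports only one family of metrics can be misleading.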
