Fugu-MT paper translation (abstract): Towards a Diagnostic and Predictive Evaluation Methodology for Sequence Labeling Tasks

Paper summary: Towards a Diagnostic and Predictive Evaluation Methodology for Sequence Labeling Tasks

arxiv url: http://arxiv.org/abs/2602.12759v1
Date: Fri, 13 Feb 2026 09:39:10 GMT
Status: Translation complete
Last updated in system: 2026-02-16 23:37:53.910922
Title: Towards a Diagnostic and Predictive Evaluation Methodology for Sequence Labeling Tasks
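Title (reference translation): A Diagnostic and Predictive Evaluation Method for Sequence Labeling Tasks
Authors: Elena Alvarez-Mellado, Julio Gonzalo
Abstract summary: This paper proposes an evaluation methodology for sequence labeling tasks based on error analysis. The method predicts model performance on external datasets with a median correlation of 0.85.
Reference score (independently computed attention score): 3.423332499970556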
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Standard evaluation in NLP typically indicates that system A is better on average than system B, but it provides little information on how to improve performance and, what is worse, it should not come as a surprise if B ends up being better than A on outside data. We propose an evaluation methodology for sequence labeling tasks grounded in error analysis that provides both quantitative and qualitative information on where systems must be improved and predicts how models will perform on a different distribution. The key is to create test sets that, contrary to common practice, do not rely on gathering large amounts of real-world in-distribution scraped data, but instead consist of a small set of handcrafted, linguistically motivated examples that exhaustively cover the range of span attributes (such as shape, length, casing, sentence position, etc.) a system may encounter in the wild. We demonstrate this methodology on a benchmark for anglicism identification in Spanish. Our methodology provides results that are diagnostic (because they help identify systematic weaknesses in performance), actionable (because they can inform which model is better suited for a given scenario) and predictive: our method predicts model performance on external datasets with a median correlation of 0.85.
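Abstract (reference translation): Standard evaluation in NLP typically shows that system A is better on average than system B, but it offers little information on how to improve performance. This paper proposes an evaluation methodology for sequence labeling tasks, based on error analysis, that provides quantitative and qualitative information on where systems need improvement and predicts how models will behave on a different distribution. The key is to create test sets that, contrary to common practice, do not rely on collecting large amounts of scraped, real-world, in-distribution data. Instead, a small number of linguistically motivated examples are handcrafted to exhaustively cover the range of span attributes (shape, length, casing, sentence position, etc.) that a system may encounter in the wild. The methodology is demonstrated on a benchmark for anglicism identification in Spanish. It yields results that are diagnostic (because they help identify systematic weaknesses in performance), actionable (because they can inform which model is better suited to a given scenario), and predictive.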
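
As a rough illustration of what "span attributes" could look like in practice, the sketch below computes three of the attributes named in the abstract (length, casing, sentence position) for a token span. The function names, the category labels, and the example sentence are assumptions made for illustration only; they are not taken from the paper, and the paper's own attribute definitions may differ.

```python
def casing(text: str) -> str:
    """Return a coarse casing category for a span's surface text (hypothetical labels)."""
    if text.isupper():
        return "upper"
    if text.islower():
        return "lower"
    if text.istitle():
        return "title"
    return "mixed"


def span_attributes(tokens: list[str], start: int, end: int) -> dict:
    """Describe the span tokens[start:end] by a few surface attributes.

    `start` is inclusive and `end` is exclusive, as in Python slicing.
    """
    span = tokens[start:end]
    # Sentence position: does the span open the sentence, close it, or sit inside it?
    if start == 0:
        position = "initial"
    elif end == len(tokens):
        position = "final"
    else:
        position = "medial"
    return {
        "length": len(span),                # number of tokens in the span
        "casing": casing(" ".join(span)),   # casing of the span text
        "position": position,
    }


# Hypothetical Spanish sentence containing the anglicism "Big Data".
tokens = ["El", "Big", "Data", "crece", "."]
attrs = span_attributes(tokens, 1, 3)
```

Grouping a system's errors by attributes of this kind is one way a small handcrafted test set could expose systematic weaknesses, for example a model that misses lowercase or sentence-initial spans.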
