Fugu-MT Paper Translation (Abstract): TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks

Paper overview: TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks

arxiv url: http://arxiv.org/abs/2310.00752v3
Date: Sat, 9 Dec 2023 22:39:53 GMT
Status: Translation complete
Last updated in system: 2023-12-12 22:18:59.808530
Title: TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks
Title (reference translation): TIGERScore: Toward building an explainable metric for all text generation tasks
Authors: Dongfu Jiang, Yishan Li, Ge Zhang, Wenhao Huang, Bill Yuchen Lin, Wenhu Chen
Abstract summary: TIGERScore follows Instruction Guidance to perform Explainable and Reference-free evaluation. The metric is based on LLaMA-2, trained on the meticulously curated instruction-tuning dataset MetricInstruct.
Reference score (independently computed attention level): 47.47263952401252
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present TIGERScore, a Trained metric that follows Instruction Guidance to perform Explainable and Reference-free evaluation over a wide spectrum of text generation tasks. Unlike other automatic evaluation methods that only provide arcane scores, TIGERScore is guided by natural language instructions to provide an error analysis that pinpoints the mistakes in the generated text. Our metric is based on LLaMA-2, trained on our meticulously curated instruction-tuning dataset MetricInstruct, which covers 6 text generation tasks and 23 text generation datasets. The dataset consists of 42K quadruples in the form of (instruction, input, system output → error analysis). We collected the 'system outputs' from a large variety of models to cover different types of errors. To quantitatively assess our metric, we evaluate its correlation with human ratings on 5 held-in datasets and 2 held-out datasets, and show that TIGERScore achieves the open-source SoTA correlation with human ratings across these datasets and almost approaches the GPT-4 evaluator. As a reference-free metric, its correlation can even surpass that of the best existing reference-based metrics. To further qualitatively assess the rationales generated by our metric, we conducted a human evaluation of the generated explanations and found that the explanations are 70.8% accurate. Through these experimental results, we believe TIGERScore demonstrates the possibility of building universal explainable metrics to evaluate any text generation task.
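Abstract (reference translation): This paper presents TIGERScore, a Trained metric that follows Instruction Guidance to perform Explainable and Reference-free evaluation across a wide range of text generation tasks. Unlike other automatic evaluation methods that provide only arcane scores, TIGERScore is guided by natural language instructions and provides an error analysis that pinpoints the mistakes in the generated text. The metric is based on LLaMA-2 and is trained on MetricInstruct, an instruction-tuning dataset covering 6 text generation tasks and 23 text generation datasets. The dataset consists of 42K quadruples of the form (instruction, input, system output → error analysis). To cover different types of errors, the 'system outputs' were collected from a wide variety of models. To assess the metric quantitatively, its correlation with human ratings is evaluated on 5 held-in datasets and 2 held-out datasets, showing that TIGERScore achieves the open-source SoTA correlation with human ratings on these datasets and comes close to the GPT-4 evaluator. As a reference-free metric, its correlation can even exceed that of the best existing reference-based metrics. To assess the rationales produced by the metric qualitatively, a human evaluation of the generated explanations was conducted, which found the explanations to be 70.8% accurate. These experimental results suggest that TIGERScore demonstrates the possibility of building universal explainable metrics that can evaluate any text generation task.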

Related papers