Fugu-MT Paper Translation (Abstract): FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets

Paper abstract: FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets

arxiv url: http://arxiv.org/abs/2307.10928v1
Date: Thu, 20 Jul 2023 14:56:35 GMT
Status: Translation complete
Last updated in system: 2023-07-21 12:28:29.217236
Title: FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
Title (reference translation): FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
Authors: Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, Minjoon Seo
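Abstract summary: FLASK is a fine-grained evaluation protocol that decomposes coarse-level scoring into an instance-wise skill set level. Specifically, it defines 12 fine-grained skills that LLMs need in order to follow open-ended user instructions. FLASK provides a holistic view with a comprehensive analysis of model performance by skill, domain, and difficulty.
Reference score (independently calculated attention score): 39.83660394323222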
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluation of Large Language Models (LLMs) is challenging because aligning to human values requires the composition of multiple skills and the required set of skills varies depending on the instruction. Recent studies have evaluated the performance of LLMs in two ways, (1) automatic evaluation on several independent benchmarks and (2) human or machine-based evaluation giving an overall score to the response. However, both settings are coarse-grained evaluations, not considering the nature of user instructions that require instance-wise skill composition, which limits the interpretation of the true capabilities of LLMs. In this paper, we introduce FLASK (Fine-grained Language Model Evaluation based on Alignment SKill Sets), a fine-grained evaluation protocol that can be used for both model-based and human-based evaluation which decomposes coarse-level scoring to an instance-wise skill set-level. Specifically, we define 12 fine-grained skills needed for LLMs to follow open-ended user instructions and construct an evaluation set by allocating a set of skills for each instance. Additionally, by annotating the target domains and difficulty level for each instance, FLASK provides a holistic view with a comprehensive analysis of a model's performance depending on skill, domain, and difficulty. Through using FLASK, we compare multiple open-sourced and proprietary LLMs and observe highly-correlated findings between model-based and human-based evaluations. FLASK enables developers to more accurately measure the model performance and how it can be improved by analyzing factors that make LLMs proficient in particular skills. For practitioners, FLASK can be used to recommend suitable models for particular situations through comprehensive comparison among various LLMs. We release the evaluation data and code implementation at https://github.com/kaistAI/FLASK.
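Abstract (reference translation): Evaluating large language models (LLMs) is difficult because aligning to human values requires composing multiple skills, and the required skill set varies with the instruction. Recent studies evaluate LLM performance in two ways: (1) automatic evaluation on several independent benchmarks and (2) human or machine-based evaluation that assigns an overall score to the response. However, both settings are coarse-grained and do not consider the nature of user instructions, which require instance-wise skill composition; this limits the interpretation of the true capabilities of LLMs. This paper proposes FLASK (Fine-grained Language Model Evaluation based on Alignment SKill Sets), a fine-grained evaluation protocol, applicable to both model-based and human-based evaluation, that decomposes coarse-level scoring into an instance-wise skill set level. Specifically, we define 12 fine-grained skills that LLMs need in order to follow open-ended user instructions, and construct an evaluation set by assigning a skill set to each instance. In addition, by annotating the target domain and difficulty level of each instance, FLASK provides a holistic view with a comprehensive analysis of model performance by skill, domain, and difficulty. Using FLASK, we compare multiple open-source and proprietary LLMs and observe highly correlated findings between model-based and human-based evaluations. FLASK lets developers measure model performance more accurately and see how it can be improved by analyzing the factors that make LLMs proficient in particular skills. For practitioners, FLASK can be used to recommend models suited to particular situations through comprehensive comparison among various LLMs. We release the evaluation data and code implementation at https://github.com/kaistAI/FLASK.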
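
Illustrative sketch (not from the paper or its repository): the abstract describes scoring each instance on its own set of skills and then analyzing performance by skill, domain, and difficulty. The Python snippet below shows one way such an aggregation could look. The record layout, skill names, labels, score scale, and the `aggregate` helper are all assumptions made for illustration; they are not the 12 skills or the data format defined by FLASK.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: each evaluated instance carries its own skill set,
# a domain label, a difficulty label, and one score per annotated skill.
# Skill names, labels, and the score scale are illustrative assumptions.
records = [
    {"domain": "math", "difficulty": "hard",
     "skill_scores": {"logical_correctness": 2, "conciseness": 4}},
    {"domain": "math", "difficulty": "easy",
     "skill_scores": {"logical_correctness": 5}},
    {"domain": "history", "difficulty": "easy",
     "skill_scores": {"factuality": 3, "conciseness": 5}},
]


def aggregate(records, key):
    """Average per-skill scores grouped by key: "skill", "domain", or "difficulty"."""
    buckets = defaultdict(list)
    for rec in records:
        for skill, score in rec["skill_scores"].items():
            # Group by the skill itself, or by the instance-level label.
            group = skill if key == "skill" else rec[key]
            buckets[group].append(score)
    return {group: mean(scores) for group, scores in buckets.items()}


by_skill = aggregate(records, "skill")
by_domain = aggregate(records, "domain")
by_difficulty = aggregate(records, "difficulty")
```

Because every score stays attached to a skill, a domain, and a difficulty label, the same records can be sliced along any of the three axes instead of being collapsed into a single overall score.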


Related papers