Fugu-MT Paper Translation (Abstract): PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization

Paper overview: PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization

arxiv url: http://arxiv.org/abs/2306.05087v2
Date: Fri, 24 May 2024 06:37:31 GMT
Status: Translation complete
Last updated in system: 2024-05-28 00:15:41.101284
Title: PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization
Title (reference translation): PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization
Authors: Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, Wei Ye, Shikun Zhang, Yue Zhang
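Abstract summary: PandaLM is a judge large language model trained to distinguish the superior model given several LLMs. It addresses vital subjective factors such as relative conciseness, clarity, adherence to instructions, comprehensiveness, and formality. PandaLM-7B achieves 93.75% of GPT-3.5's evaluation ability and 88.28% of GPT-4's in terms of F1-score on the test dataset.
Reference score (independently computed attention score): 63.55408755562274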
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Instruction tuning large language models (LLMs) remains a challenging task, owing to the complexity of hyperparameter selection and the difficulty involved in evaluating the tuned models. To determine the optimal hyperparameters, an automatic, robust, and reliable evaluation benchmark is essential. However, establishing such a benchmark is not a trivial task due to the challenges associated with evaluation accuracy and privacy protection. In response to these challenges, we introduce a judge large language model, named PandaLM, which is trained to distinguish the superior model given several LLMs. PandaLM's focus extends beyond just the objective correctness of responses, which is the main focus of traditional evaluation datasets. It addresses vital subjective factors such as relative conciseness, clarity, adherence to instructions, comprehensiveness, and formality. To ensure the reliability of PandaLM, we collect a diverse human-annotated test dataset, where all contexts are generated by humans and labels are aligned with human preferences. Our results indicate that PandaLM-7B achieves 93.75% of GPT-3.5's evaluation ability and 88.28% of GPT-4's in terms of F1-score on our test dataset. PandaLM enables the evaluation of LLM to be fairer but with less cost, evidenced by significant improvements achieved by models tuned through PandaLM compared to their counterparts trained with default Alpaca's hyperparameters. In addition, PandaLM does not depend on API-based evaluations, thus avoiding potential data leakage. All resources of PandaLM are released at https://github.com/WeOpenML/PandaLM.
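Abstract (reference translation): Instruction tuning of large language models (LLMs) remains a difficult task because of the complexity of hyperparameter selection and the difficulty of evaluating the tuned models. Determining the optimal hyperparameters requires an automatic, robust, and reliable evaluation benchmark. Establishing such a benchmark is not simple, however, because of challenges concerning evaluation accuracy and privacy protection. In response, we introduce a judge large language model named PandaLM, which is trained to identify the superior model given several LLMs. PandaLM's focus goes beyond the objective correctness of responses, which is the main focus of traditional evaluation datasets. It addresses vital subjective factors such as relative conciseness, clarity, adherence to instructions, comprehensiveness, and formality. To ensure the reliability of PandaLM, we collect a diverse human-annotated test dataset in which all contexts are generated by humans and the labels are aligned with human preferences. PandaLM-7B achieves 93.75% of GPT-3.5's evaluation ability and 88.28% of GPT-4's in terms of F1-score on our test dataset. PandaLM makes LLM evaluation fairer and less costly, as shown by the significant improvements of models tuned with PandaLM over counterparts trained with Alpaca's default hyperparameters. In addition, PandaLM does not depend on API-based evaluation and thus avoids potential data leakage. All resources of PandaLM are released at https://github.com/WeOpenML/PandaLM.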
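The abstract reports judge quality as an F1-score on a human-annotated test set and expresses PandaLM-7B's score as a percentage of GPT-3.5's and GPT-4's. The following is a minimal sketch of how such a relative figure could be computed. It assumes a three-way label space ("1", "2", "tie") and macro averaging, neither of which is stated in the abstract, and the label lists are toy values invented for illustration, not data from the paper.

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 scores (macro averaging is an assumption)."""
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Hypothetical label space: which of two responses is better, or a tie.
LABELS = ("1", "2", "tie")

# Toy annotations, invented for illustration only.
human_labels = ["1", "2", "tie", "1", "2", "1"]
candidate_judge = ["1", "2", "tie", "2", "2", "1"]  # e.g. a local judge model
reference_judge = ["1", "2", "tie", "1", "2", "1"]  # e.g. an API-based judge

candidate_f1 = macro_f1(human_labels, candidate_judge, LABELS)
reference_f1 = macro_f1(human_labels, reference_judge, LABELS)

# "X% of the reference judge's evaluation ability" read as a ratio of F1-scores.
relative_ability = 100.0 * candidate_f1 / reference_f1
print(f"candidate F1={candidate_f1:.4f}, reference F1={reference_f1:.4f}, "
      f"relative={relative_ability:.2f}%")
```

Under this reading, a figure such as 93.75% is the candidate judge's F1-score divided by the reference judge's F1-score, with both measured against the same human labels.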

Related papers