Fugu-MT Paper Translation (Abstract): NLPBench: Evaluating Large Language Models on Solving NLP Problems

Paper overview: NLPBench: Evaluating Large Language Models on Solving NLP Problems

arxiv url: http://arxiv.org/abs/2309.15630v4
Date: Thu, 19 Oct 2023 05:58:31 GMT
Status: Translation complete
System update date: 2023-10-20 19:08:40.378892
Title: NLPBench: Evaluating Large Language Models on Solving NLP Problems
Title (reference translation): NLPBench: Evaluating Large Language Models on Solving NLP Problems
Authors: Linxin Song, Jieyu Zhang, Lechao Cheng, Pengyuan Zhou, Tianyi Zhou, Irene Li
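Abstract summary: Large language models (LLMs) show promise in enhancing the capabilities of natural language processing (NLP). We present NLPBench, a unique benchmarking dataset comprising 378 college-level NLP questions spanning various NLP topics, sourced from Yale University's prior final exams. Our evaluation, centered on LLMs such as GPT-3.5/4, PaLM-2, and LLAMA-2, incorporates advanced prompting strategies such as chain-of-thought (CoT) and tree-of-thought (ToT).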
Reference score (independently computed attention metric): 41.01588131136101
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent developments in large language models (LLMs) have shown promise in enhancing the capabilities of natural language processing (NLP). Despite these successes, there remains a dearth of research dedicated to the NLP problem-solving abilities of LLMs. To fill the gap in this area, we present a unique benchmarking dataset, NLPBench, comprising 378 college-level NLP questions spanning various NLP topics sourced from Yale University's prior final exams. NLPBench includes questions with context, in which multiple sub-questions share the same public information, and diverse question types, including multiple choice, short answer, and math. Our evaluation, centered on LLMs such as GPT-3.5/4, PaLM-2, and LLAMA-2, incorporates advanced prompting strategies like the chain-of-thought (CoT) and tree-of-thought (ToT). Our study reveals that the effectiveness of the advanced prompting strategies can be inconsistent, occasionally damaging LLM performance, especially in smaller models like the LLAMA-2 (13b). Furthermore, our manual assessment illuminated specific shortcomings in LLMs' scientific problem-solving skills, with weaknesses in logical decomposition and reasoning notably affecting results.
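Abstract (reference translation): Recent progress in large language models (LLMs) is expected to improve the capabilities of natural language processing (NLP). Despite these successes, little research has addressed the NLP problem-solving abilities of LLMs. To fill this gap, we propose NLPBench, a unique benchmark dataset of 378 college-level NLP questions covering various NLP topics, drawn from past final exams at Yale University. NLPBench includes questions with context, in which multiple sub-questions share the same public information, as well as diverse question types, including multiple choice, short answer, and math. Our evaluation focuses on LLMs such as GPT-3.5/4, PaLM-2, and LLAMA-2 and incorporates advanced prompting strategies such as chain-of-thought (CoT) and tree-of-thought (ToT). The study shows that the effectiveness of advanced prompting strategies is inconsistent and can occasionally harm LLM performance, especially in smaller models such as LLAMA-2 (13b). In addition, manual assessment revealed specific shortcomings in the scientific problem-solving skills of LLMs, with weaknesses in logical decomposition and reasoning notably affecting the results.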

Related papers