Fugu-MT Paper Translation (Abstract): Evaluating the Factual Consistency of Large Language Models Through News Summarization

Paper overview: Evaluating the Factual Consistency of Large Language Models Through News Summarization

arxiv url: http://arxiv.org/abs/2211.08412v2
Date: Sat, 2 Dec 2023 18:10:15 GMT
Status: Translation complete
Last updated in system: 2023-12-06 01:51:47.472180
Title: Evaluating the Factual Consistency of Large Language Models Through News Summarization
Title (reference translation): Evaluating the factual consistency of large language models through news summarization
Authors: Derek Tam, Anisha Mascarenhas, Shiyue Zhang, Sarah Kwan, Mohit Bansal, Colin Raffel
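Abstract summary: This paper proposes a new benchmark called FIB (Factual Inconsistency Benchmark) that focuses on the task of summarization. For factually consistent summaries, it uses human-written reference summaries that the authors manually verify as factually consistent. For factually inconsistent summaries, it uses summaries generated by a suite of summarization models that the authors manually annotated as factually inconsistent.
Reference score (independently computed attention score): 97.04685401448499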
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While large language models (LLMs) have proven to be effective on a large variety of tasks, they are also known to hallucinate information. To measure whether an LLM prefers factually consistent continuations of its input, we propose a new benchmark called FIB (Factual Inconsistency Benchmark) that focuses on the task of summarization. Specifically, our benchmark involves comparing the scores an LLM assigns to a factually consistent versus a factually inconsistent summary for an input news article. For factually consistent summaries, we use human-written reference summaries that we manually verify as factually consistent. To generate summaries that are factually inconsistent, we generate summaries from a suite of summarization models that we have manually annotated as factually inconsistent. A model's factual consistency is then measured according to its accuracy, i.e., the proportion of documents where it assigns a higher score to the factually consistent summary. To validate the usefulness of FIB, we evaluate 23 large language models ranging from 1B to 176B parameters from six different model families including BLOOM and OPT. We find that existing LLMs generally assign a higher score to factually consistent summaries than to factually inconsistent summaries. However, if the factually inconsistent summaries occur verbatim in the document, then LLMs assign a higher score to these factually inconsistent summaries than factually consistent summaries. We validate design choices in our benchmark including the scoring method and source of distractor summaries. Our code and benchmark data can be found at https://github.com/r-three/fib.
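Abstract (reference translation): Large language models (LLMs) have been shown to be effective across a wide range of tasks, but they are also known to hallucinate information. To measure whether an LLM prefers factually consistent continuations of its input, the authors propose a new benchmark called FIB (Factual Inconsistency Benchmark) that focuses on summarization. Specifically, for an input news article, the benchmark compares the score an LLM assigns to a factually consistent summary with the score it assigns to a factually inconsistent one. The factually consistent summaries are human-written reference summaries that were manually verified as factually consistent. The factually inconsistent summaries were generated by a suite of summarization models and manually annotated as factually inconsistent. A model's factual consistency is measured by its accuracy, that is, the proportion of documents for which it assigns the higher score to the factually consistent summary. To validate the usefulness of FIB, 23 large language models ranging from 1B to 176B parameters, drawn from six model families including BLOOM and OPT, are evaluated. Existing LLMs generally assign a higher score to factually consistent summaries than to factually inconsistent ones. However, when a factually inconsistent summary occurs verbatim in the document, LLMs assign it a higher score than the factually consistent summary. Design choices in the benchmark, including the scoring method and the source of distractor summaries, are also validated. Code and benchmark data are available at https://github.com/r-three/fib.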
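
The accuracy metric described in the abstract can be illustrated with a minimal Python sketch. This is not the code from the authors' repository: the `Example` class, `pairwise_accuracy`, and `toy_overlap_score` are hypothetical names introduced here, and the word-overlap scorer is only a stand-in for whatever score an LLM would assign to a summary. Counting ties as incorrect is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Example:
    """One benchmark item: a document with one summary of each kind."""
    document: str
    consistent_summary: str
    inconsistent_summary: str


def pairwise_accuracy(
    examples: Iterable[Example],
    score: Callable[[str, str], float],
) -> float:
    """Proportion of documents where the consistent summary scores strictly higher.

    Assumption: a tie counts as incorrect.
    """
    examples = list(examples)
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(
        score(ex.document, ex.consistent_summary)
        > score(ex.document, ex.inconsistent_summary)
        for ex in examples
    )
    return correct / len(examples)


def toy_overlap_score(document: str, summary: str) -> float:
    """Stand-in scorer (NOT an LLM score): share of summary words found in the document."""
    doc_words = set(document.lower().split())
    words = summary.lower().split()
    if not words:
        return 0.0
    return sum(w in doc_words for w in words) / len(words)


# Made-up toy data, for illustration only.
EXAMPLES = [
    Example(
        document="the council approved the budget on monday",
        consistent_summary="council approved the budget",
        inconsistent_summary="council rejected the budget",
    ),
    Example(
        document="the mayor said the bridge will not close",
        consistent_summary="bridge stays open",
        inconsistent_summary="the bridge will close",
    ),
]

print(pairwise_accuracy(EXAMPLES, toy_overlap_score))  # 0.5
```

In the second toy item the inconsistent summary reuses only words from the document, so the overlap scorer ranks it higher. To evaluate a real model, `score` would be replaced by a function returning the model's score for the summary given the document.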

Related papers