Fugu-MT Paper Translation (Abstract): ReFACT: A Benchmark for Scientific Confabulation Detection with Positional Error Annotations

Paper abstract: ReFACT: A Benchmark for Scientific Confabulation Detection with Positional Error Annotations

arxiv url: http://arxiv.org/abs/2509.25868v2
Date: Wed, 01 Oct 2025 04:57:04 GMT
Status: Translation complete
Last updated in system: 2025-10-02 14:33:21.834764
Title: ReFACT: A Benchmark for Scientific Confabulation Detection with Positional Error Annotations
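Title (reference translation): ReFACT: A Benchmark for Scientific Confabulation Detection with Positional Error Annotations
Authors: Yindong Wang, Martin Preiß, Margarita Bugueño, Jan Vincent Hoffbauer, Abdullatif Ghajar, Tolga Buz, Gerard de Melo
Abstract summary: Large language models (LLMs) frequently confabulate scientific facts, severely undermining their trustworthiness. We introduce ReFACT, a benchmark of 1,001 expert-annotated question-answer pairs spanning diverse scientific domains. Each instance includes both a scientifically correct answer and a non-factual counterpart annotated with precise error spans and error types.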
Reference score (independently computed attention metric): 14.392598503431321
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) frequently confabulate scientific facts, severely undermining their trustworthiness. Addressing this challenge requires benchmarks that go beyond binary factuality and enable fine-grained evaluation. We introduce ReFACT (Reddit False And Correct Texts), a benchmark of 1,001 expert-annotated question-answer pairs spanning diverse scientific domains for the detection of scientific confabulation. Each instance includes both a scientifically correct answer and a non-factual counterpart annotated with precise error spans and error types. ReFACT enables multi-stage evaluation: (1) confabulation detection, (2) fine-grained error localization, and (3) correction. We benchmark 9 state-of-the-art LLMs, revealing limited performance (about 50 percent accuracy). Even top models such as GPT-4o fail to distinguish factual from confabulated scientific answers, raising concerns about the reliability of LLM-as-judge evaluation paradigms. Our findings highlight the need for fine-grained, human-validated benchmarks to detect and correct scientific confabulation in domain-specific contexts. The dataset is available at: https://github.com/ddz5431/ReFACT
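Abstract (reference translation): Large language models (LLMs) often confabulate scientific facts, which seriously undermines their trustworthiness. Addressing this problem requires benchmarks that go beyond binary factuality and allow fine-grained evaluation. ReFACT (Reddit False And Correct Texts) is a benchmark of 1,001 expert-annotated question-answer pairs spanning a range of scientific domains, built for detecting scientific confabulation. Each instance contains both a scientifically correct answer and a non-factual counterpart annotated with exact error spans and error types. ReFACT supports multi-stage evaluation: (1) confabulation detection, (2) fine-grained error localization, and (3) correction. We benchmark nine state-of-the-art LLMs and find limited performance (about 50% accuracy). Even top models such as GPT-4o fail to distinguish factual from confabulated scientific answers, raising concerns about the reliability of LLM-as-judge evaluation paradigms. Our results highlight the need for fine-grained, human-validated benchmarks for detecting and correcting scientific confabulation in domain-specific contexts. The dataset is available at: https://github.com/ddz5431/ReFACT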

Related papers