Fugu-MT Paper Translation (Abstract): HalluScore: Large Language Model Hallucination Question Answering Benchmark

Paper overview: HalluScore: Large Language Model Hallucination Question Answering Benchmark

arxiv url: http://arxiv.org/abs/2605.17007v1
Date: Sat, 16 May 2026 14:08:15 GMT
Status: Translation complete
System update date: 2026-05-19 17:57:47.416296
Title: HalluScore: Large Language Model Hallucination Question Answering Benchmark
Title (reference translation): HalluScore: A Question Answering Benchmark for Large Language Model Hallucination
Authors: Aisha Alansari, Hamzah Luqman
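Abstract summary: HalluScore is a structured Arabic question answering benchmark designed to evaluate hallucination behavior in large language models. It contains 827 carefully curated questions for evaluating, detecting, and mitigating hallucination in LLMs. The authors conduct a comprehensive analysis of hallucination patterns across 17 Arabic, multilingual, and reasoning LLMs.
Reference score (independently computed attention score): 3.8100688074986095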
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have achieved remarkable progress in natural language generation, but remain susceptible to hallucination. In response to growing concerns about hallucinations, several benchmarks have been developed, primarily in English and Chinese. However, Arabic remains underrepresented, with limited benchmarks for LLM hallucination due to scarce annotated resources and the language's morphological complexity. Consequently, existing benchmarks do not adequately reflect the linguistic, cultural, and reasoning characteristics of Arabic. To address this gap, we introduce HalluScore, a structured Arabic question answering benchmark designed to evaluate hallucination behavior in LLMs across different levels of reasoning difficulty, various knowledge domains, historical timelines, and culturally grounded Arabic scenarios. It contains 827 carefully curated questions for evaluating, detecting, and mitigating hallucination in LLMs. The dataset was constructed through a structured pipeline involving quality assurance, filtering for clarity and factual validity, and model-driven selection to retain questions that consistently trigger hallucinations. Each question is linked to verified ground-truth evidence, answer explanations, and multi-label annotations. Using the HalluScore benchmark, we conduct a comprehensive empirical analysis of hallucination patterns across 17 Arabic, multilingual, and reasoning LLMs. Moreover, we provide high-quality human annotations identifying hallucinated, non-hallucinated, and partially hallucinated responses of all evaluated LLMs. These results suggest that hallucination in Arabic LLMs extends beyond factual inaccuracies, encompassing challenges related to cultural understanding, linguistic reasoning, and logical consistency. We release HalluScore to support future research on improving the reliability and cultural competence of LLMs in Arabic.
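Abstract (reference translation): Large language models (LLMs) have made remarkable progress in natural language generation but remain susceptible to hallucination. As concern about hallucination has grown, several benchmarks have been developed, mainly in English and Chinese. Arabic, however, is still underrepresented: benchmarks for LLM hallucination are limited because annotated resources are scarce and the language is morphologically complex. Existing benchmarks therefore do not adequately reflect the linguistic, cultural, and reasoning characteristics of Arabic. To address this gap, the authors introduce HalluScore, a structured Arabic question answering benchmark designed to evaluate hallucination behavior in LLMs across different levels of reasoning difficulty, various knowledge domains, historical timelines, and culturally grounded Arabic scenarios. It contains 827 carefully curated questions for evaluating, detecting, and mitigating hallucination in LLMs. The dataset was built through a structured pipeline that includes quality assurance, filtering for clarity and factual validity, and model-driven selection to retain questions that consistently trigger hallucinations. Each question is linked to verified ground-truth evidence, answer explanations, and multi-label annotations. Using the HalluScore benchmark, the authors comprehensively analyze hallucination patterns across 17 Arabic, multilingual, and reasoning LLMs. They also provide high-quality human annotations that identify hallucinated, non-hallucinated, and partially hallucinated responses from all evaluated LLMs. The results suggest that hallucination in Arabic LLMs extends beyond factual inaccuracy to include challenges in cultural understanding, linguistic reasoning, and logical consistency. HalluScore is released to support future research on improving the reliability and cultural competence of LLMs in Arabic.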
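
The abstract describes three-way human annotations (hallucinated, non-hallucinated, partially hallucinated) for each evaluated model's responses. As a minimal sketch of how such labels might be tallied per model, the snippet below computes label shares. The record layout, field names, label strings, and model names are all hypothetical and are not taken from the HalluScore release; the toy data is invented for illustration and does not represent results from the paper.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical label strings mirroring the three categories named in the abstract.
LABELS = ("hallucinated", "partially_hallucinated", "non_hallucinated")


@dataclass(frozen=True)
class AnnotatedResponse:
    model: str        # hypothetical model identifier
    question_id: int  # hypothetical question index
    label: str        # one of LABELS


def label_shares(responses):
    """Return, for each model, the fraction of its responses carrying each label."""
    counts = {}
    for r in responses:
        if r.label not in LABELS:
            raise ValueError(f"unknown label: {r.label!r}")
        counts.setdefault(r.model, Counter())[r.label] += 1

    shares = {}
    for model, c in counts.items():
        total = sum(c.values())
        # Counter returns 0 for labels a model never received.
        shares[model] = {label: c[label] / total for label in LABELS}
    return shares


# Toy annotations, invented purely for illustration.
toy = [
    AnnotatedResponse("model_a", 1, "hallucinated"),
    AnnotatedResponse("model_a", 2, "non_hallucinated"),
    AnnotatedResponse("model_a", 3, "partially_hallucinated"),
    AnnotatedResponse("model_a", 4, "hallucinated"),
    AnnotatedResponse("model_b", 1, "non_hallucinated"),
    AnnotatedResponse("model_b", 2, "non_hallucinated"),
]
toy_shares = label_shares(toy)
```

Keeping the partially hallucinated category separate, rather than folding it into either of the other two, preserves the distinction the abstract draws; how the paper itself aggregates these labels is not stated here.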

Related papers