Fugu-MT paper translation (abstract): Halluverse-M^3: A multitask multilingual benchmark for hallucination in LLMs

Paper overview: Halluverse-M^3: A multitask multilingual benchmark for hallucination in LLMs

arxiv url: http://arxiv.org/abs/2602.06920v1
Date: Fri, 06 Feb 2026 18:16:09 GMT
Status: translation complete
Last updated in system: 2026-02-09 22:18:26.519914
Title: Halluverse-M^3: A multitask multilingual benchmark for hallucination in LLMs
Authors: Samir Abdaljalil, Parichit Sharma, Erchin Serpedin, Hasan Kurban
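Abstract summary: Halluverse-M^3 is a dataset that enables systematic analysis of hallucinations across multiple languages. The dataset explicitly distinguishes between entity-level, relation-level, and sentence-level hallucinations. Halluverse-M^3 provides a realistic and challenging benchmark for studying hallucinations in multilingual, multi-task settings.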
Reference score (independently computed attention score): 2.453830698820308
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Hallucinations in large language models remain a persistent challenge, particularly in multilingual and generative settings where factual consistency is difficult to maintain. While recent models show strong performance on English-centric benchmarks, their behavior across languages, tasks, and hallucination types is not yet well understood. In this work, we introduce Halluverse-M^3, a dataset designed to enable systematic analysis of hallucinations across multiple languages, multiple generation tasks, and multiple hallucination categories. Halluverse-M^3 covers four languages (English, Arabic, Hindi, and Turkish) and supports two generation tasks: question answering and dialogue summarization. The dataset explicitly distinguishes between entity-level, relation-level, and sentence-level hallucinations. Hallucinated outputs are constructed through a controlled editing process and validated by human annotators, ensuring clear alignment between original content and hallucinated generations. Using this dataset, we evaluate a diverse set of contemporary open-source and proprietary language models on fine-grained hallucination detection. Our results show that question answering is consistently easier than dialogue summarization, while sentence-level hallucinations remain challenging even for the strongest models. Performance is highest in English and degrades in lower-resource languages, with Hindi exhibiting the lowest detection accuracy. Overall, Halluverse-M^3 provides a realistic and challenging benchmark for studying hallucinations in multilingual, multi-task settings. We release the dataset to support future research on hallucination detection and mitigation (https://huggingface.co/datasets/sabdalja/HalluVerse-M3).
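The abstract reports detection performance broken down by language, task, and hallucination type. The following is a minimal sketch of that kind of breakdown, not the authors' evaluation code. The record fields (`language`, `task`, `gold`, `pred`) and the toy rows are hypothetical illustrations; they are not the dataset's actual schema and not results from the paper.

```python
from collections import defaultdict

# Hypothetical toy records: field names and values are illustrative only,
# not the real dataset schema and not reported results.
records = [
    {"language": "en", "task": "qa", "gold": "entity", "pred": "entity"},
    {"language": "en", "task": "summarization", "gold": "sentence", "pred": "relation"},
    {"language": "hi", "task": "qa", "gold": "relation", "pred": "relation"},
    {"language": "hi", "task": "summarization", "gold": "sentence", "pred": "entity"},
]


def accuracy_by(rows, key):
    """Return {group: fraction of rows whose predicted label equals the gold label}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        total[row[key]] += 1
        correct[row[key]] += int(row["pred"] == row["gold"])
    return {group: correct[group] / total[group] for group in total}


# Slice the same predictions along the three axes named in the abstract.
by_language = accuracy_by(records, "language")
by_task = accuracy_by(records, "task")
by_type = accuracy_by(records, "gold")  # grouped by gold hallucination type
```

Grouping by the gold label gives per-category accuracy, which is one plausible way to expose a weakness on a single hallucination type that an overall average would hide.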


Related papers