Fugu-MT paper translation (abstract): Not All That Is Fluent Is Factual: Investigating Hallucinations of Large Language Models in Academic Writing

Paper overview: Not All That Is Fluent Is Factual: Investigating Hallucinations of Large Language Models in Academic Writing

arxiv url: http://arxiv.org/abs/2605.04171v1
Date: Tue, 05 May 2026 18:08:15 GMT
Status: Translation complete
System update date: 2026-05-07 18:41:07.474933
Title: Not All That Is Fluent Is Factual: Investigating Hallucinations of Large Language Models in Academic Writing
Title (reference translation): Exploring Hallucinations of Large Language Models in Academic Papers
Authors: Humam Khan, Md Tabrez Nafis, Shahab Saquib Sohail, Aqeel Khalique, Rehan Hasan Khan
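Abstract summary: Large language models (LLMs) show extraordinary abilities but are prone to hallucination. This study examines four LLMs (ChatGPT, Grok, Gemini, and Copilot) for hallucinations, specifically in academic writing.
Reference score (independently computed attention level): 0.6783367407525908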
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large language models (LLMs) show extraordinary abilities, but they are still prone to hallucinations, especially when used to generate academic content. We investigated four popular LLMs (ChatGPT, Grok, Gemini, and Copilot) for hallucinations specifically in academic writing. We designed 80 prompts across four categories: reference generation, factual explanation, abstract generation, and writing improvement. We evaluated the models using a 0-5 rubric score that checks factual accuracy, reference validity, coherence, style consistency, and academic tone. We introduced a novel weighted metric, the Hallucination Index (HI), to measure hallucination in the responses generated by the models. Some of the most widely used evaluation metrics often fail to catch errors that alter sentiment in machine-translated text. We found that Grok and Copilot perform better on reference generation tasks but often struggle with abstract or stylistic prompts, with HI values of 0.67 and 0.70, respectively. Gemini and ChatGPT, by contrast, show stronger tone control but are weaker on factual tasks and carry a higher hallucination risk, with HI scores of 0.53 and 0.57, respectively. Our study found that hallucination behavior depends not only on model architecture but also on the type of task and the prompting conditions provided. We propose that our work opens new research dimensions for future researchers.
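Abstract (reference translation): Large language models (LLMs) show remarkable abilities but are still prone to hallucination. This study examines four LLMs (ChatGPT, Grok, Gemini, and Copilot) for hallucinations, specifically in academic writing. We designed 80 prompts spanning four categories: reference generation, factual explanation, abstract generation, and writing improvement. We evaluated factual accuracy, reference validity, coherence, style consistency, and academic tone using a 0-5 rubric score. The Hallucination Index (HI) was introduced to measure hallucination in the responses generated by the models. Some of the most widely used evaluation metrics often fail to check for errors that alter the sentiment of machine-translated text. We found that Grok and Copilot do better on reference generation tasks but often struggle with abstract or stylistic prompts, with HI values of 0.67 and 0.70, respectively. Gemini and ChatGPT, on the other hand, work well with stronger tone control but are weaker on factual tasks and carry a higher hallucination risk, with HI scores of 0.53 and 0.57. This study found that hallucination behavior depends not only on model architecture but also on the type of task and the prompting conditions provided. We propose that this work opens new research areas for future researchers.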
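
The abstract describes a weighted metric built on 0-5 rubric scores for five criteria, but it does not give the HI formula, the weights, or whether a higher value means more or less hallucination. The following is only a minimal sketch of how a weighted index over such a rubric could be computed. The function name `weighted_rubric_index`, the weights in `ASSUMED_WEIGHTS`, and the normalization to the range 0 to 1 are all assumptions for illustration, not the paper's definition of HI.

```python
# Hypothetical illustration only: the weights and the formula are assumptions,
# NOT the Hallucination Index (HI) as defined in the paper.

# The five rubric criteria named in the abstract.
CRITERIA = (
    "factual_accuracy",
    "reference_validity",
    "coherence",
    "style_consistency",
    "academic_tone",
)

# Assumed weights (not from the paper), chosen only to show the mechanics.
ASSUMED_WEIGHTS = {
    "factual_accuracy": 0.30,
    "reference_validity": 0.30,
    "coherence": 0.15,
    "style_consistency": 0.15,
    "academic_tone": 0.10,
}

MAX_SCORE = 5  # upper bound of the 0-5 rubric


def weighted_rubric_index(scores, weights=ASSUMED_WEIGHTS):
    """Return the weighted mean of 0-5 rubric scores, scaled to [0, 1]."""
    missing = set(CRITERIA) - set(scores)
    if missing:
        raise ValueError(f"missing rubric scores: {sorted(missing)}")
    for name in CRITERIA:
        if not 0 <= scores[name] <= MAX_SCORE:
            raise ValueError(f"{name} must lie between 0 and {MAX_SCORE}")

    total_weight = sum(weights[name] for name in CRITERIA)
    weighted_sum = sum(weights[name] * scores[name] for name in CRITERIA)
    # Dividing by the maximum attainable weighted sum maps the result to [0, 1].
    return weighted_sum / (MAX_SCORE * total_weight)


if __name__ == "__main__":
    # Made-up rubric scores for a single response, for demonstration only.
    example = {
        "factual_accuracy": 3,
        "reference_validity": 2,
        "coherence": 4,
        "style_consistency": 4,
        "academic_tone": 5,
    }
    print(round(weighted_rubric_index(example), 3))
```

Under this sketch, a per-model value would be the average of the index over all 80 prompts. How the paper actually aggregates scores is not stated in the abstract.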

List of related papers