Fugu-MT Paper Translation (Abstract): Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith

Paper summary: Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith

arxiv url: http://arxiv.org/abs/2603.23972v1
Date: Wed, 25 Mar 2026 06:09:42 GMT
Status: Translation complete
Last updated in system: 2026-03-26 21:06:11.155112
Title: Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith
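Title (reference translation): Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of the Quran and Hadith
Authors: Somaya Eltanbouly, Samer Rashwani
Abstract summary: We develop a retrieval-augmented generation framework grounded in diachronic lexicographic knowledge. Unlike prior RAG systems that rely on general-purpose corpora, our approach retrieves evidence from the Doha Historical Dictionary.
Reference score (independently computed attention score): 0.0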
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have achieved remarkable progress in many language tasks, yet they continue to struggle with complex historical and religious Arabic texts such as the Quran and Hadith. To address this limitation, we develop a retrieval-augmented generation (RAG) framework grounded in diachronic lexicographic knowledge. Unlike prior RAG systems that rely on general-purpose corpora, our approach retrieves evidence from the Doha Historical Dictionary of Arabic (DHDA), a large-scale resource documenting the historical development of Arabic vocabulary. The proposed pipeline combines hybrid retrieval with an intent-based routing mechanism to provide LLMs with precise, contextually relevant historical information. Our experiments show that this approach improves the accuracy of Arabic-native LLMs, including Fanar and ALLaM, to over 85%, substantially reducing the performance gap with Gemini, a proprietary large-scale model. Gemini also serves as an LLM-as-a-judge system for automatic evaluation in our experiments. The automated judgments were verified through human evaluation, demonstrating high agreement (kappa = 0.87). An error analysis further highlights key linguistic challenges, including diacritics and compound expressions. These findings demonstrate the value of integrating diachronic lexicographic resources into retrieval-augmented generation frameworks to enhance Arabic language understanding, particularly for historical and religious texts. The code and resources are publicly available at: https://github.com/somayaeltanbouly/Doha-Dictionary-RAG.
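Abstract (reference translation): Large language models (LLMs) have made remarkable progress on many language tasks, but they still struggle with complex historical and religious Arabic texts such as the Quran and Hadith. To address this limitation, we developed a retrieval-augmented generation (RAG) framework grounded in diachronic lexicographic knowledge. Unlike prior RAG systems that rely on general-purpose corpora, our approach retrieves evidence from the Doha Historical Dictionary of Arabic (DHDA), a large-scale resource that documents the historical development of Arabic vocabulary. The proposed pipeline combines hybrid retrieval with an intent-based routing mechanism to give LLMs precise, contextually relevant historical information. Experiments show that the approach raises the accuracy of Arabic-native LLMs such as Fanar and ALLaM to over 85%, substantially narrowing the performance gap with Gemini, a proprietary large-scale model. Gemini also serves as the LLM-as-a-judge system for automatic evaluation in our experiments. The automated judgments were verified by human evaluation and showed high agreement (kappa = 0.87). An error analysis further highlights key linguistic challenges, including diacritics and compound expressions. These findings show the value of integrating diachronic lexicographic resources into retrieval-augmented generation frameworks to improve Arabic language understanding, particularly for historical and religious texts. The code and resources are publicly available at https://github.com/somayaeltanbouly/Doha-Dictionary-RAG.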
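The abstract reports agreement between the LLM judge and human evaluators as kappa = 0.87 but does not say which kappa statistic was used. As a minimal sketch, assuming Cohen's kappa for two raters, the computation could look like the following. The function name `cohen_kappa` and the verdict lists are hypothetical illustrations, not code or data from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items.

    Assumption: the paper's "kappa" is Cohen's kappa; the abstract
    does not specify the variant.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical verdicts for six answers (illustrative only).
judge = ["correct", "correct", "incorrect", "correct", "incorrect", "correct"]
human = ["correct", "correct", "incorrect", "incorrect", "incorrect", "correct"]
kappa = cohen_kappa(judge, human)
print(round(kappa, 3))  # prints 0.667
```

Here the raters agree on 5 of 6 items (observed agreement 0.833) while chance agreement is 0.5, so kappa is about 0.667. A value of 0.87, as reported in the abstract, would indicate agreement much closer to perfect after correcting for chance.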


Related papers