Fugu-MT Paper Translation (Abstract): Fanar 2.0: Arabic Generative AI Stack

Paper abstract: Fanar 2.0: Arabic Generative AI Stack

arxiv url: http://arxiv.org/abs/2603.16397v1
Date: Tue, 17 Mar 2026 11:35:21 GMT
Status: Translation complete
Last updated in system: 2026-03-18 17:42:07.24452
Title: Fanar 2.0: Arabic Generative AI Stack
Title (reference translation): Fanar 2.0: Arabic Generative AI Stack
Authors: FANAR TEAM, Ummar Abbas, Mohammad Shahmeer Ahmad, Minhaj Ahmad, Abdulaziz Al-Homaid, Anas Al-Nuaimi, Enes Altinisik, Ehsaneddin Asgari, Sanjay Chawla, Shammur Chowdhury, Fahim Dalvi, Kareem Darwish, Nadir Durrani, Mohamed Elfeky, Ahmed Elmagarmid, Mohamed Eltabakh, Asim Ersoy, Masoomali Fatehkia, Mohammed Qusay Hashim, Majd Hawasly, Mohamed Hefeeda, Mus'ab Husaini, Keivin Isufaj, Soon-Gyo Jung, Houssam Lachemat, Ji Kim Lucas, Abubakr Mohamed, Tasnim Mohiuddin, Basel Mousi, Hamdy Mubarak, Ahmad Musleh, Mourad Ouzzani, Amin Sadeghi, Husrev Taha Sencar, Mohammed Shinoy, Omar Sinan, Yifan Zhang
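Abstract summary: Fanar 2.0 is the second generation of Qatar's Arabic-centric Generative AI platform. The effort ran on 256 NVIDIA H100 GPUs, and Arabic accounts for only ~0.5% of web data despite having 400 million native speakers. Fanar 2.0 adopts a disciplined strategy of data quality over quantity, targeted continual pre-training, and model merging to achieve substantial gains.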
Reference score (independently computed attention score): 25.9479146243898
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present Fanar 2.0, the second generation of Qatar's Arabic-centric Generative AI platform. Sovereignty is a first-class design principle: every component, from data pipelines to deployment infrastructure, was designed and operated entirely at QCRI, Hamad Bin Khalifa University. Fanar 2.0 is a story of resource-constrained excellence: the effort ran on 256 NVIDIA H100 GPUs, with Arabic having only ~0.5% of web data despite 400 million native speakers. Fanar 2.0 adopts a disciplined strategy of data quality over quantity, targeted continual pre-training, and model merging to achieve substantial gains within these constraints. At the core is Fanar-27B, continually pre-trained from a Gemma-3-27B backbone on a curated corpus of 120 billion high-quality tokens across three data recipes. Despite using 8x fewer pre-training tokens than Fanar 1.0, it delivers substantial benchmark improvements: Arabic knowledge (+9.1 pts), language (+7.3 pts), dialects (+3.5 pts), and English capability (+7.6 pts). Beyond the core LLM, Fanar 2.0 introduces a rich stack of new capabilities. FanarGuard is a state-of-the-art 4B bilingual moderation filter for Arabic safety and cultural alignment. The speech family Aura gains a long-form ASR model for hours-long audio. Oryx vision family adds Arabic-aware image and video understanding alongside culturally grounded image generation. An agentic tool-calling framework enables multi-step workflows. Fanar-Sadiq utilizes a multi-agent architecture for Islamic content. Fanar-Diwan provides classical Arabic poetry generation. FanarShaheen delivers LLM-powered bilingual translation. A redesigned multi-layer orchestrator coordinates all components through intent-aware routing and defense-in-depth safety validation. Taken together, Fanar 2.0 demonstrates that sovereign, resource-constrained AI development can produce systems competitive with those built at far greater scale.
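Abstract (reference translation): We present Fanar 2.0, the second generation of Qatar's Arabic-centric Generative AI platform. Sovereignty is a first-class design principle: every component, from data pipelines to deployment infrastructure, was designed and operated entirely at QCRI, Hamad Bin Khalifa University. Fanar 2.0 is a story of resource-constrained excellence: the effort ran on 256 NVIDIA H100 GPUs, and Arabic accounts for only ~0.5% of web data despite having 400 million native speakers. Fanar 2.0 adopts a disciplined strategy of data quality over quantity, targeted continual pre-training, and model merging to achieve substantial gains within these constraints. At the core is Fanar-27B, continually pre-trained from a Gemma-3-27B backbone on a curated corpus of 120 billion high-quality tokens across three data recipes. Despite using 8x fewer pre-training tokens than Fanar 1.0, it delivers substantial benchmark improvements: Arabic knowledge (+9.1 pts), language (+7.3 pts), dialects (+3.5 pts), and English capability (+7.6 pts). Beyond the core LLM, Fanar 2.0 introduces a rich stack of new capabilities. FanarGuard is a state-of-the-art 4B bilingual moderation filter for Arabic safety and cultural alignment. The speech family Aura gains a long-form ASR model for hours-long audio. The Oryx vision family adds Arabic-aware image and video understanding alongside culturally grounded image generation. An agentic tool-calling framework enables multi-step workflows. Fanar-Sadiq uses a multi-agent architecture for Islamic content. Fanar-Diwan provides classical Arabic poetry generation. FanarShaheen delivers LLM-powered bilingual translation. A redesigned multi-layer orchestrator coordinates all components through intent-aware routing and defense-in-depth safety validation. Taken together, Fanar 2.0 demonstrates that sovereign, resource-constrained AI development can produce systems competitive with those built at far greater scale.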


Related papers