Fugu-MT Paper Translation (Abstract): Evaluation of Small Language Models for Arabic Language Processing

Paper abstract: Evaluation of Small Language Models for Arabic Language Processing

arxiv url: http://arxiv.org/abs/2606.21460v1
Date: Fri, 19 Jun 2026 14:16:13 GMT
Status: Translation complete
System updated: 2026-06-25 13:20:27.11961
Title: Evaluation of Small Language Models for Arabic Language Processing
Authors: Jumana Alsubhi, Ahmed Alhusayni, Abdulrahman Gharawi, Israa Hamdine, Alshaymaa Allahim, Lamees Alhumaid, Ahmad Shabana, Rafik Madani
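Abstract summary: The study introduces a benchmark of 240 Arabic test items spanning eight domains and ten language skills. All models were evaluated under a controlled zero-shot setting using a standardized Arabic-only prompt template. Models with stronger Arabic alignment and more reliable instruction-following behavior tended to perform better across tasks.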
Reference score (site-computed attention measure): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper evaluates the performance of twelve Small Language Models (SLMs) on Arabic natural language processing tasks. The study introduces a benchmark of 240 Arabic test items distributed across eight domains and ten language skills, covering both comprehension-oriented and generation-oriented tasks. All models were evaluated under a controlled zero-shot setting using a standardized Arabic-only prompt template. Model responses were assessed through a multi-model LLM-as-a-judge framework involving GPT-4.1 Mini, Claude Haiku 4.5, and DeepSeek-Chat, with scores aggregated across judges and analyzed by task, skill, and model family. The results show that Gemma 3 (12B) achieved the highest overall score (4.548/5), followed by Aya and C4AI Command Arabic. The observed results suggest that model size alone does not explain Arabic SLM performance. Models with stronger Arabic alignment and more reliable instruction-following behavior tended to perform better across tasks. Common failure patterns among lower-performing models include prompt leakage, hallucination, language drift, incomplete generation, and weak task adherence. Overall, the benchmark provides a structured reference for evaluating compact Arabic language models and supports future work on efficient, reliable, and culturally appropriate Arabic AI systems.
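The abstract says that scores were aggregated across judges and analyzed by task, skill, and model family, but it does not state the aggregation rule. The following is a minimal sketch that assumes a plain arithmetic mean over judges, followed by a mean over skills for each model. The record layout, the `aggregate` function, and every name and score in it are hypothetical placeholders, not data or code from the paper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (model, skill, judge, score on a 1-5 scale).
# All values are invented for illustration; they are not results from the paper.
records = [
    ("model_a", "summarization", "judge_1", 4.0),
    ("model_a", "summarization", "judge_2", 5.0),
    ("model_a", "summarization", "judge_3", 3.0),
    ("model_a", "translation", "judge_1", 4.0),
    ("model_a", "translation", "judge_2", 4.0),
    ("model_a", "translation", "judge_3", 4.0),
    ("model_b", "summarization", "judge_1", 2.0),
    ("model_b", "summarization", "judge_2", 3.0),
    ("model_b", "summarization", "judge_3", 1.0),
]


def aggregate(records):
    """Average judge scores per (model, skill), then average those per model.

    Assumption: an unweighted mean at both levels. The paper may use a
    different rule.
    """
    by_item = defaultdict(list)
    for model, skill, _judge, score in records:
        by_item[(model, skill)].append(score)
    per_skill = {key: mean(scores) for key, scores in by_item.items()}

    by_model = defaultdict(list)
    for (model, _skill), score in per_skill.items():
        by_model[model].append(score)
    overall = {model: mean(scores) for model, scores in by_model.items()}
    return per_skill, overall


per_skill, overall = aggregate(records)
```

Averaging over judges first keeps any single judge from dominating a per-skill score. A weighted mean or a median would be equally plausible choices under a different evaluation protocol.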
