Fugu-MT Paper Translation (Abstract): UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding

Paper summary: UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding

arxiv url: http://arxiv.org/abs/2606.07167v1
Date: Fri, 05 Jun 2026 11:35:27 GMT
Status: Translation complete
Last updated in system: 2026-06-08 14:33:29.710517
Title: UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding
Authors: Ahmer Tabassum, Sarfraz Ahmad, Hasan Iqbal, Owais Aijaz, Momina Ahsan, Preslav Nakov
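Abstract summary: Urdu, spoken by more than 230 million people, lacks a broad MMLU-style benchmark built from native educational sources. UrduMMLU covers both standard academic subjects and Urdu- and region-specific content. Gemini-3.5-Flash reaches 90.20% and 90.34% accuracy, while no other model exceeds 85%.
Reference score (independently computed attention metric): 39.467132290959825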
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Meaningful multilingual evaluation must test models in the target language and educational context. Urdu, spoken by more than 230 million people, lacks a broad MMLU-style benchmark built from native educational sources. We introduce UrduMMLU, a benchmark of 26,431 Urdu MCQs across 26 subjects and five domains, collected from native Urdu MCQ banks and public examination PDFs. Unlike translation-based resources, UrduMMLU covers both standard academic subjects and Urdu- and region-specific content. We label the exam-derived portion through dual human annotation with strict consensus filtering. We evaluate 30 LLMs under English and Urdu prompts, yielding 60 zero-shot evaluations, and further evaluate four open-source LLMs under multiple few-shot settings across both prompt languages. Gemini-3.5-Flash performs best, reaching 90.20% and 90.34% accuracy, while no other model exceeds 85%. The strongest open-source model trails by 7.79 and 8.92 points, and many models lose 25 to 40 points on Urdu-centered Humanities subjects compared with STEM. Few-shot prompting yields only modest gains. UrduMMLU shows that Urdu knowledge remains uneven in current LLMs, especially for regionally grounded content.
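The labeling step and the accuracy metric mentioned in the abstract can be illustrated with a minimal sketch. It assumes that "strict consensus filtering" means keeping an item only when both annotators choose the same answer, which the abstract does not spell out. The schema, function names, and toy records below are hypothetical and are not taken from the UrduMMLU release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedItem:
    """One exam-derived MCQ with two independent labels (hypothetical schema)."""
    question_id: str
    label_a: str  # answer chosen by annotator A, e.g. "B"
    label_b: str  # answer chosen by annotator B


def consensus_filter(items):
    """Keep only items whose two labels agree; return {question_id: gold}.

    Assumption: "strict consensus" = exact agreement between both annotators.
    """
    return {it.question_id: it.label_a for it in items if it.label_a == it.label_b}


def accuracy(gold, predictions):
    """Percentage of gold-labeled questions answered correctly.

    Questions missing from `predictions` count as wrong.
    """
    if not gold:
        raise ValueError("no gold labels to score against")
    correct = sum(1 for qid, ans in gold.items() if predictions.get(qid) == ans)
    return 100.0 * correct / len(gold)


# Toy data, invented purely for illustration.
items = [
    AnnotatedItem("q1", "A", "A"),
    AnnotatedItem("q2", "C", "D"),  # annotators disagree, so it is dropped
    AnnotatedItem("q3", "B", "B"),
]
gold = consensus_filter(items)
predictions = {"q1": "A", "q3": "C"}
score = accuracy(gold, predictions)
```

Dropping disagreements outright trades dataset size for label reliability; a real pipeline might instead send disputed items to a third annotator, but nothing in the abstract indicates which approach the authors took beyond "strict consensus".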
