Fugu-MT Paper Translation (Abstract): Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs

Paper overview: Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs

arxiv url: http://arxiv.org/abs/2605.07731v1
Date: Fri, 08 May 2026 13:36:24 GMT
Status: Translation complete
Last updated in system: 2026-05-11 19:43:39.075168
Title: Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs
Title (reference translation): A Comparison of EngGPT2-16B-A3B
Authors: Andrea Sassella, Andrea Chizzola, Tommaso Bianchi, Luca Alessandrelli, Mark James Carman
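Abstract summary: This report benchmarks the performance of ENGINEERING Ingegneria Informatica S.p.A.'s EngGPT2MoE-16B-A3B LLM, a 16B-parameter Mixture of Experts (MoE) model with 3B active parameters.
Reference score (independently calculated attention score): 1.4870079511598593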
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This report benchmarks the performance of ENGINEERING Ingegneria Informatica S.p.A.'s EngGPT2MoE-16B-A3B LLM, a 16B-parameter Mixture of Experts (MoE) model with 3B active parameters. Performance is investigated across a wide variety of representative benchmarks and compared against comparably sized open-source MoE and dense models. In comparison with popular Italian models, namely FastwebMIIA-7B, Minerva-7B, Velvet-14B, and LLaMAntino-3-ANITA-8B, EngGPT2MoE-16B-A3B performs as well or better on international benchmarks: ARC-Challenge, GSM8K, AIME24, AIME25, MMLU, and HumanEval (HE). It achieves the best performance for the longest context setting (32k) of the RULER benchmark. On the Italian benchmark dataset ITALIC, the model performs as well as or better than the other models except for Velvet-14B, which outperforms it. Compared with popular MoE models of comparable size, the new model reports higher values than DeepSeek-MoE-16B-Chat on all considered benchmarks. It has higher values than Moonlight-16B-A3B on HE, MMLU, AIME24, AIME25, GSM8K, and the 32k RULER setting, but lower values on BFCL and some ARC and ITALIC settings. Finally, it has lower values than GPT-OSS-20B on most benchmarks, including HE, MMLU, AIME24, AIME25, GSM8K, ARC, BFCL, and the 32k RULER setting. When compared with popular dense models, EngGPT2MoE-16B-A3B reports higher values on AIME24 and AIME25 than Llama-3.1-8B-Instruct, Gemma-3-12b-it, and Ministral-3-8B-Instruct-2512-BF16, but lower values on ITALIC, BFCL, and RULER with a 32k context. When performance is aggregated across all benchmark metrics, EngGPT2MoE-16B-A3B shows higher performance than the Italian models under evaluation while achieving lower results than some of the most performant international models, in particular GPT-5 nano and Qwen3-8B. Taken together, our findings indicate that the new model is a step forward for native Italian Large Language Models.
