Fugu-MT Paper Translation (Abstract): Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models

Paper summary: Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models

arxiv url: http://arxiv.org/abs/2604.07035v1
Date: Wed, 08 Apr 2026 12:50:52 GMT
Status: Translation complete
Last updated in system: 2026-04-09 17:30:51.534406
Title: Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models
Title (reference translation): Gemma 4, Phi-4, Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models
Authors: Md Motaleb Hossen Manik, Ge Wang
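Abstract summary: Mixture-of-experts (MoE) language models are often expected to offer better quality-efficiency tradeoffs than dense models. This study presents a benchmark of seven reasoning-oriented instruction-tuned models spanning dense and MoE designs.
Reference score (independently computed attention score): 6.396911723204044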
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Mixture-of-experts (MoE) language models are often expected to offer better quality-efficiency tradeoffs than dense models because only a subset of parameters is activated per token, but the practical value of that advantage depends on end-to-end behavior under realistic inference constraints. We present a controlled empirical benchmark of seven recent reasoning-oriented instruction-tuned models spanning dense and MoE designs, namely Gemma-4-E2B, Gemma-4-E4B, Gemma-4-26B-A4B, Phi-4-mini-reasoning, Phi-4-reasoning, Qwen3-8B, and Qwen3-30B-A3B, evaluated on four benchmarks -- ARC-Challenge, GSM8K, Math Level 1-3, and TruthfulQA MC1 -- under three prompting strategies: zero-shot, chain-of-thought, and few-shot chain-of-thought. The study covers 8,400 total model-dataset-prompt evaluations and records accuracy, latency, peak GPU memory usage (VRAM), and an approximate floating-point operations (FLOPs)-per-token proxy. Across the weighted multi-task summary, Gemma-4-E4B with few-shot chain-of-thought achieved the best overall result, reaching weighted accuracy 0.675 with mean VRAM 14.9 GB, while Gemma-4-26B-A4B was close in accuracy at 0.663 but substantially more memory intensive at 48.1 GB. At the task level, Gemma models dominated ARC and Math, Phi models were strongest on TruthfulQA, and GSM8K showed the largest prompt sensitivity, including a sharp drop for Phi-4-reasoning from 0.67 under chain-of-thought to 0.11 under few-shot chain-of-thought. These results show that sparse activation alone does not guarantee the best practical operating point: observed accuracy-efficiency tradeoffs depend jointly on architecture, prompting protocol, and task composition. We release a reproducible benchmark pipeline, aggregated results, and paired statistical analyses to support deployment-oriented evaluation of reasoning LLMs under real resource constraints.
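Abstract (reference translation): Mixture-of-experts (MoE) language models are often expected to offer better quality-efficiency tradeoffs than dense models because they activate only a subset of parameters per token, but the practical value of that advantage depends on end-to-end behavior under realistic inference constraints. This study benchmarks seven recent reasoning-oriented instruction-tuned models spanning dense and MoE designs (Gemma-4-E2B, Gemma-4-E4B, Gemma-4-26B-A4B, Phi-4-mini-reasoning, Phi-4-reasoning, Qwen3-8B, and Qwen3-30B-A3B) on four benchmarks (ARC-Challenge, GSM8K, Math Level 1-3, and TruthfulQA MC1) under three prompting strategies: zero-shot, chain-of-thought, and few-shot chain-of-thought. The study covers 8,400 model-dataset-prompt evaluations and records accuracy, latency, peak GPU memory usage (VRAM), and an approximate floating-point operations (FLOPs)-per-token proxy. In the weighted multi-task summary, Gemma-4-E4B with few-shot chain-of-thought achieved the best overall result, with weighted accuracy 0.675 at mean VRAM 14.9 GB; Gemma-4-26B-A4B came close at 0.663 accuracy but required far more memory, 48.1 GB. At the task level, Gemma models dominated ARC and Math, Phi models were strongest on TruthfulQA, and GSM8K showed the largest prompt sensitivity, including a sharp drop for Phi-4-reasoning from 0.67 under chain-of-thought to 0.11 under few-shot chain-of-thought. These results show that sparse activation alone does not guarantee the best practical operating point: observed accuracy-efficiency tradeoffs depend jointly on architecture, prompting protocol, and task composition. The authors release a reproducible benchmark pipeline, aggregated results, and paired statistical analyses to support deployment-oriented evaluation of reasoning LLMs under real resource constraints.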
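The figures quoted in the abstract can be sanity-checked with a small sketch. It enumerates the model, benchmark, and prompt grid named in the abstract and compares the two reported operating points on accuracy and VRAM. The abstract does not say how the 8,400 evaluations are distributed, so the even split across grid cells is an assumption. The helper names `evaluation_grid` and `dominates` are hypothetical and are not taken from the authors' released pipeline.

```python
from itertools import product

# Model, benchmark, and prompt names as listed in the abstract.
MODELS = [
    "Gemma-4-E2B", "Gemma-4-E4B", "Gemma-4-26B-A4B",
    "Phi-4-mini-reasoning", "Phi-4-reasoning",
    "Qwen3-8B", "Qwen3-30B-A3B",
]
BENCHMARKS = ["ARC-Challenge", "GSM8K", "Math Level 1-3", "TruthfulQA MC1"]
PROMPTS = ["zero-shot", "chain-of-thought", "few-shot chain-of-thought"]

TOTAL_EVALUATIONS = 8400  # total reported in the abstract


def evaluation_grid():
    """Return every (model, benchmark, prompt) combination."""
    return list(product(MODELS, BENCHMARKS, PROMPTS))


def dominates(a, b):
    """True if point a is no worse than b on both axes and better on one.

    Points are (weighted_accuracy, mean_vram_gb): higher accuracy is
    better, lower VRAM is better.
    """
    acc_a, vram_a = a
    acc_b, vram_b = b
    no_worse = acc_a >= acc_b and vram_a <= vram_b
    strictly_better = acc_a > acc_b or vram_a < vram_b
    return no_worse and strictly_better


grid = evaluation_grid()
# ASSUMPTION: evaluations are split evenly across grid cells.
per_cell = TOTAL_EVALUATIONS / len(grid)
print(len(grid), per_cell)

# Operating points reported in the abstract.
gemma_e4b_few_shot_cot = (0.675, 14.9)
gemma_26b_a4b = (0.663, 48.1)
print(dominates(gemma_e4b_few_shot_cot, gemma_26b_a4b))
```

Under the even-split assumption, the grid has 84 cells with 100 evaluations each. On the two reported axes, the Gemma-4-E4B point dominates the Gemma-4-26B-A4B point, which is consistent with the abstract's claim that sparse activation alone does not guarantee the best practical operating point. Latency and the FLOPs-per-token proxy are not quoted in the abstract, so they are left out of the comparison.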

Related papers