Fugu-MT paper translation (abstract): Cross-Platform Evaluation of Reasoning Capabilities in Foundation Models

Paper overview: Cross-Platform Evaluation of Reasoning Capabilities in Foundation Models

arxiv url: http://arxiv.org/abs/2510.26732v1
Date: Thu, 30 Oct 2025 17:31:03 GMT
Status: Translation complete
Last updated in system: 2025-10-31 16:05:09.942019
Title: Cross-Platform Evaluation of Reasoning Capabilities in Foundation Models
Title (reference translation): Cross-platform evaluation of reasoning capabilities in foundation models
Authors: J. de Curtò, I. de Zarzà, Pablo García, Jordi Cabot
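Abstract summary: We evaluate 15 foundation models on 79 problems spanning eight academic domains. We establish an infrastructure-agnostic benchmark across three computational paradigms: HPC supercomputing, cloud platforms, and university clusters. The findings challenge conventional scaling assumptions, establish training data quality as more critical than model size, and provide actionable guidelines for model selection across educational, production, and research contexts.
Reference score (independently computed attention measure): 1.2045707771719028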
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper presents a comprehensive cross-platform evaluation of reasoning capabilities in contemporary foundation models, establishing an infrastructure-agnostic benchmark across three computational paradigms: HPC supercomputing (MareNostrum 5), cloud platforms (Nebius AI Studio), and university clusters (a node with eight H200 GPUs). We evaluate 15 foundation models across 79 problems spanning eight academic domains (Physics, Mathematics, Chemistry, Economics, Biology, Statistics, Calculus, and Optimization) through three experimental phases: (1) Baseline establishment: Six models (Mixtral-8x7B, Phi-3, LLaMA 3.1-8B, Gemma-2-9b, Mistral-7B, OLMo-7B) evaluated on 19 problems using MareNostrum 5, establishing methodology and reference performance; (2) Infrastructure validation: The 19-problem benchmark repeated on university cluster (seven models including Falcon-Mamba state-space architecture) and Nebius AI Studio (nine state-of-the-art models: Hermes-4 70B/405B, LLaMA 3.1-405B/3.3-70B, Qwen3 30B/235B, DeepSeek-R1, GPT-OSS 20B/120B) to confirm infrastructure-agnostic reproducibility; (3) Extended evaluation: Full 79-problem assessment on both university cluster and Nebius platforms, probing generalization at scale across architectural diversity. The findings challenge conventional scaling assumptions, establish training data quality as more critical than model size, and provide actionable guidelines for model selection across educational, production, and research contexts. The tri-infrastructure methodology and 79-problem benchmark enable longitudinal tracking of reasoning capabilities as foundation models evolve.
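Abstract (reference translation): This paper establishes an infrastructure-agnostic benchmark across three computational paradigms: HPC supercomputing (MareNostrum 5), cloud platforms (Nebius AI Studio), and university clusters (a node with eight H200 GPUs). In the baseline phase, six models (Mixtral-8x7B, Phi-3, LLaMA 3.1-8B, Gemma-2-9b, Mistral-7B, OLMo-7B) were evaluated on MareNostrum 5 to establish the methodology and reference performance. The findings challenge conventional scaling assumptions, establish training data quality as more critical than model size, and provide actionable guidelines for model selection across educational, production, and research contexts. The tri-infrastructure methodology and 79-problem benchmark enable longitudinal tracking of reasoning capabilities as foundation models evolve.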
