Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale
Abstract Overview
Intern-S1-Pro is presented as the first one-trillion-parameter scientific multimodal foundation model, built on a sparse Mixture-of-Experts (MoE) architecture expanded from Intern-S1 with 4× more experts. The model targets both general capabilities and deep expertise across chemistry, materials science, life sciences, and earth sciences, covering over 100 specialized tasks. Key architectural contributions include grouped routing for absolute expert load balancing under expert parallelism, a straight-through estimator for dense-gradient router optimization, a native-resolution vision encoder, Fourier position encoding, and a dedicated time-series module with adaptive subsampling. Continued pretraining uses 6T tokens, including approximately 270B tokens of scientific image-text data produced by a PDF-based caption pipeline. The paper also details system co-design between XTuner and LMDeploy to enable stable mixed-precision reinforcement learning at trillion scale, with techniques such as rollout router replay and targeted FP8 quantization of expert layers.
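The load-balancing property of grouped routing can be illustrated with a small sketch. This summary does not give the paper's exact formulation, so the code below assumes one plausible reading: experts are split into equal groups, one group per expert-parallel device, and every token selects the same number of experts inside each group. The names `grouped_topk_route`, `tokens_per_group`, and `k_per_group` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def grouped_topk_route(logits, num_groups, k_per_group):
    """Select k_per_group experts inside every group for each token.

    logits: array of shape (num_tokens, num_experts).
    Returns global expert indices of shape (num_tokens, num_groups * k_per_group).
    Assumption: groups are contiguous blocks of experts, one block per device.
    """
    num_tokens, num_experts = logits.shape
    if num_experts % num_groups != 0:
        raise ValueError("num_experts must be divisible by num_groups")
    group_size = num_experts // num_groups
    grouped = logits.reshape(num_tokens, num_groups, group_size)
    # Rank experts within each group only, then keep the best k_per_group.
    local = np.argsort(-grouped, axis=-1)[..., :k_per_group]
    # Convert group-local indices back to global expert indices.
    offsets = (np.arange(num_groups) * group_size)[None, :, None]
    return (local + offsets).reshape(num_tokens, -1)

def tokens_per_group(expert_ids, num_experts, num_groups):
    """Count how many token-to-expert assignments land in each group."""
    group_size = num_experts // num_groups
    return np.bincount((expert_ids // group_size).ravel(), minlength=num_groups)
```

Under this assumption, each group receives exactly `num_tokens * k_per_group` assignments regardless of the router logits, which is what makes the balance across devices absolute rather than approximate. Balance among the experts inside one group is not guaranteed by this sketch.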
Novelty
The paper's primary contribution is scaling a multimodal foundation model to one trillion parameters with explicit targeting of scientific domains, combining expert expansion with grouped routing that achieves absolute load balancing across devices under expert parallelism. It also introduces a straight-through estimator for dense gradient flow to all router embeddings during sparse Top-K selection, and a large-scale PDF-based caption pipeline that produced approximately 270B tokens of scientific image-text data for cross-modal alignment.
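The idea behind a straight-through estimator for the router can be sketched in a few lines. The estimator's exact form is not given in this summary, so the following is an assumption rather than the paper's method: the forward pass uses sparse gates (softmax over the Top-K logits only), while the backward pass differentiates as if the gates came from a dense softmax over all experts. The function names are hypothetical, and gradients are written out by hand with NumPy so the example has no framework dependency.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def softmax_vjp(p, upstream):
    """Vector-Jacobian product of softmax with output p: p * (u - <u, p>)."""
    return p * (upstream - np.dot(upstream, p))

def route(logits, k):
    """Return dense gates, sparse Top-K gates, and the selected expert indices."""
    topk = np.argsort(-logits)[:k]
    dense = softmax(logits)
    sparse = np.zeros_like(logits)
    sparse[topk] = softmax(logits[topk])  # forward-pass gates: Top-K, then softmax
    return dense, sparse, topk

def grad_sparse(logits, topk, sparse, upstream):
    """Exact gradient of the sparse gates: unselected logits receive zero."""
    grad = np.zeros_like(logits)
    grad[topk] = softmax_vjp(sparse[topk], upstream[topk])
    return grad

def grad_ste(dense, upstream):
    """Straight-through surrogate: backpropagate through the dense softmax instead."""
    return softmax_vjp(dense, upstream)
```

Here `upstream` stands for the loss gradient with respect to the gate values, which is nonzero only for selected experts because the other experts are never evaluated. With the exact sparse gradient, unselected router logits get no learning signal; with the surrogate, the dense softmax Jacobian spreads that signal to every logit, and therefore to every router embedding.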
Results
On scientific benchmarks, Intern-S1-Pro reports leading scores including 55.5 on SciReasoner (vs. 14.7 for Gemini-3-Pro), 74.8 on SmolInstruct, 72.8 on MatBench, 48.8 on Mol-Instructions, 52.5 on Biology-Instruction, and 52.8 on XLRS-Bench. On general benchmarks it remains competitive, scoring 93.1 on AIME-2025, 86.6 on MMLU-Pro, 77.4 on GAIA, 80.9 on τ²-Bench, and 93.6 on ScreenSpot V2. The dedicated time-series module substantially outperforms text-only and vision-language baselines on the reported SciTS subset, achieving an F1 of 99.5 on EAU01 and 88.3 on BIU03.
Key Points
- Intern-S1-Pro scales scientific multimodal modeling to one trillion parameters via expert expansion with grouped routing, which achieves absolute load balancing under 8-way expert parallelism. Co-design of XTuner and LMDeploy enables stable mixed-precision RL training at this scale.
- The training recipe uses 6T tokens for continued pretraining, of which approximately 270B tokens (about 4.5%) are PDF-derived scientific image-text captions produced by a dedicated pipeline that combines layout analysis, perceptual-hash deduplication, topic-based model routing, and a text quality discriminator.
- The model reports leading results on multiple scientific benchmarks (e.g., SciReasoner 55.5, SmolInstruct 74.8, MatBench 72.8) while maintaining competitive general performance. A case study on biological tasks indicates that a large generalist model trained jointly can outperform a specialized model trained on the same data.
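The perceptual-hash deduplication step in the caption pipeline can be illustrated with a minimal sketch. The summary does not say which hash the pipeline uses, so this example swaps in a simple average hash on grayscale arrays as a stand-in. The function names, the 8×8 hash size, and the `max_distance` threshold are all assumptions made for illustration.

```python
import numpy as np

def average_hash(img, hash_size=8):
    """Compute a hash_size*hash_size-bit average hash of a 2-D grayscale array."""
    h, w = img.shape
    bh, bw = h // hash_size, w // hash_size
    if bh == 0 or bw == 0:
        raise ValueError("image is smaller than the hash grid")
    # Downsample by averaging non-overlapping blocks (edges are cropped).
    cropped = img[: bh * hash_size, : bw * hash_size]
    small = cropped.reshape(hash_size, bh, hash_size, bw).mean(axis=(1, 3))
    # Each bit records whether a block is brighter than the overall mean.
    return (small > small.mean()).ravel()

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return int(np.count_nonzero(a != b))

def deduplicate(images, max_distance=4):
    """Return indices of images kept after dropping near-duplicates.

    An image is dropped if its hash is within max_distance bits of a kept image.
    """
    kept, hashes = [], []
    for idx, img in enumerate(images):
        h = average_hash(img)
        if all(hamming(h, other) > max_distance for other in hashes):
            kept.append(idx)
            hashes.append(h)
    return kept
```

Because each bit is relative to the image's own mean, a uniform brightness shift leaves the hash unchanged, so lightly re-encoded copies of the same figure collapse to one entry. The pairwise comparison is quadratic in the number of kept images; a pipeline at the scale described would need an indexed lookup, which this sketch does not attempt.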
References
- arXiv: https://arxiv.org/abs/2603.25040v1
- Fugu-MT: https://fugumt.com/fugumt/paper_check/2603.25040v1
- Hugging Face Papers: https://huggingface.co/papers/2603.25040