STEP3-VL-10B Technical Report
- URL: http://arxiv.org/abs/2601.09668v2
- Date: Thu, 15 Jan 2026 17:06:04 GMT
- Title: STEP3-VL-10B Technical Report
- Authors: Ailin Huang, Chengyuan Yao, Chunrui Han, Fanqi Wan, Hangyu Guo, Haoran Lv, Hongyu Zhou, Jia Wang, Jian Zhou, Jianjian Sun, Jingcheng Hu, Kangheng Lin, Liang Zhao, Mitt Huang, Song Yuan, Wenwen Qu, Xiangfeng Wang, Yanlin Lai, Yingxiu Zhao, Yinmin Zhang, Yukang Shi, Yuyang Chen, Zejia Weng, Ziyang Meng, Ang Li, Aobo Kong, Bo Dong, Changyi Wan, David Wang, Di Qi, Dingming Li, En Yu, Guopeng Li, Haiquan Yin, Han Zhou, Hanshan Zhang, Haolong Yan, Hebin Zhou, Hongbo Peng, Jiaran Zhang, Jiashu Lv, Jiayi Fu, Jie Cheng, Jie Zhou, Jisheng Yin, Jingjing Xie, Jingwei Wu, Jun Zhang, Junfeng Liu, Kaijun Tan, Kaiwen Yan, Liangyu Chen, Lina Chen, Mingliang Li, Qian Zhao, Quan Sun, Shaoliang Pang, Shengjie Fan, Shijie Shang, Siyuan Zhang, Tianhao You, Wei Ji, Wuxun Xie, Xiaobo Yang, Xiaojie Hou, Xiaoran Jiao, Xiaoxiao Ren, Xiangwen Kong, Xin Huang, Xin Wu, Xing Chen, Xinran Wang, Xuelin Zhang, Yana Wei, Yang Li, Yanming Xu, Yeqing Shen, Yuang Peng, Yue Peng, Yu Zhou, Yusheng Li, Yuxiang Yang, Yuyang Zhang, Zhe Xie, Zhewei Huang, Zhenyi Lu, Zhimin Fan, Zihui Cheng, Daxin Jiang, Qi Han, Xiangyu Zhang, Yibo Zhu, Zheng Ge,
- Abstract summary: STEP3-VL-10B is a lightweight foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. We implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning. It records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision.
- Score: 115.89015065130127
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10$\times$-20$\times$ larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.
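The report does not spell out PaCoRe's decoding procedure, but the explore-and-synthesize idea described above can be illustrated with a minimal sketch. In the sketch below, `generate` is a hypothetical text-completion callable standing in for the actual STEP3-VL-10B inference API; the function name `pacore_answer`, the prompts, and the branch count are illustrative placeholders rather than the paper's implementation.

```python
# Minimal sketch of parallel test-time compute in the spirit of PaCoRe:
# sample several independent perceptual-reasoning traces for the same
# image/question pair, then run a synthesis pass that reconciles them.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def pacore_answer(
    generate: Callable[[str], str],   # hypothetical: prompt -> model completion
    image_desc: str,                  # stand-in for the visual input
    question: str,
    n_branches: int = 8,
) -> str:
    explore_prompt = (
        f"Image: {image_desc}\nQuestion: {question}\n"
        "Describe the relevant visual evidence, reason step by step, then answer."
    )
    # Explore: launch n_branches independent reasoning traces in parallel.
    with ThreadPoolExecutor(max_workers=n_branches) as pool:
        traces: List[str] = list(
            pool.map(lambda _: generate(explore_prompt), range(n_branches))
        )

    # Synthesize: a final pass reads every candidate trace and reconciles them.
    joined = "\n---\n".join(traces)
    synthesis_prompt = (
        f"Question: {question}\nCandidate analyses:\n{joined}\n"
        "Compare the candidates, resolve disagreements, and state one final answer."
    )
    return generate(synthesis_prompt)

if __name__ == "__main__":
    # Toy stub standing in for the real model so the sketch runs end to end.
    stub = lambda prompt: f"[completion for a {len(prompt)}-character prompt]"
    print(pacore_answer(stub, "a bar chart with four bars", "Which bar is tallest?"))
```

In an actual deployment the parallel branches would be sampled with temperature from the same model and the synthesis pass would consume them in one context window; a thread pool over a stubbed callable is used here only to keep the sketch self-contained.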
Related papers
- DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing [67.77471070868852]
DeepGen 1.0 is a lightweight 5B unified model for image generation and editing. It is trained on only 50M samples, surpassing the 80B HunyuanImage by 28% on WISE and the 27B Qwen-Image-Edit by 37% on UniREditBench. By open-sourcing our training code, weights, and datasets, we provide an efficient, high-performance alternative to democratize unified multimodal research.
arXiv Detail & Related papers (2026-02-12T17:44:24Z)
- Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters [169.7981969517903]
Step 3.5 Flash bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution.
arXiv Detail & Related papers (2026-02-11T07:53:51Z)
- A Pragmatic VLA Foundation Model [66.76609538850478]
We develop LingBot-VLA with around 20,000 hours of real-world data from 9 popular dual-arm robot configurations. Our model achieves clear superiority over competitors, showcasing its strong performance and broad generalizability. To advance the field of robot learning, we provide open access to the code, base model, and benchmark data.
arXiv Detail & Related papers (2026-01-26T17:08:04Z)
- Scaling Trends for Multi-Hop Contextual Reasoning in Mid-Scale Language Models [0.0]
We present a controlled study of multi-hop contextual reasoning in large language models. We show that multi-agent systems exhibit the inverse pattern, achieving up to 80% on reasoning tasks where rule-based methods fail.
arXiv Detail & Related papers (2026-01-06T20:18:55Z)
- LFM2 Technical Report [87.58431408281973]
We present LFM2, a family of Liquid Foundation Models designed for efficient on-device deployment and strong task capabilities. The LFM2 family covers 350M-8.3B parameters, including dense models (350M, 700M, 1.2B, 2.6B) and a mixture-of-experts variant (8.3B total, 1.5B active). We build multimodal and retrieval variants: LFM2-VL for vision-language tasks, LFM2-Audio for speech, and LFM2-ColBERT for retrieval.
arXiv Detail & Related papers (2025-11-28T17:56:35Z)
- Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data [55.65426108082807]
We build Uni-MoE-2.0-Omni from scratch through three core contributions. It is capable of omnimodal understanding, as well as generating images, text, and speech.
arXiv Detail & Related papers (2025-11-16T14:10:55Z)
- Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model [100.86587937568832]
Ring-1T is the first open-source, state-of-the-art thinking model at the trillion-parameter scale. It features 1 trillion total parameters and activates approximately 50 billion per token.
arXiv Detail & Related papers (2025-10-21T17:46:14Z)
- SAIL-VL2 Technical Report [65.45818722427506]
We introduce SAIL-VL2, an open-suite vision-language foundation model (LVM) for comprehensive multimodal understanding and reasoning. SAIL-VL2 achieves state-of-the-art performance at the 2B and 8B parameter scales across diverse image and video benchmarks.
arXiv Detail & Related papers (2025-09-17T14:34:02Z)
- STELAR-VISION: Self-Topology-Aware Efficient Learning for Aligned Reasoning in Vision [24.162895928364062]
We introduce STELAR-Vision, a training framework for topology-aware reasoning. At its core is TopoAug, a synthetic data pipeline that enriches training with diverse topological structures. On MATH-V and VLM-S2H, STELAR-Vision improves accuracy by 9.7% over its base model and surpasses the larger Qwen2VL-72B-Instruct by 7.3%.
arXiv Detail & Related papers (2025-08-12T07:27:50Z)
- TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training [17.157552816494427]
This paper introduces TorchTitan, an open-source, PyTorch-native distributed training system. It unifies state-of-the-art techniques, streamlining integration and reducing overhead. We evaluate TorchTitan on the Llama 3.1 family of large language models (LLMs).
arXiv Detail & Related papers (2024-10-09T03:26:11Z)