SAIL-VL2 Technical Report
- URL: http://arxiv.org/abs/2509.14033v2
- Date: Thu, 18 Sep 2025 15:10:25 GMT
- Title: SAIL-VL2 Technical Report
- Authors: Weijie Yin, Yongjie Ye, Fangxun Shu, Yue Liao, Zijian Kang, Hongyuan Dong, Haiyang Yu, Dingkang Yang, Jiacong Wang, Han Wang, Wenzhuo Liu, Xiao Liang, Shuicheng Yan, Chao Feng
- Abstract summary: We introduce SAIL-VL2, an open-suite vision-language foundation model (LVM) for comprehensive multimodal understanding and reasoning. SAIL-VL2 achieves state-of-the-art performance at the 2B and 8B parameter scales across diverse image and video benchmarks.
- Score: 65.45818722427506
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce SAIL-VL2, an open-suite vision-language foundation model (LVM) for comprehensive multimodal understanding and reasoning. As the successor to SAIL-VL, SAIL-VL2 achieves state-of-the-art performance at the 2B and 8B parameter scales across diverse image and video benchmarks, demonstrating strong capabilities from fine-grained perception to complex reasoning. Its effectiveness is driven by three core innovations. First, a large-scale data curation pipeline with scoring and filtering strategies enhances both quality and distribution across captioning, OCR, QA, and video data, improving training efficiency. Second, a progressive training framework begins with a powerful pre-trained vision encoder (SAIL-ViT), advances through multimodal pre-training, and culminates in a thinking-fusion SFT-RL hybrid paradigm that systematically strengthens model capabilities. Third, architectural advances extend beyond dense LLMs to efficient sparse Mixture-of-Experts (MoE) designs. With these contributions, SAIL-VL2 demonstrates competitive performance across 106 datasets and achieves state-of-the-art results on challenging reasoning benchmarks such as MMMU and MathVista. Furthermore, on the OpenCompass leaderboard, SAIL-VL2-2B ranks first among officially released open-source models under the 4B parameter scale, while serving as an efficient and extensible foundation for the open-source multimodal community.
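The abstract names two mechanisms concrete enough to sketch. First, the score-and-filter data curation step: the sketch below keeps only samples whose quality score clears a threshold. Here `score_fn` and `threshold` are hypothetical placeholders; the report's actual scoring models and cutoffs are not given in this summary.

```python
# Minimal sketch of score-based data filtering (hypothetical API).
# score_fn and threshold are illustrative assumptions, not the
# actual SAIL-VL2 curation scorers.
def filter_by_quality(samples, score_fn, threshold=0.5):
    """Keep only samples whose quality score clears the threshold."""
    return [s for s in samples if score_fn(s) >= threshold]
```

Second, the sparse Mixture-of-Experts design: below is a minimal top-k routed MoE feed-forward layer in PyTorch. Expert count, hidden sizes, and `top_k` are illustrative assumptions, not SAIL-VL2's reported configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Illustrative top-k routed MoE feed-forward block (not the paper's exact design)."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)   # token-to-expert gating
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                               # x: (tokens, d_model)
        logits = self.router(x)                         # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)  # pick top-k experts per token
        weights = F.softmax(weights, dim=-1)            # renormalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                   # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k : k + 1] * expert(x[mask])
        return out

tokens = torch.randn(16, 512)
print(SparseMoE()(tokens).shape)                        # torch.Size([16, 512])
```

Only `top_k` of the `n_experts` feed-forward blocks run per token, which is what lets a sparse MoE grow parameter count without a proportional increase in per-token compute.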
Related papers
- MindGPT-4ov: An Enhanced MLLM via a Multi-Stage Post-Training Paradigm [25.7631608456086]
MindGPT-4ov is a general post-training paradigm spanning data production, model training, and efficient deployment. It achieves state-of-the-art performance across multiple benchmarks at low cost. MindGPT-4ov also demonstrates superior user experience in vertical domain tasks.
arXiv Detail & Related papers (2025-12-02T16:04:11Z)
- Kwai Keye-VL Technical Report [80.53170317017147]
We introduce Kwai Keye-VL, a multimodal foundation model for short-video understanding. The development of Keye-VL rests on two core pillars: a massive, high-quality dataset with a strong emphasis on video, and an innovative training recipe. To validate our approach, we conduct extensive evaluations, showing that Keye-VL achieves state-of-the-art results on public video benchmarks and remains highly competitive on general image-based tasks.
arXiv Detail & Related papers (2025-07-02T17:57:28Z)
- MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across the MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z)
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models [139.19991097260115]
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.
arXiv Detail & Related papers (2025-04-14T17:59:25Z)
- Scalable Vision Language Model Training via High Quality Data Curation [10.121967684111445]
We introduce an open-source vision language model (VLM) series achieving state-of-the-art (SOTA) performance at the 2B and 8B parameter scales. Three key improvements contribute to SAIL-VL's leading performance.
arXiv Detail & Related papers (2025-01-10T13:27:04Z)
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling [191.7830199016589]
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0. InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems.
arXiv Detail & Related papers (2024-12-06T18:57:08Z)
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (MLLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)