Related papers: Scalable Vision Language Model Training via High Quality Data Curation

Scalable Vision Language Model Training via High Quality Data Curation

URL: http://arxiv.org/abs/2501.05952v2
Date: Mon, 17 Feb 2025 12:04:53 GMT
Title: Scalable Vision Language Model Training via High Quality Data Curation
Authors: Hongyuan Dong, Zijian Kang, Weijie Yin, Xiao Liang, Chao Feng, Jiao Ran,
Abstract summary: We introduce an open-source vision language model (VLM) series achieving state-of-the-art (SOTA) performance in 2B and 8B parameters.<n>The following three key improvements contribute to SAILVL's leading performance.
Score: 10.121967684111445
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: In this paper, we introduce SAIL-VL (ScAlable Vision Language Model TraIning via High QuaLity Data Curation), an open-source vision language model (VLM) series achieving state-of-the-art (SOTA) performance in 2B and 8B parameters. The following three key improvements contribute to SAIL-VL's leading performance: (1) Scalable high-quality visual understanding data construction: We implement a data construction pipeline to enable hundred-million-scale high-quality recaption data annotation, and the resulted dataset SAIL-Caption is validated to be of the highest data quality compared with opensource alternatives. (2) Scalable Pretraining with High-Quality Visual Understanding Data: We scale SAIL-VL's pretraining budget up to 655B tokens and show that even a 2B VLM benefits from scaled up training data sizes, exhibiting expected data size scaling laws in visual understanding and instruction following performance. (3) Scalable SFT via data quantity and complexity scaling: We curate a high-quality SFT dataset collection which outperforms opensource alternatives in data quantity scaling effectiveness. We also demonstrate that training with progressively higher-complexity data surpasses baseline one-stage training by a large margin. SAIL-VL series models achieve the highest average score in 18 widely used VLM benchmarks in our evaluation, with the 2B model takes the top position over VLMs of comparable sizes on OpenCompass 2024 (https://rank.opencompass.org.cn/leaderboard-multimodal) demonstrating robust visual comprehension abilities. SAIL-VL series models are released at HuggingFace (https://huggingface.co/BytedanceDouyinContent).

Related papers

HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models [15.877790469608662]
We introduce an LVLM-driven data refinement pipeline to enhance the quality of image-text pair data.<n>We propose a training paradigm that extends conventional contrastive learning by incorporating negative descriptions and short tags.<n>Our approach achieves state-of-the-art performance in zero-shot classification, cross-modal retrieval, and fine-grained visual understanding tasks.
arXiv Detail & Related papers (2025-07-30T07:21:36Z)
Quality-Driven Curation of Remote Sensing Vision-Language Data via Learned Scoring Models [9.238739743596236]
We propose a novel score model trained on large-scale RS visionlanguage preference data for automated quality assessment. Our empirical results demonstrate that fine-tuning CLIP or advanced VLMs with the top 30% of data ranked by our score model achieves superior interpretation accuracy.
arXiv Detail & Related papers (2025-03-02T05:44:56Z)
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling [128.24325909395188]
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0. InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems.
arXiv Detail & Related papers (2024-12-06T18:57:08Z)
Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation [57.34255010956452]
This work revisits scaling with synthetic data and focuses on developing video-LLMs from a data-centric perspective.<n>We propose a data augmentation method called Sparrow, which synthesizes video-like samples from pure text instruction data.<n>Our proposed method achieves performance comparable to or even superior to that of baselines trained with significantly more samples.
arXiv Detail & Related papers (2024-11-29T18:59:54Z)
Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data [35.85909368345219]
We introduce Infinity-MM, a large-scale multimodal instruction dataset.<n>We perform unified preprocessing, resulting in a dataset with over 40 million samples that ensures diversity and accuracy.<n>We propose a synthetic instruction generation method based on a tagging system and open-source Vision-Language Models.
arXiv Detail & Related papers (2024-10-24T09:03:48Z)
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks [60.5257456681402]
We study the potential for building universal embeddings capable of handling a wide range of downstream tasks. We build a series of VLM2Vec models on SoTA VLMs like Phi-3.5-V, LLaVA-1.6 and evaluate them on MMEB's evaluation split. Our results show that VLM2Vec achieves an absolute average improvement of 10% to 20% over existing multimodal embedding models.
arXiv Detail & Related papers (2024-10-07T16:14:05Z)
NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks. We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline [34.518474035662905]
General capabilities of Large Language Models (LLM) highly rely on extensive pretraining datasets, treated as commercial secrets by several institutions. We open-source the details of a universally applicable data processing pipeline to validate its effectiveness and potential. BaichuanSEED demonstrates consistency and predictability throughout training and achieves comparable performance on comprehensive benchmarks.
arXiv Detail & Related papers (2024-08-27T14:08:23Z)
Concept-skill Transferability-based Data Selection for Large Vision-Language Models [56.0725292404808]
We introduce COINCIDE, an effective and scalable data selection technique for training vision-language models. We cluster the training data using internal activations from a small model, which identifies concept-skill compositions needed by a target LVLM. Experiments demonstrate that COINCIDE achieves superior performance and data selection efficiency against 8 strong baselines.
arXiv Detail & Related papers (2024-06-16T16:15:20Z)
ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models [45.040292339670096]
Large vision-language models (LVLMs) have shown premise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities. This study aims to bridge the performance gap between traditional-scale LVLMs and resource-friendly lite versions by adopting high-quality training data.
arXiv Detail & Related papers (2024-02-18T19:26:49Z)
VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons. We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z)
Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD) Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning. The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z)
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions [31.4943447481144]
We study joint and language (VL) pre-training to enable cross-modality learning and benefit plentiful downstream tasks. Our model achieves new state-of-the-art results in 10 understanding tasks and 2 more novel text-to-visual generation tasks.
arXiv Detail & Related papers (2021-11-19T17:36:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.