Data Metabolism: An Efficient Data Design Schema For Vision Language Model
- URL: http://arxiv.org/abs/2504.12316v1
- Date: Thu, 10 Apr 2025 07:20:54 GMT
- Title: Data Metabolism: An Efficient Data Design Schema For Vision Language Model
- Authors: Jingyuan Zhang, Hongzhi Zhang, Zhou Haonan, Chenxi Sun, Xingguang Ji, Jiakang Wang, Fanheng Kong, Yahui Liu, Qi Wang, Fuzheng Zhang
- Abstract summary: Data curation plays a crucial role in training powerful Visual Language Models. We show how to process existing massive datasets and build a user-specific data flywheel. We release a VLM, named Capybara-VL, which excels in typical multimodal tasks.
- Score: 19.12576323467089
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data curation plays a crucial role in training powerful Visual Language Models (VLMs). In this work, we introduce the concept of Data Metabolism and present our data-centric framework for building VLMs throughout the development lifecycle. Starting from a standard model architecture, we discuss and provide insights into two crucial development steps: data curation and iteration, forming a closed-loop system that continuously improves model performance. We provide a detailed codebook on how to process existing massive datasets and build a user-specific data flywheel. As a demonstration, we release a VLM, named Capybara-VL, which excels in typical multimodal tasks (e.g., visual question answering, scientific reasoning, and text-rich tasks). Despite its relatively compact size, Capybara-VL surpasses several open-source models that are up to 10 times larger. Moreover, it achieves results on par with those of several leading proprietary models, demonstrating its remarkable competitiveness. These results highlight the power of our data-centric framework and the potential of training smaller and more efficient VLMs.
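The abstract frames data curation and iteration as a closed loop (a "data flywheel") but does not include code here. The sketch below is only a minimal illustration of what such a loop could look like; every name in it (DataPool, curate, flywheel, the quality-score heuristic, and the injected train_step / mine_feedback callables) is a hypothetical placeholder and is not taken from the paper's released codebook or from Capybara-VL's training code.

```python
# Minimal sketch of a curation-and-iteration "data flywheel" loop.
# All names and heuristics are hypothetical placeholders, not the authors' code.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class DataPool:
    """Growing pool of curated training samples, deduplicated by text."""
    samples: List[Dict] = field(default_factory=list)
    _seen: Set[str] = field(default_factory=set)

    def add(self, new_samples: List[Dict]) -> None:
        for s in new_samples:
            if s["text"] not in self._seen:
                self._seen.add(s["text"])
                self.samples.append(s)


def curate(raw: List[Dict], min_quality: float = 0.8) -> List[Dict]:
    """Toy quality filter standing in for a full curation codebook."""
    return [s for s in raw if s.get("quality", 0.0) >= min_quality]


def flywheel(
    train_step: Callable[[List[Dict]], object],     # fine-tunes and returns a model
    mine_feedback: Callable[[object], List[Dict]],  # turns user interactions into raw samples
    raw_data: List[Dict],
    iterations: int = 3,
):
    """Closed loop: curate existing data, train, mine user-specific data, repeat."""
    pool = DataPool()
    pool.add(curate(raw_data))                      # process existing massive datasets
    model = None
    for _ in range(iterations):
        model = train_step(pool.samples)            # train on the current curated pool
        pool.add(curate(mine_feedback(model)))      # fold user-specific data back in
    return model
```

In practice, train_step would wrap the actual VLM fine-tuning run and mine_feedback would convert logged user interactions into new raw samples, so each pass through the loop both consumes and produces training data.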
Related papers
- GIRAFFE: Design Choices for Extending the Context Length of Visual Language Models [20.976319536167512]
We aim to establish an effective solution that enhances the long-context performance of Visual Language Models.
We propose Giraffe, which is effectively extended to 128K context lengths.
We will open-source the code, data, and models.
arXiv Detail & Related papers (2024-12-17T09:57:21Z)
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling [128.24325909395188]
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0.
InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet.
We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems.
arXiv Detail & Related papers (2024-12-06T18:57:08Z)
- Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data [35.85909368345219]
We introduce Infinity-MM, a large-scale multimodal instruction dataset.
We perform unified preprocessing, resulting in a dataset with over 40 million samples that ensures diversity and accuracy.
We propose a synthetic instruction generation method based on a tagging system and open-source Vision-Language Models.
arXiv Detail & Related papers (2024-10-24T09:03:48Z)
- EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models [36.576853882830896]
We introduce EvolveDirector to train a text-to-image generation model comparable to advanced models using publicly available resources.
This framework interacts with advanced models through their public APIs to obtain text-image data pairs to train a base model.
We leverage pre-trained large vision-language models (VLMs) to guide the evolution of the base model.
arXiv Detail & Related papers (2024-10-09T17:52:28Z)
- VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks [60.5257456681402]
We study the potential for building universal embeddings capable of handling a wide range of downstream tasks.
We build a series of VLM2Vec models on SoTA VLMs such as Phi-3.5-V and LLaVA-1.6 and evaluate them on MMEB's evaluation split.
Our results show that VLM2Vec achieves an absolute average improvement of 10% to 20% over existing multimodal embedding models.
arXiv Detail & Related papers (2024-10-07T16:14:05Z)
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
- Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development [67.55944651679864]
We present a new sandbox suite tailored for integrated data-model co-development.
This sandbox provides a feedback-driven experimental platform, enabling cost-effective and guided refinement of both data and models.
arXiv Detail & Related papers (2024-07-16T14:40:07Z)
- Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants [65.47222691674074]
The Muffin framework employs pre-trained vision-language models to act as providers of visual signals.
The UniMM-Chat dataset explores the complementarities of datasets to generate 1.1M high-quality and diverse multimodal instructions.
arXiv Detail & Related papers (2023-10-01T12:35:18Z)
- WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models [69.96148259273065]
"Wan Juan" is a large-scale multimodal dataset composed of both Chinese and English data, collected from a wide range of web sources.
It was utilized in the training of InternLM, a model that demonstrated significant advantages in multi-dimensional evaluations when compared to models of a similar scale.
arXiv Detail & Related papers (2023-08-21T14:40:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.