VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search
- URL: http://arxiv.org/abs/2503.10582v2
- Date: Sat, 15 Mar 2025 01:09:17 GMT
- Title: VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search
- Authors: Yiming Jia, Jiachen Li, Xiang Yue, Bo Li, Ping Nie, Kai Zou, Wenhu Chen
- Abstract summary: Vision-Language Models have made significant progress on many perception-focused tasks. However, their progress on reasoning-focused tasks remains limited due to the lack of high-quality and diverse training data. We propose VisualWebInstruct to create a diverse and high-quality dataset spanning multiple disciplines.
- Score: 42.560419395815146
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language Models have made significant progress on many perception-focused tasks. However, their progress on reasoning-focused tasks remains limited due to the lack of high-quality and diverse training data. In this work, we aim to address the scarcity of reasoning-focused multimodal datasets. We propose VisualWebInstruct, a novel approach that leverages search engines to create a diverse and high-quality dataset spanning multiple disciplines, including mathematics, physics, finance, and chemistry. Starting with a meticulously selected set of 30,000 seed images, we employ Google Image Search to identify websites containing similar images. We collect and process HTML data from over 700K unique URLs. Through a pipeline of content extraction, filtering, and synthesis, we construct a dataset of approximately 900K question-answer (QA) pairs, with 40% consisting of visual QA pairs and the remainder comprising text-based QA pairs. Models fine-tuned on VisualWebInstruct demonstrate significant performance improvements: (1) fine-tuning on Llava-OV results in a 10-20 absolute-point improvement across benchmarks, and (2) fine-tuning from MAmmoTH-VL yields a 5 absolute-point gain across benchmarks. Our best model, MAmmoTH-VL2, achieves state-of-the-art performance within the 10B parameter class on MMMU-Pro (40.7), MathVerse (42.6), and DynaMath (55.7). These results highlight the effectiveness of our dataset in enhancing the reasoning capabilities of vision-language models for complex multimodal tasks.
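As a rough illustration of the data-construction recipe described in the abstract (seed images, reverse image search, crawling, then QA extraction), the Python sketch below uses hypothetical stub functions in place of the paper's actual search, crawling, and LLM-based synthesis components:

```python
# Minimal sketch of a VisualWebInstruct-style mining pipeline. All helper
# functions are hypothetical stubs, not the paper's actual tooling.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QAPair:
    question: str
    answer: str
    image_url: Optional[str]  # None for text-only QA pairs (~60% of the dataset)


def reverse_image_search(seed_image: str) -> List[str]:
    """Stand-in for Google Image Search over the ~30K seed images."""
    return [f"https://example.com/page-about-{seed_image}"]


def fetch_html(url: str) -> str:
    """Stand-in for the crawler that collects HTML from ~700K unique URLs."""
    return "<html>Q: What is the derivative of x^2? A: 2x</html>"


def extract_qa(html: str, source_url: str) -> List[QAPair]:
    """Stand-in for the content extraction, filtering, and synthesis stage."""
    return [QAPair("What is the derivative of x^2?", "2x", image_url=None)]


def build_dataset(seed_images: List[str]) -> List[QAPair]:
    dataset: List[QAPair] = []
    for image in seed_images:
        for url in reverse_image_search(image):
            dataset.extend(extract_qa(fetch_html(url), url))
    return dataset


if __name__ == "__main__":
    print(build_dataset(["circuit_diagram.png"]))
```

The stubs only mark where the paper's extraction, filtering, and synthesis stages would plug in; in the actual pipeline those stages carry the quality control that yields the final ~900K QA pairs.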
Related papers
- ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models [103.25208095165486]
Existing practices rely on powerful but costly large language models (LLMs) or multimodal language models (MLMs) to produce instruction data. We present a programmatic approach that employs scene graphs as symbolic representations of images and human-written programs to systematically synthesize vision-centric instruction data. Our approach ensures the interpretability and controllability of the data generation process and scales efficiently while maintaining factual accuracy.
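As a loose illustration of that idea (symbolic scene graphs plus hand-written generator programs), the toy sketch below uses an invented dict schema and a single counting generator; it is not ProVision's actual representation or program library:

```python
# Illustrative only: a scene graph as a plain dict plus one hand-written
# "program" that emits an instruction-answer pair grounded in the graph.
scene_graph = {
    "objects": [
        {"id": 0, "name": "dog", "attributes": ["brown"]},
        {"id": 1, "name": "frisbee", "attributes": ["red"]},
    ],
    "relations": [{"subject": 0, "predicate": "catching", "object": 1}],
}


def count_objects_program(graph: dict, name: str) -> dict:
    """One templated generator: counting questions answered from the scene graph."""
    count = sum(obj["name"] == name for obj in graph["objects"])
    question = f"How many {name}s are in the image?"
    return {"instruction": question, "answer": str(count)}


print(count_objects_program(scene_graph, "dog"))
# {'instruction': 'How many dogs are in the image?', 'answer': '1'}
```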
arXiv Detail & Related papers (2024-12-09T21:44:02Z)
- InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning [58.7966588457529]
InfiMM-WebMath-40B is a high-quality dataset of interleaved image-text documents.
It comprises 24 million web pages, 85 million associated image URLs, and 40 billion text tokens, all meticulously extracted and filtered from CommonCrawl.
Our evaluations on text-only benchmarks show that, despite utilizing only 40 billion tokens, our dataset significantly enhances the performance of our 1.3B model.
Our models set a new state-of-the-art among open-source models on multi-modal math benchmarks such as MathVerse and We-Math.
arXiv Detail & Related papers (2024-09-19T08:41:21Z)
- DLM-VMTL: A Double Layer Mapper for heterogeneous data video Multi-task prompt learning [2.4121373594852846]
Multi-task learning allows a visual task to acquire rich, shareable knowledge from other tasks during joint training.
A heterogeneous-data video multi-task prompt learning (VMTL) method is proposed to address this problem.
A Double-Layer Mapper (DLM) is proposed to extract the shareable knowledge into visual prompts and align it with the representation of the primary task.
arXiv Detail & Related papers (2024-08-29T01:25:36Z)
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we craft a new dataset, VEGA, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
- Less is More: High-value Data Selection for Visual Instruction Tuning [127.38740043393527]
We propose TIVE, a high-value data selection approach that eliminates redundancy within visual instruction data and reduces the training cost.
Using only about 15% of the data, our approach achieves average performance comparable to the full-data fine-tuned model across eight benchmarks.
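A generic sketch of the keep-the-top-15% idea is shown below; the value score here is a random placeholder, whereas TIVE defines its own task-level and instance-level value estimates:

```python
# Hypothetical value-based subset selection, not TIVE's actual scoring.
import random


def value_score(example: str) -> float:
    """Placeholder value estimate; a real method scores each example's usefulness."""
    return random.random()


def select_high_value(pool: list, keep_ratio: float = 0.15) -> list:
    """Keep roughly the top 15% of instruction examples by estimated value."""
    ranked = sorted(pool, key=value_score, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_ratio))]


pool = [f"example_{i}" for i in range(100)]
print(len(select_high_value(pool)))  # -> 15
```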
arXiv Detail & Related papers (2024-03-14T16:47:25Z)
- Pushing Boundaries: Exploring Zero Shot Object Classification with Large Multimodal Models [0.09264362806173355]
Large Language and Vision Assistant models (LLVAs) engage users in rich conversational experiences intertwined with image-based queries.
This paper takes a unique perspective on LMMs, exploring their efficacy in performing image classification tasks using tailored prompts.
Our study includes a benchmarking analysis across four diverse datasets: MNIST, Cats Vs. Dogs, Hymenoptera (Ants Vs. Bees), and an unconventional dataset comprising Pox Vs. Non-Pox skin images.
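The kind of tailored prompting explored here can be sketched as follows; `query_lmm` is a hypothetical stand-in for whichever multimodal model API is used, and the label-matching step is deliberately simplistic:

```python
# Hypothetical zero-shot classification via a tailored prompt to a multimodal model.
def build_prompt(labels: list) -> str:
    options = ", ".join(labels)
    return (f"Look at the attached image and answer with exactly one word. "
            f"Which of the following best describes it: {options}?")


def classify(image_path: str, labels: list, query_lmm) -> str:
    """query_lmm(image_path, prompt) -> free-form text reply from the model."""
    reply = query_lmm(image_path, build_prompt(labels)).strip().lower()
    # Map the free-form reply back onto the candidate label set.
    return next((label for label in labels if label.lower() in reply), None)


# Example with a dummy model that always answers "Cat.":
print(classify("pet.jpg", ["cat", "dog"], lambda img, prompt: "Cat."))  # -> cat
```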
arXiv Detail & Related papers (2023-12-30T03:19:54Z)
- ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese [1.6340299456362617]
We introduce the ViCLEVR dataset, a pioneering collection for evaluating various visual reasoning capabilities in Vietnamese.
We conduct a comprehensive analysis of contemporary visual reasoning systems, offering valuable insights into their strengths and limitations.
We present PhoVIT, a comprehensive multimodal fusion model that identifies objects in images based on the questions posed.
arXiv Detail & Related papers (2023-10-27T10:44:50Z)
- Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs [49.88461345825586]
This paper proposes a new framework to enhance the fine-grained image understanding abilities of MLLMs.
We present a new method for constructing the instruction tuning dataset at a low cost by leveraging annotations in existing datasets.
We show that our model exhibits a 5.2% accuracy improvement over Qwen-VL and surpasses the accuracy of Kosmos-2 by 24.7%.
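One way to read "leveraging annotations in existing datasets at a low cost" is sketched below, where a COCO-like bounding-box record is turned into a referential QA pair; the field names and template are illustrative, not Pink's exact construction:

```python
# Hypothetical conversion of an existing detection-style annotation into a
# referential instruction-tuning example.
annotation = {"image": "000001.jpg", "category": "dog", "bbox": [120, 80, 200, 150]}


def to_referential_instruction(ann: dict) -> dict:
    x, y, w, h = ann["bbox"]
    question = (f"What object is located in the region "
                f"[{x}, {y}, {x + w}, {y + h}] of the image?")
    return {"image": ann["image"], "instruction": question, "answer": ann["category"]}


print(to_referential_instruction(annotation))
```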
arXiv Detail & Related papers (2023-10-01T05:53:15Z)
- SVIT: Scaling up Visual Instruction Tuning [26.794950789335402]
We build a dataset of 4.2 million visual instruction tuning examples, including 1.6M conversation question-answer (QA) pairs, 1.6M complex reasoning QA pairs, 1.0M referring QA pairs, and 106K detailed image descriptions.
Experiments verify that SVIT-v1.5, trained on the proposed dataset, outperforms state-of-the-art Multimodal Large Language Models on popular benchmarks.
arXiv Detail & Related papers (2023-07-09T03:25:14Z)
- Multimodal Web Navigation with Instruction-Finetuned Foundation Models [99.14209521903854]
We study data-driven offline training for web agents with vision-language foundation models.
We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages.
We empirically demonstrate that this recipe improves the agent's capabilities in grounded multimodal perception, HTML comprehension, and multi-step reasoning.
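A minimal sketch of the joint observation such an agent consumes (screenshot plus HTML plus instruction) is shown below; the dataclass fields and the stubbed policy are hypothetical, not WebGUM's actual interface:

```python
# Illustrative structure of a screenshot + HTML web-agent observation.
from dataclasses import dataclass


@dataclass
class WebObservation:
    screenshot_path: str   # rendered page image for the vision encoder
    html: str              # simplified DOM / HTML text for the language encoder
    instruction: str       # natural-language goal for the episode


def agent_step(obs: WebObservation) -> str:
    """Stub policy: a real agent grounds the instruction in both modalities and
    emits a web action such as click(element_id) or type(field, text)."""
    return 'click("search-button")'


obs = WebObservation("step0.png",
                     "<button id='search-button'>Search</button>",
                     "Find flights from Toronto to New York")
print(agent_step(obs))
```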
arXiv Detail & Related papers (2023-05-19T17:44:34Z)
- WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models [2.603259641572195]
We introduce a large-scale multi-modal corpus named WuDaoMM, containing more than 650M image-text pairs in total.
About 600 million pairs are collected from webpages in which the image and its caption are only weakly correlated.
We also release a base version of WuDaoMM with 5 million strongly correlated image-text pairs, which is sufficient to support common cross-modal model pre-training.
arXiv Detail & Related papers (2022-03-22T06:12:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.