ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
- URL: http://arxiv.org/abs/2311.12793v2
- Date: Tue, 28 Nov 2023 08:52:50 GMT
- Title: ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
- Authors: Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang,
Feng Zhao, Dahua Lin
- Abstract summary: We introduce ShareGPT4V, a dataset featuring 1.2 million descriptive captions.
This dataset surpasses existing datasets in diversity and information content, covering world knowledge, object properties, spatial relationships, and aesthetic evaluations.
We further incorporate ShareGPT4V data into both the pre-training and SFT phases, obtaining ShareGPT4V-7B, a superior LMM.
- Score: 81.95879920888716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the realm of large multi-modal models (LMMs), efficient modality alignment
is crucial yet often constrained by the scarcity of high-quality image-text
data. To address this bottleneck, we introduce the ShareGPT4V dataset, a
pioneering large-scale resource featuring 1.2 million highly descriptive
captions, which surpasses existing datasets in diversity and information
content, covering world knowledge, object properties, spatial relationships,
and aesthetic evaluations. Specifically, ShareGPT4V originates from a curated
set of 100K high-quality captions collected from advanced GPT4-Vision and has
been expanded to 1.2M with a superb caption model trained on this subset. ShareGPT4V
first demonstrates its effectiveness for the Supervised Fine-Tuning (SFT)
phase, by substituting an equivalent quantity of detailed captions in existing
SFT datasets with a subset of our high-quality captions, significantly
enhancing LMMs such as LLaVA-7B, LLaVA-1.5-13B, and Qwen-VL-Chat-7B on the MME
and MMBench benchmarks, with respective gains of 222.8/22.0/22.3 and
2.7/1.3/1.5. We further incorporate ShareGPT4V data into both the pre-training
and SFT phases, obtaining ShareGPT4V-7B, a superior LMM based on a simple
architecture that has remarkable performance across a majority of the
multi-modal benchmarks. This project is available at
https://ShareGPT4V.github.io to serve as a pivotal resource for advancing the
LMMs community.
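
To make the SFT substitution described above concrete, the sketch below shows one way an equal-quantity caption replacement could be wired up. It is a minimal sketch, assuming a LLaVA-style JSON layout for the SFT data; the file names, JSON field names, and replacement count are illustrative assumptions, not the official ShareGPT4V tooling.

    import json

    # Minimal sketch of the equal-quantity caption substitution described in the
    # abstract. All file names and JSON field names below are assumptions based
    # on a LLaVA-style SFT layout, not the official ShareGPT4V release format.
    SHAREGPT4V_FILE = "sharegpt4v_captions.json"   # assumed: [{"image": ..., "caption": ...}, ...]
    SFT_FILE = "existing_sft_data.json"            # assumed: [{"image": ..., "conversations": [...]}, ...]

    def substitute_detailed_captions(sft_records, caption_records, num_to_replace):
        """Swap the answers of up to `num_to_replace` detailed-caption SFT samples
        for the richer ShareGPT4V captions of the same images."""
        captions_by_image = {r["image"]: r["caption"] for r in caption_records}
        replaced = 0
        for record in sft_records:
            if replaced >= num_to_replace:
                break
            image = record.get("image")
            if image in captions_by_image:
                # Keep the human turn (e.g. "Describe this image in detail.") and
                # overwrite only the assistant's reply with the detailed caption.
                record["conversations"][-1]["value"] = captions_by_image[image]
                replaced += 1
        return sft_records

    if __name__ == "__main__":
        with open(SHAREGPT4V_FILE) as f:
            caption_records = json.load(f)
        with open(SFT_FILE) as f:
            sft_records = json.load(f)
        # The replacement count here is illustrative; the paper substitutes an
        # equivalent quantity of detailed captions already present in the SFT data.
        mixed = substitute_detailed_captions(sft_records, caption_records, num_to_replace=23000)
        with open("sft_with_sharegpt4v.json", "w") as f:
            json.dump(mixed, f)
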
Related papers
- LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs [52.79503055897109]
We present EvalMi-50K, a comprehensive dataset and benchmark for evaluating large-multimodal image generation.
We propose LMM4LMM, an LMM-based metric for evaluating large multimodal T2I generation from multiple dimensions.
arXiv Detail & Related papers (2025-04-11T08:46:49Z)
- VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models [63.27511432647797]
We propose VLsI: Verbalized Layers-to-Interactions, a new VLM family in 2B and 7B model sizes.
We validate VLsI across ten challenging vision-language benchmarks, achieving notable performance gains (11.0% for 2B and 17.4% for 7B) over GPT-4V.
arXiv Detail & Related papers (2024-12-02T18:58:25Z)
- Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models [111.97026994761254]
Mixture-of-Transformers (MoT) is a sparse multi-modal transformer architecture.
MoT decouples non-embedding parameters of the model by modality.
We evaluate MoT across multiple settings and model scales.
arXiv Detail & Related papers (2024-11-07T18:59:06Z)
- Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data [21.905041803331113]
Vision-Language Models (VLMs) have recently made significant progress, but the limited scale and quality of open-source instruction data hinder their performance.
We introduce Infinity-MM, a large-scale multimodal instruction dataset with 40 million samples, enhanced through rigorous quality filtering and deduplication.
We also propose a synthetic instruction generation method based on open-source VLMs, using detailed image annotations and diverse question generation.
arXiv Detail & Related papers (2024-10-24T09:03:48Z)
- FullAnno: A Data Engine for Enhancing Image Comprehension of MLLMs [58.95386070800286]
FullAnno is a data engine that generates large-scale, high-quality, and fine-grained image annotations.
We re-annotated the COCO and Visual Genome datasets using our FullAnno system.
Experiments show that the regenerated annotation can significantly enhance the capabilities of LLaVA-v1.5 on several benchmarks.
arXiv Detail & Related papers (2024-09-20T14:33:17Z)
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites [114.22835695929682]
InternVL 1.5 is an open-source multimodal large language model (MLLM)
It bridges the capability gap between open-source and proprietary commercial models in multimodal understanding.
arXiv Detail & Related papers (2024-04-25T17:59:19Z)
- ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models [45.040292339670096]
Large vision-language models (LVLMs) have shown promise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities.
This study aims to bridge the performance gap between traditional-scale LVLMs and resource-friendly lite versions by adopting high-quality training data.
arXiv Detail & Related papers (2024-02-18T19:26:49Z)
- COCO is "ALL" You Need for Visual Instruction Fine-tuning [39.438410070172125]
Visual instruction fine-tuning (IFT) is a vital process for aligning MLLMs' output with users' intentions.
Recent studies propose to construct visual IFT datasets through a multifaceted approach.
We establish a new IFT dataset, with images sourced from the COCO dataset along with more diverse instructions.
arXiv Detail & Related papers (2024-01-17T04:43:45Z)
- To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning [82.34463739289892]
We introduce a fine-grained visual instruction dataset, LVIS-Instruct4V, which contains 220K visually aligned and context-aware instructions.
By simply replacing the LLaVA-Instruct with our LVIS-Instruct4V, we achieve better results than LLaVA on most challenging LMM benchmarks.
arXiv Detail & Related papers (2023-11-13T18:59:31Z)
- SVIT: Scaling up Visual Instruction Tuning [26.794950789335402]
We build a dataset of 4.2 million visual instruction tuning data including 1.6M conversation question-answer (QA) pairs, 1.6M complex reasoning QA pairs, 1.0M referring QA pairs and 106K detailed image descriptions.
Experiments verify that SVIT-v1.5, trained on the proposed dataset, outperforms state-of-the-art Multimodal Large Language Models on popular benchmarks.
arXiv Detail & Related papers (2023-07-09T03:25:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.