What Matters in Training a GPT4-Style Language Model with Multimodal
Inputs?
- URL: http://arxiv.org/abs/2307.02469v2
- Date: Sun, 30 Jul 2023 13:20:39 GMT
- Title: What Matters in Training a GPT4-Style Language Model with Multimodal
Inputs?
- Authors: Yan Zeng, Hanbo Zhang, Jiani Zheng, Jiangnan Xia, Guoqiang Wei, Yang
Wei, Yuchen Zhang, Tao Kong
- Abstract summary: Large Language Models (LLMs) have displayed exceptional multi-modal capabilities in following open-ended instructions given images.
The performance of these models depends heavily on design choices such as network structures, training data, and training strategies.
This paper presents a systematic and comprehensive study, quantitatively and qualitatively, on training such models.
- Score: 24.676820488258336
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in Large Language Models (LLMs) such as GPT4 have
displayed exceptional multi-modal capabilities in following open-ended
instructions given images. However, the performance of these models heavily
relies on design choices such as network structures, training data, and
training strategies, and these choices have not been extensively discussed in
the literature, making it difficult to quantify progress in this field. To
address this issue, this paper presents a systematic and comprehensive study,
quantitatively and qualitatively, on training such models. We implement over 20
variants with controlled settings. Concretely, for network structures, we
compare different LLM backbones and model designs. For training data, we
investigate the impact of data and sampling strategies. For instructions, we
explore the influence of diversified prompts on the instruction-following
ability of the trained models. For benchmarks, we contribute the first, to our
best knowledge, comprehensive evaluation set including both image and video
tasks through crowd-sourcing. Based on our findings, we present Lynx, which
achieves the most accurate multi-modal understanding while maintaining the
best multi-modal generation ability among existing open-source GPT4-style
models.
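The design space described in the abstract is easiest to picture concretely. The sketch below is a minimal PyTorch illustration of one common GPT4-style architecture that such controlled variants fall into: a frozen vision encoder produces patch features, a small trainable projector maps them into the LLM's embedding space, and the projected tokens are prepended to the text embeddings as a visual prefix. All module names, dimensions, and the choice of a single linear projector are assumptions for illustration, not the exact Lynx implementation.

```python
# Minimal sketch of a "visual prefix" multimodal LLM, one of the design
# patterns compared in studies like this one. Names, shapes, and the linear
# projector are illustrative assumptions, not the paper's exact model.
import torch
import torch.nn as nn


class VisualPrefixLM(nn.Module):
    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder        # e.g. a ViT, typically frozen
        self.llm = llm                              # decoder-only LLM backbone
        self.projector = nn.Linear(vision_dim, llm_dim)  # small trainable adapter

    def forward(self, pixel_values, text_embeds):
        # pixel_values: (B, 3, H, W); text_embeds: (B, T, llm_dim)
        with torch.no_grad():                       # keep the vision tower frozen
            patch_feats = self.vision_encoder(pixel_values)   # (B, N, vision_dim)
        visual_prefix = self.projector(patch_feats)           # map into LLM space
        inputs = torch.cat([visual_prefix, text_embeds], dim=1)
        # Assumes the LLM consumes embedding inputs directly and is trained with
        # standard next-token prediction on the concatenated sequence.
        return self.llm(inputs)
```

Alternative designs in the same space inject visual features through added cross-attention layers rather than input-level concatenation; comparing such choices under controlled settings is the kind of ablation the paper's 20+ variants cover.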
Related papers
- MIO: A Foundation Model on Multimodal Tokens [74.85153216521945]
We introduce MIO, a novel foundation model built on multimodal tokens.
MIO is capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner.
arXiv Detail & Related papers (2024-09-26T09:57:16Z)
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
- MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
This dataset includes figures such as schematic diagrams, simulated images, macroscopic/microscopic photos, and experimental visualizations.
We developed benchmarks for scientific figure captioning and multiple-choice questions, evaluating six proprietary and over ten open-source models.
The dataset and benchmarks will be released to support further research.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
- 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities [17.374241865041856]
We show the possibility of training one model to solve at least 3x more tasks/modalities than existing ones and doing so without a loss in performance.
We successfully scale the training to a three billion parameter model using tens of modalities and different datasets.
The resulting models and training code are open sourced at 4m.epfl.ch.
arXiv Detail & Related papers (2024-06-13T17:59:42Z)
- What matters when building vision-language models? [52.8539131958858]
We develop Idefics2, an efficient foundational vision-language model with 8 billion parameters.
Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks.
We release the model (base, instructed, and chat) along with the datasets created for its training.
arXiv Detail & Related papers (2024-05-03T17:00:00Z)
- The Revolution of Multimodal Large Language Models: A Survey [46.84953515670248]
Multimodal Large Language Models (MLLMs) can seamlessly integrate visual and textual modalities.
This paper provides a review of recent visual-based MLLMs, analyzing their architectural choices, multimodal alignment strategies, and training techniques.
arXiv Detail & Related papers (2024-02-19T19:01:01Z)
- When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL streamlines training by updating only 0.5% of the total parameters, achieved through a unique mode-approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z)
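PETAL's 0.5% figure refers to the fraction of parameters that are actually updated. The snippet below is not PETAL's mode-approximation technique; it is a generic sketch of the parameter-efficient recipe such methods build on: freeze the backbone, attach small trainable modules, and verify the trainable fraction. The low-rank adapter form and all dimensions are assumptions for illustration.

```python
# Generic parameter-efficient tuning sketch (NOT PETAL's mode-approximation
# method): freeze the backbone, train only small low-rank adapters, and
# report what fraction of parameters remains trainable.
import torch.nn as nn


class LowRankAdapter(nn.Module):
    def __init__(self, dim, rank=8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)  # project down to a small rank
        self.up = nn.Linear(rank, dim, bias=False)    # project back up

    def forward(self, x):
        return x + self.up(self.down(x))              # residual low-rank update


def make_parameter_efficient(backbone, dim=4096, rank=8, num_adapters=32):
    for p in backbone.parameters():                   # freeze every backbone weight
        p.requires_grad = False
    adapters = nn.ModuleList(LowRankAdapter(dim, rank) for _ in range(num_adapters))
    total = sum(p.numel() for p in backbone.parameters())
    trainable = sum(p.numel() for p in adapters.parameters())
    print(f"trainable fraction: {trainable / (total + trainable):.4%}")
    return adapters
```

With dim=4096, rank=8, and 32 adapters, the trainable modules hold roughly 2M parameters, a small fraction of a multi-billion-parameter backbone.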
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses generative models on both sides, combining the abilities of ChatGPT and text-to-image models to produce paired training data.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
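StableLLaVA's core idea, pairing a generated dialogue with a generated image, can be sketched as a simple loop. In the sketch below, `chat_model` and `text_to_image` are hypothetical callables standing in for ChatGPT and a text-to-image model; no real vendor API or response schema is implied, and the returned image is assumed to expose a PIL-style `save` method.

```python
# Sketch of a synchronized image-dialogue synthesis loop in the spirit of
# StableLLaVA. `chat_model` and `text_to_image` are hypothetical callables;
# no real API, prompt format, or response schema is implied.
import json


def synthesize_pairs(topics, chat_model, text_to_image, out_path="synthetic.jsonl"):
    with open(out_path, "w") as f:
        for topic in topics:
            # 1) Ask the chat model for an image description plus a grounded dialogue.
            prompt = (
                f"Write a detailed image description about '{topic}', then a short "
                "question-answer dialogue about that image. Return JSON with keys "
                "'description' and 'dialogue'."
            )
            record = json.loads(chat_model(prompt))        # assumes JSON output
            # 2) Render the described scene with the text-to-image model.
            image = text_to_image(record["description"])
            image_path = f"{topic.replace(' ', '_')}.png"
            image.save(image_path)                         # assumes PIL-style image
            # 3) Store the (image, dialogue) pair as one visual instruction example.
            f.write(json.dumps({"image": image_path,
                                "conversations": record["dialogue"]}) + "\n")
```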
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning [43.54069813039309]
We study vision-language instruction tuning based on the pretrained BLIP-2 models.
InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets.
Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks.
arXiv Detail & Related papers (2023-05-11T00:38:10Z)
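Both InstructBLIP and the main paper's prompt experiments ultimately hinge on how (image, instruction, response) triples are serialized for the language model. The template below is a generic illustration with assumed special tokens and wording; it is not the exact prompt format of either paper.

```python
# Generic serialization of a visual instruction-tuning example. The special
# token and templates are assumptions for illustration, not InstructBLIP's or
# Lynx's exact prompt format.
IMAGE_TOKEN = "<image>"   # placeholder replaced by visual features at runtime


def build_example(instruction, response, diversify=False, variant_id=0):
    # Optionally vary the instruction wrapper, mirroring the study of how
    # diversified prompts affect instruction-following ability.
    templates = [
        "User: {img}\n{inst}\nAssistant: ",
        "{img} Question: {inst} Answer: ",
    ]
    template = templates[variant_id % len(templates)] if diversify else templates[0]
    prompt = template.format(img=IMAGE_TOKEN, inst=instruction)
    # The training loss is typically applied only to the response tokens.
    return {"prompt": prompt, "target": response}


print(build_example("Describe the main object in the image.",
                    "A red bicycle leaning against a brick wall."))
```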
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.