Related papers: An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models

An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models

URL: http://arxiv.org/abs/2309.09958v1
Date: Mon, 18 Sep 2023 17:30:46 GMT
Title: An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models
Authors: Yadong Lu, Chunyuan Li, Haotian Liu, Jianwei Yang, Jianfeng Gao, Yelong Shen
Abstract summary: We present an empirical study of scaling LLaVA up to 33B and 65B/70B. We find that scaling LMM consistently enhances model performance and improves language capabilities. We hope that this study makes state-of-the-art LMM research at a larger scale more accessible.
Score: 116.50367506746713
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Visual instruction tuning has recently shown encouraging progress with open-source large multimodal models (LMM) such as LLaVA and MiniGPT-4. However, most existing studies of open-source LMM are performed using models with 13B parameters or smaller. In this paper we present an empirical study of scaling LLaVA up to 33B and 65B/70B, and share our findings from our explorations in image resolution, data mixing and parameter-efficient training methods such as LoRA/QLoRA. These are evaluated by their impact on the multi-modal and language capabilities when completing real-world tasks in the wild. We find that scaling LMM consistently enhances model performance and improves language capabilities, and performance of LoRA/QLoRA tuning of LMM are comparable to the performance of full-model fine-tuning. Additionally, the study highlights the importance of higher image resolutions and mixing multimodal-language data to improve LMM performance, and visual instruction tuning can sometimes improve LMM's pure language capability. We hope that this study makes state-of-the-art LMM research at a larger scale more accessible, thus helping establish stronger baselines for future research. Code and checkpoints will be made public.

Related papers

Just Noticeable Difference for Large Multimodal Models [70.41467229325345]
Just noticeable difference (JND) is the minimum change that the human visual system (HVS) can perceive.<n>We take an initial attempt and demonstrate that there exist significant visual blind spots in current LMMs.<n>Our research underscores the significance of LMM-JND as a unique perspective for studying LMMs.
arXiv Detail & Related papers (2025-07-01T07:06:32Z)
Mimicking or Reasoning: Rethinking Multi-Modal In-Context Learning in Vision-Language Models [19.361686225381447]
Vision-language models (VLMs) are widely assumed to exhibit in-context learning (ICL)<n>We propose a new MM-ICL with Reasoning pipeline that augments each demonstration with a generated rationale alongside the answer.
arXiv Detail & Related papers (2025-06-09T16:55:32Z)
LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning [39.54891426369773]
Trade-offs between model size, architecture, and performance remain underexplored. In this paper, we introduce LLaVA-MORE, a new family of MLLMs that integrates recent language models with diverse visual backbones. To ensure fair comparisons, we employ a unified training protocol applied consistently across all architectures.
arXiv Detail & Related papers (2025-03-19T18:10:12Z)
Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs [22.177654792824896]
We focus on small-sized language models (3B to 7B parameters) for their cost-efficiency and accessibility. We explore various training configurations and strategies across four open-source pre-trained models. Key insights from our work include: (i) larger batch sizes paired with lower learning rates lead to improved model performance on benchmarks such as MMLU, MTBench, and Open LLM Leaderboard; (ii) early-stage training dynamics, such as lower gradient norms and higher loss values, are strong indicators of better final model performance; (iv) we observed no significant difference in performance between phased and stacked training strategies, but
arXiv Detail & Related papers (2024-12-17T21:16:59Z)
LLaVA-KD: A Framework of Distilling Multimodal Large Language Models [70.19607283302712]
We propose a novel framework to transfer knowledge from l-MLLM to s-MLLM. Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of l-MLLM and s-MLLM. We also propose a three-stage training scheme to fully exploit the potential of s-MLLM.
arXiv Detail & Related papers (2024-10-21T17:41:28Z)
CoMMIT: Coordinated Instruction Tuning for Multimodal Large Language Models [68.64605538559312]
In this paper, we analyze the MLLM instruction tuning from both theoretical and empirical perspectives. Inspired by our findings, we propose a measurement to quantitatively evaluate the learning balance. In addition, we introduce an auxiliary loss regularization method to promote updating of the generation distribution of MLLMs.
arXiv Detail & Related papers (2024-07-29T23:18:55Z)
LLAVADI: What Matters For Multimodal Large Language Models Distillation [77.73964744238519]
In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch. Our studies involve training strategies, model choices, and distillation algorithms in the knowledge distillation process. By evaluating different benchmarks and proper strategy, even a 2.7B small-scale model can perform on par with larger models with 7B or 13B parameters.
arXiv Detail & Related papers (2024-07-28T06:10:47Z)
Exploring the Capabilities of Large Multimodal Models on Dense Text [58.82262549456294]
We propose the DT-VQA dataset, with 170k question-answer pairs. In this paper, we conduct a comprehensive evaluation of GPT4V, Gemini, and various open-source LMMs. We find that even with automatically labeled training datasets, significant improvements in model performance can be achieved.
arXiv Detail & Related papers (2024-05-09T07:47:25Z)
Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models [87.47400128150032]
We propose a novel LMM architecture named Lumen, a Large multimodal model with versatile vision-centric capability enhancement. Lumen first promotes fine-grained vision-language concept alignment. Then the task-specific decoding is carried out by flexibly routing the shared representation to lightweight task decoders.
arXiv Detail & Related papers (2024-03-12T04:13:45Z)
TinyLLaVA: A Framework of Small-scale Large Multimodal Models [11.686023770810937]
We study the effects of different vision encoders, connection modules, language models, training data and training recipes. Under our framework, we train a family of small-scale LMMs. Our best model, TinyLLaVA-3.1B, achieves better overall performance against existing 7B models such as LLaVA-1.5 and Qwen-VL.
arXiv Detail & Related papers (2024-02-22T05:05:30Z)
CaMML: Context-Aware Multimodal Learner for Large Models [16.30752006781618]
We introduce Context-Aware MultiModal Learner (CaMML) for tuning large multimodal models (LMMs) CaMML is crafted to seamlessly integrate multimodal contextual samples into large models, empowering the model to derive knowledge from analogous, domain-specific, up-to-date information. Based on CaMML, we have developed two multimodal models, CaMML-7B and CaMML-13B, that have shown exceptional performance across an array of benchmark datasets for multimodal tasks.
arXiv Detail & Related papers (2024-01-06T07:54:58Z)

This list is automatically generated from the titles and abstracts of the papers in this site.