StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized
Image-Dialogue Data
- URL: http://arxiv.org/abs/2308.10253v2
- Date: Thu, 28 Dec 2023 03:44:28 GMT
- Title: StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized
Image-Dialogue Data
- Authors: Yanda Li, Chi Zhang, Gang Yu, Zhibin Wang, Bin Fu, Guosheng Lin,
Chunhua Shen, Ling Chen, Yunchao Wei
- Abstract summary: We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
- Score: 129.92449761766025
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The remarkable multimodal capabilities demonstrated by OpenAI's GPT-4 have
sparked significant interest in the development of multimodal Large Language
Models (LLMs). A primary research objective of such models is to align visual
and textual modalities effectively while comprehending human instructions.
Current methodologies often rely on annotations derived from benchmark datasets
to construct image-dialogue datasets for training purposes, akin to instruction
tuning in LLMs. However, these datasets often exhibit domain bias, potentially
constraining the generative capabilities of the models. In an effort to
mitigate these limitations, we propose a novel data collection methodology that
synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities
of ChatGPT and text-to-image generative models to yield a diverse and
controllable dataset with varied image content. Additionally, datasets can be
arbitrarily scaled. This not only provides greater flexibility compared to
existing methodologies but also significantly enhances several model
capabilities. Our research includes comprehensive experiments conducted on
various datasets. The results emphasize substantial enhancements in more than
ten commonly assessed capabilities. Additionally, our model achieves
state-of-the-art results across multiple widely recognized multimodal
benchmarks.
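A minimal sketch of the kind of synchronized image-dialogue synthesis loop the abstract describes, assuming ChatGPT is reached through the OpenAI chat API and the text-to-image model is a Stable Diffusion checkpoint loaded with the `diffusers` library; the prompt template, model names, JSON output format, and helper function below are illustrative assumptions, not the paper's exact setup.

```python
# Hypothetical sketch: an LLM writes a text-to-image prompt plus a grounded dialogue,
# then a diffusion model renders the matching image for visual instruction tuning.
import json
from openai import OpenAI
from diffusers import StableDiffusionPipeline

client = OpenAI()  # reads OPENAI_API_KEY from the environment
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")

SYSTEM_PROMPT = (
    "You design training data for a multimodal assistant. "
    "Return JSON with two fields: 'image_prompt', a concrete text-to-image prompt, "
    "and 'dialogue', a list of {question, answer} pairs grounded only in that image."
)

def synthesize_sample(topic: str) -> dict:
    """Ask the LLM for a paired (image prompt, dialogue), then render the image."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Topic: {topic}"},
        ],
    )
    record = json.loads(response.choices[0].message.content)
    image = pipe(record["image_prompt"]).images[0]  # PIL.Image
    return {"image": image, "dialogue": record["dialogue"]}

if __name__ == "__main__":
    sample = synthesize_sample("a crowded farmers market on a rainy day")
    sample["image"].save("sample.png")
    print(sample["dialogue"])
```

Under this reading, arbitrarily scaling the dataset amounts to looping such a function over a larger pool of topics or styles.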
Related papers
- MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
This dataset includes figures such as schematic diagrams, simulated images, macroscopic/microscopic photos, and experimental visualizations.
We developed benchmarks for scientific figure captioning and multiple-choice questions, evaluating six proprietary and over ten open-source models.
The dataset and benchmarks will be released to support further research.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual tokens, which map visual features to probability distributions over the vocabulary of Large Multi-modal Models (LMMs).
We further explore the distribution of visual features in the semantic space within the LMM and the possibility of using text embeddings to represent visual information.
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
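The visual-words summary above describes mapping visual features to probability distributions over an LLM's vocabulary. The following is a generic sketch of that idea, comparing projected image features against a frozen text-embedding matrix; class names, dimensions, and the temperature value are illustrative assumptions, not the paper's implementation.

```python
# Illustrative only: turn visual features into "visual words", i.e. probability
# distributions over an LLM's token vocabulary, via similarity to text embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualWordMapper(nn.Module):
    def __init__(self, visual_dim: int, embed_matrix: torch.Tensor, temperature: float = 0.07):
        super().__init__()
        self.proj = nn.Linear(visual_dim, embed_matrix.shape[1])         # to LLM embedding size
        self.register_buffer("embed", F.normalize(embed_matrix, dim=-1))  # (vocab, d)
        self.temperature = temperature

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_patches, visual_dim)
        v = F.normalize(self.proj(visual_feats), dim=-1)   # (B, P, d)
        logits = v @ self.embed.t() / self.temperature     # (B, P, vocab)
        return logits.softmax(dim=-1)                      # distribution over the vocabulary

# Usage with made-up sizes: ViT features of width 1024, a 32k-token vocabulary.
vocab_embeddings = torch.randn(32000, 4096)
mapper = VisualWordMapper(visual_dim=1024, embed_matrix=vocab_embeddings)
visual_words = mapper(torch.randn(2, 256, 1024))  # (2, 256, 32000), rows sum to 1
```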
- Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants [65.47222691674074]
The Muffin framework employs pre-trained vision-language models as providers of visual signals.
The UniMM-Chat dataset exploits the complementarities of existing datasets to generate 1.1M high-quality and diverse multimodal instructions.
arXiv Detail & Related papers (2023-10-01T12:35:18Z)
- UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
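Of the three objectives named in the UniDiff summary above, image-text contrastive learning (ITC) is the standard CLIP-style component; the sketch below shows a generic symmetric InfoNCE version of it only, under assumed embedding shapes, and does not reproduce UniDiff's IS or RSC losses.

```python
# Generic CLIP-style image-text contrastive (ITC) loss; not UniDiff's exact formulation.
import torch
import torch.nn.functional as F

def itc_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over matched image/text pairs within a batch."""
    img = F.normalize(image_emb, dim=-1)      # (B, d)
    txt = F.normalize(text_emb, dim=-1)       # (B, d)
    logits = img @ txt.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> matching text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> matching image
    return (loss_i2t + loss_t2i) / 2

loss = itc_loss(torch.randn(8, 512), torch.randn(8, 512))
```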
- Scaling Vision-Language Models with Sparse Mixture of Experts [128.0882767889029]
We show that mixture-of-experts (MoE) techniques can achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost.
Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute and performance when scaling vision-language models.
arXiv Detail & Related papers (2023-03-13T16:00:31Z)
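As a general illustration of the sparse mixture-of-experts idea in the summary above (not that paper's architecture), a minimal top-k routed MoE feed-forward layer might look as follows; expert count, widths, and routing details are assumptions for the sketch.

```python
# Minimal top-k routed mixture-of-experts feed-forward layer (generic illustration).
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, dim: int, hidden: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim). Route each token to its top-k experts and mix their outputs.
        gate = self.router(x).softmax(dim=-1)                  # (T, E)
        weights, idx = gate.topk(self.top_k, dim=-1)           # (T, k)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize gate weights
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                                  # tokens routed to expert e
            if mask.any():
                token_ids, slot = mask.nonzero(as_tuple=True)
                out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out

moe = SparseMoE(dim=512, hidden=2048)
y = moe(torch.randn(16, 512))  # (16, 512)
```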
- Learning Sequential Latent Variable Models from Multimodal Time Series Data [6.107812768939553]
We present a self-supervised generative modelling framework to jointly learn a probabilistic latent state representation of multimodal data.
We demonstrate that our approach leads to significant improvements in prediction and representation quality.
arXiv Detail & Related papers (2022-04-21T21:59:24Z)
- Genetic Programming for Evolving a Front of Interpretable Models for Data Visualisation [4.4181317696554325]
We propose a genetic programming approach named GPtSNE for evolving interpretable mappings from a dataset to high-quality visualisations.
A multi-objective approach is designed that produces a variety of visualisations in a single run which give different trade-offs between visual quality and model complexity.
arXiv Detail & Related papers (2020-01-27T04:03:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.