SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for
Multi-modal Large Language Models
- URL: http://arxiv.org/abs/2311.07575v1
- Date: Mon, 13 Nov 2023 18:59:47 GMT
- Title: SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for
Multi-modal Large Language Models
- Authors: Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao,
Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, Jiaming Han, Siyuan Huang, Yichi
Zhang, Xuming He, Hongsheng Li, Yu Qiao
- Abstract summary: We present a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings.
Building on this joint mixing, we introduce an efficient strategy to better capture fine-grained appearances of high-resolution images.
We hope our work may cast a light on the exploration of joint mixing in future MLLM research.
- Score: 86.478087039015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present SPHINX, a versatile multi-modal large language model (MLLM) with a
joint mixing of model weights, tuning tasks, and visual embeddings. First, for
stronger vision-language alignment, we unfreeze the large language model (LLM)
during pre-training, and introduce a weight mix strategy between LLMs trained
by real-world and synthetic data. By directly integrating the weights from two
domains, the mixed LLM can efficiently incorporate diverse semantics with
favorable robustness. Then, to enable multi-purpose capabilities, we mix a
variety of tasks for joint visual instruction tuning, and design task-specific
instructions to avoid inter-task conflict. In addition to the basic visual
question answering, we include more challenging tasks such as region-level
understanding, caption grounding, document layout detection, and human pose
estimation, contributing to mutual enhancement over different scenarios.
Additionally, we propose to extract comprehensive visual embeddings from
various network architectures, pre-training paradigms, and information
granularity, providing language models with more robust image representations.
Based on our proposed joint mixing, SPHINX exhibits superior multi-modal
understanding capabilities on a wide range of applications. On top of this, we
further propose an efficient strategy aiming to better capture fine-grained
appearances of high-resolution images. With a mixing of different scales and
high-resolution sub-images, SPHINX attains exceptional visual parsing and
reasoning performance on existing evaluation benchmarks. We hope our work may
cast a light on the exploration of joint mixing in future MLLM research. Code
is released at https://github.com/Alpha-VLLM/LLaMA2-Accessory.
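The weight mix described in the abstract can be illustrated with a minimal sketch. The abstract only says that the weights from the two domains are "directly" integrated, so the linear interpolation below, the coefficient name `beta`, the function name `mix_weights`, and the flat dict-of-lists checkpoint format are all assumptions made for illustration, not the released implementation.

```python
from typing import Dict, List

# Hypothetical stand-in for a model state dict: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def mix_weights(real: Weights, synthetic: Weights, beta: float = 0.5) -> Weights:
    """Linearly interpolate two checkpoints that share the same parameter names.

    beta is the (assumed) share given to the real-world checkpoint;
    1 - beta goes to the synthetic-data checkpoint.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if real.keys() != synthetic.keys():
        raise ValueError("checkpoints must have identical parameter names")

    mixed: Weights = {}
    for name, real_values in real.items():
        syn_values = synthetic[name]
        if len(real_values) != len(syn_values):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Element-wise convex combination of the two parameter tensors.
        mixed[name] = [
            beta * r + (1.0 - beta) * s for r, s in zip(real_values, syn_values)
        ]
    return mixed


# Toy checkpoints standing in for LLMs tuned on real-world and synthetic data.
real_ckpt = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
syn_ckpt = {"layer.weight": [3.0, 4.0], "layer.bias": [2.0]}

mixed_ckpt = mix_weights(real_ckpt, syn_ckpt, beta=0.5)
print(mixed_ckpt)  # {'layer.weight': [2.0, 3.0], 'layer.bias': [1.0]}
```

With real checkpoints the same loop would run over tensors rather than Python lists; the point of the sketch is only that mixing needs no extra training, just two checkpoints with matching parameter names.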
Related papers
- PUMA: Empowering Unified MLLM with Multi-granular Visual Generation [62.747751204215916]
We propose PUMA, emPowering Unified MLLM with Multi-grAnular visual generation.
PUMA unifies multi-granular visual features as both inputs and outputs of MLLMs.
This work represents a significant step towards a truly unified MLLM capable of adapting to the granularity demands of various visual tasks.
arXiv Detail & Related papers (2024-10-17T17:59:57Z)
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation [61.392147185793476]
We present a unified and versatile foundation model, namely, SEED-X.
SEED-X is able to model multi-granularity visual semantics for comprehension and generation tasks.
We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications.
arXiv Detail & Related papers (2024-04-22T17:56:09Z)
- Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment [11.897888221717245]
This paper proposes a novel CLIP-guided contrastive-learning-based architecture to perform multi-modal feature alignment.
Our model is simple to implement without using task-specific external knowledge, and thus can easily migrate to other multi-modal tasks.
arXiv Detail & Related papers (2024-03-11T01:07:36Z)
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
- SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models [97.40590590880144]
We develop an extensive Multimodality Large Language Model (MLLM) series.
We assemble a comprehensive dataset covering publicly available resources in language, vision, and vision-language tasks.
We obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities.
arXiv Detail & Related papers (2024-02-08T18:59:48Z)
- u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model [17.3535277338312]
u-LLaVA is an innovative unifying multi-task framework that integrates pixel, regional, and global features to refine the perceptual faculties of MLLMs.
This work contributes a novel mask-based multi-task dataset comprising 277K samples, crafted to challenge and assess the fine-grained perception capabilities of MLLMs.
arXiv Detail & Related papers (2023-11-09T13:18:27Z)
- Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models [50.07056960586183]
We propose Position-enhanced Visual Instruction Tuning (PVIT) to extend the functionality of Multimodal Large Language Models (MLLMs).
This integration promotes a more detailed comprehension of images for the MLLM.
We present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model.
arXiv Detail & Related papers (2023-08-25T15:33:47Z)