Related papers: SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

URL: http://arxiv.org/abs/2402.05935v2
Date: Wed, 26 Jun 2024 07:59:03 GMT
Title: SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Authors: Dongyang Liu, Renrui Zhang, Longtian Qiu, Siyuan Huang, Weifeng Lin, Shitian Zhao, Shijie Geng, Ziyi Lin, Peng Jin, Kaipeng Zhang, Wenqi Shao, Chao Xu, Conghui He, Junjun He, Hao Shao, Pan Lu, Hongsheng Li, Yu Qiao, Peng Gao,
Abstract summary: We develop an extensive Multimodality Large Language Model (MLLM) series. We assemble a comprehensive dataset covering publicly available resources in language, vision, and vision-language tasks. We obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities.
Score: 97.40590590880144
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series developed upon SPHINX. To improve the architecture and training efficiency, we modify the SPHINX framework by removing redundant visual encoders, bypassing fully-padded sub-images with skip tokens, and simplifying multi-stage training into a one-stage all-in-one paradigm. To fully unleash the potential of MLLMs, we assemble a comprehensive multi-domain and multimodal dataset covering publicly available resources in language, vision, and vision-language tasks. We further enrich this collection with our curated OCR intensive and Set-of-Mark datasets, extending the diversity and generality. By training over different base LLMs including TinyLlama1.1B, InternLM2-7B, LLaMA2-13B, and Mixtral8x7B, we obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities. Comprehensive benchmarking reveals a strong correlation between the multi-modal performance with the data and parameter scales. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory

Related papers

TAMP: Token-Adaptive Layerwise Pruning in Multimodal Large Language Models [23.916205754112774]
Multimodal Large Language Models (MLLMs) have shown remarkable versatility in understanding diverse multimodal data and tasks. We propose TAMP, a simple yet effective pruning framework tailored for MLLMs. We validate our method on two state-of-the-art MLLMs: LLaVA-NeXT, designed for vision-language tasks, and VideoLLaMA2, capable of processing audio, visual, and language modalities.
arXiv Detail & Related papers (2025-04-14T05:44:38Z)
LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models [89.13128402847943]
We present LUSIFER, a novel zero-shot approach that adapts LLM-based embedding models for multilingual tasks without requiring multilingual supervision. LUSIFER's architecture combines a multilingual encoder, serving as a language-universal learner, with an LLM-based embedding model optimized for embedding-specific tasks. We introduce a new benchmark encompassing 5 primary embedding tasks, 123 diverse datasets, and coverage across 14 languages.
arXiv Detail & Related papers (2025-01-01T15:43:07Z)
Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models [55.25892137362187]
We introduce a new dataset featuring Multimodal Multi-Grained Concept annotations (MMGiC) for MLLMs. Our analyses reveal that multi-grained concept annotations integrate and complement each other, under our structured template and a general MLLM framework. We further validate our hypothesis by investigating the fair comparison and effective collaboration between MMGiC and image--caption data on 12 multimodal comprehension and generation benchmarks.
arXiv Detail & Related papers (2024-12-08T13:45:44Z)
OneLLM: One Framework to Align All Modalities with Language [90.14915575477197]
We present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering and reasoning.
arXiv Detail & Related papers (2023-12-06T18:59:19Z)
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models [86.478087039015]
We present a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings. Based on our proposed joint mixing, we propose an efficient strategy aiming to better capture fine-grained appearances of high-resolution images. We hope our work may cast a light on the exploration of joint mixing in future MLLM research.
arXiv Detail & Related papers (2023-11-13T18:59:47Z)
Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants [65.47222691674074]
Muffin framework employs pre-trained vision-language models to act as providers of visual signals. UniMM-Chat dataset explores the complementarities of datasets to generate 1.1M high-quality and diverse multimodal instructions.
arXiv Detail & Related papers (2023-10-01T12:35:18Z)
Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models [50.07056960586183]
We propose Position-enhanced Visual Instruction Tuning (PVIT) to extend the functionality of Multimodal Large Language Models (MLLMs) This integration promotes a more detailed comprehension of images for the MLLM. We present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model.
arXiv Detail & Related papers (2023-08-25T15:33:47Z)
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration [50.94902442781148]
We propose a novel multi-modal large language model (LLM) that seamlessly integrates visual, audio, and textual information. Macaw-LLM consists of three main components: a modality module for encoding multi-modal data, a cognitive module for harnessing pretrained LLMs, and an alignment module for harmonizing diverse representations. We construct a large-scale multi-modal instruction dataset in terms of multi-turn dialogue, including 69K image instances and 50K video instances.
arXiv Detail & Related papers (2023-06-15T12:45:25Z)
Meta-learning For Vision-and-language Cross-lingual Transfer [14.594704809280984]
We propose a novel meta-learning fine-tuning framework for vison-language models. Our framework makes current PVLMs rapidly adaptive to new languages in vision-language scenarios. Our method boosts the performance of current state-of-the-art PVLMs in both zero-shot and few-shot cross-lingual transfer.
arXiv Detail & Related papers (2023-05-24T07:51:42Z)
LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding [34.42574051786547]
Multimodal pre-training with text, layout, and image has achieved SOTA performance for visually-rich document understanding tasks. We present a multimodal pre-trained model for multilingual document understanding, which aims to bridge the language barriers for visually-rich document understanding.
arXiv Detail & Related papers (2021-04-18T12:16:00Z)

This list is automatically generated from the titles and abstracts of the papers in this site.