LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model
- URL: http://arxiv.org/abs/2401.02330v4
- Date: Thu, 22 Feb 2024 07:12:44 GMT
- Title: LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model
- Authors: Yichen Zhu, Minjie Zhu, Ning Liu, Zhicai Ou, Xiaofeng Mou, Jian Tang
- Abstract summary: We introduce LLaVA-$\phi$ (LLaVA-Phi), an efficient multi-modal assistant.
LLaVA-Phi harnesses the power of the recently advanced small language model, Phi-2.
- Score: 20.209674713676872
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce LLaVA-$\phi$ (LLaVA-Phi), an efficient
multi-modal assistant that harnesses the power of the recently advanced small
language model, Phi-2, to facilitate multi-modal dialogues. LLaVA-Phi marks a
notable advancement in the realm of compact multi-modal models. It demonstrates
that even smaller language models, with as few as 2.7B parameters, can
effectively engage in intricate dialogues that integrate both textual and
visual elements, provided they are trained with high-quality corpora. Our model
delivers commendable performance on publicly available benchmarks that
encompass visual comprehension, reasoning, and knowledge-based perception.
Beyond its remarkable performance in multi-modal dialogue tasks, our model
opens new avenues for applications in time-sensitive environments and systems
that require real-time interaction, such as embodied agents. It highlights the
potential of smaller language models to achieve sophisticated levels of
understanding and interaction, while maintaining greater resource
efficiency. The project is available at https://github.com/zhuyiche/llava-phi.
Related papers
- Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities [0.0]
Mini-Omni2 is a visual-audio assistant capable of providing real-time, end-to-end voice responses to vision and audio queries.
We propose a three-stage training process to align modalities, allowing the language model to handle multi-modal inputs and outputs after training on a limited dataset.
arXiv Detail & Related papers (2024-10-15T02:10:45Z)
- S3: A Simple Strong Sample-effective Multimodal Dialog System [61.31055673156622]
We present a conceptually simple yet powerful baseline for the multimodal dialog task, an S3 model, that achieves near state-of-the-art results.
The system is based on a pre-trained large language model, pre-trained modality encoders for image and audio, and a trainable modality projector.
arXiv Detail & Related papers (2024-06-26T12:45:43Z)
- LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models [50.259006481656094]
We present a novel interactive application aimed towards understanding the internal mechanisms of large vision-language models.
Our interface is designed to enhance the interpretability of the image patches, which are instrumental in generating an answer.
We present a case study of how our application can aid in understanding failure mechanisms in a popular large multi-modal model: LLaVA.
arXiv Detail & Related papers (2024-04-03T23:57:34Z)
- Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models [25.724995114710165]
We investigate the design aspects of Multimodal Small Language Models (MSLMs) and propose an efficient multimodal assistant named Mipha.
Our Mipha-3B outperforms the state-of-the-art large MLLMs, especially LLaVA-1.5-13B, on multiple benchmarks.
arXiv Detail & Related papers (2024-03-10T12:43:27Z)
- TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones [18.954681684239358]
This study introduces TinyGPT-V, a novel open-source MLLM, designed for efficient training and inference across various vision-language tasks.
With a 2.8-billion-parameter language model, TinyGPT-V achieves results comparable to those of its larger counterparts in VQA and image inference tasks.
arXiv Detail & Related papers (2023-12-28T07:11:41Z)
- Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
- mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration [74.31268379055201]
mPLUG-Owl2 is a versatile multi-modal large language model.
It effectively leverages modality collaboration to improve performance in both text and multi-modal tasks.
arXiv Detail & Related papers (2023-11-07T14:21:29Z)
- LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing [99.80742991922992]
The system can have multi-turn dialogues with human users by taking multimodal user inputs and generating multimodal responses.
LLaVA-Interactive goes beyond language prompts: visual prompts are enabled to align human intents in the interaction.
arXiv Detail & Related papers (2023-11-01T15:13:43Z)
- DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention [55.2825684201129]
DeepSpeed-VisualChat is designed to optimize Large Language Models (LLMs) by incorporating multi-modal capabilities.
Our framework is notable for (1) its open-source support for multi-round and multi-image dialogues, (2) introducing an innovative multi-modal causal attention mechanism, and (3) utilizing data blending techniques on existing datasets to assure seamless interactions.
arXiv Detail & Related papers (2023-09-25T17:53:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.