Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and
Text Integration
- URL: http://arxiv.org/abs/2306.09093v1
- Date: Thu, 15 Jun 2023 12:45:25 GMT
- Title: Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and
Text Integration
- Authors: Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu,
Zefeng Du, Shuming Shi, Zhaopeng Tu
- Abstract summary: We propose a novel multi-modal large language model (LLM) that seamlessly integrates visual, audio, and textual information.
Macaw-LLM consists of three main components: a modality module for encoding multi-modal data, a cognitive module for harnessing pretrained LLMs, and an alignment module for harmonizing diverse representations.
We construct a large-scale multi-modal instruction dataset consisting of multi-turn dialogues, including 69K image instances and 50K video instances.
- Score: 50.94902442781148
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although instruction-tuned large language models (LLMs) have exhibited
remarkable capabilities across various NLP tasks, their effectiveness on other
data modalities beyond text has not been fully studied. In this work, we
propose Macaw-LLM, a novel multi-modal LLM that seamlessly integrates visual,
audio, and textual information. Macaw-LLM consists of three main components: a
modality module for encoding multi-modal data, a cognitive module for
harnessing pretrained LLMs, and an alignment module for harmonizing diverse
representations. Our novel alignment module seamlessly bridges multi-modal
features to textual features, simplifying the adaptation process from the
modality modules to the cognitive module. In addition, we construct a
large-scale multi-modal instruction dataset in terms of multi-turn dialogue,
including 69K image instances and 50K video instances. We have made our data,
code and model publicly available, which we hope can pave the way for future
research in multi-modal LLMs and expand the capabilities of LLMs to handle
diverse data modalities and address complex real-world scenarios.
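To make the three-module layout concrete, below is a minimal, hypothetical sketch of an alignment module that maps features from a frozen image/audio/video encoder (the modality module) into the textual embedding space of a pretrained LLM (the cognitive module) via learnable query tokens and cross-attention. The class name, dimensions, and the specific cross-attention design are illustrative assumptions inferred from the abstract, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class AlignmentModuleSketch(nn.Module):
    """Hypothetical sketch of the alignment idea described in the abstract:
    bridge frozen modality-encoder features to the LLM's textual embedding
    space. Names and dimensions are illustrative assumptions only."""

    def __init__(self, modality_dim: int = 1024, llm_dim: int = 4096,
                 num_prefix_tokens: int = 64, num_heads: int = 8):
        super().__init__()
        # Learnable query tokens that live in the LLM embedding space.
        self.queries = nn.Parameter(torch.randn(num_prefix_tokens, llm_dim) * 0.02)
        # Cross-attention: queries attend over modality-encoder features.
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=llm_dim, kdim=modality_dim, vdim=modality_dim,
            num_heads=num_heads, batch_first=True)
        self.proj = nn.Linear(llm_dim, llm_dim)

    def forward(self, modality_feats: torch.Tensor) -> torch.Tensor:
        # modality_feats: (batch, seq_len, modality_dim), e.g. patch or frame features.
        batch = modality_feats.size(0)
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)
        aligned, _ = self.cross_attn(queries, modality_feats, modality_feats)
        # Result: (batch, num_prefix_tokens, llm_dim), ready to be prepended
        # to the text token embeddings fed into the LLM.
        return self.proj(aligned)


if __name__ == "__main__":
    aligner = AlignmentModuleSketch()
    fake_video_feats = torch.randn(2, 256, 1024)  # stand-in for encoder output
    prefix = aligner(fake_video_feats)
    print(prefix.shape)  # torch.Size([2, 64, 4096])
```

In a setup like this, the aligned prefix tokens would be concatenated with the text embeddings before they enter the LLM, and typically only the alignment parameters would be trained while the encoders stay frozen; the paper's released code is the authoritative reference for the actual design.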
Related papers
- Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback [38.708690624594794]
Video and text multimodal alignment remains challenging, primarily due to the limited volume and quality of multimodal instruction-tuning data.
We present a novel alignment strategy, Reinforcement Learning from AI Feedback (RLAIF), in which a multimodal AI system oversees itself.
Specifically, we propose context-aware reward modeling by providing detailed video descriptions as context during the generation of preference feedback.
arXiv Detail & Related papers (2024-02-06T06:27:40Z)
- MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples [63.78384552789171]
This paper introduces Multi-Modal In-Context Tuning (MMICT), a novel multi-modal fine-tuning paradigm.
We propose the Multi-Modal Hub (M-Hub), a unified module that captures various multi-modal features according to different inputs and objectives.
Based on M-Hub, MMICT enables MM-LLMs to learn from in-context visual-guided textual features and subsequently generate outputs conditioned on the textual-guided visual features.
arXiv Detail & Related papers (2023-12-11T13:11:04Z)
- OneLLM: One Framework to Align All Modalities with Language [90.14915575477197]
We present OneLLM, an MLLM that aligns eight modalities to language using a unified framework.
OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering and reasoning.
arXiv Detail & Related papers (2023-12-06T18:59:19Z)
- On the Performance of Multimodal Language Models [4.677125897916577]
This study conducts a comparative analysis of different multimodal instruction tuning approaches.
We reveal key insights for guiding architectural choices when incorporating multimodal capabilities into large language models.
arXiv Detail & Related papers (2023-10-04T23:33:36Z)
- Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants [65.47222691674074]
The Muffin framework employs pre-trained vision-language models to act as providers of visual signals.
The UniMM-Chat dataset explores the complementarities of datasets to generate 1.1M high-quality and diverse multimodal instructions.
arXiv Detail & Related papers (2023-10-01T12:35:18Z)
- AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model [33.072967177313025]
We present Any-Modality Augmented Language Model (AnyMAL), a unified model that reasons over diverse input modality signals.
AnyMAL inherits the powerful text-based reasoning abilities of state-of-the-art LLMs, including LLaMA-2 (70B).
We conduct comprehensive empirical analysis comprising both human and automatic evaluations, and demonstrate state-of-the-art performance on various multimodal tasks.
arXiv Detail & Related papers (2023-09-27T22:50:51Z)
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [95.76661165594884]
mPLUG-Owl is a training paradigm that equips large language models (LLMs) with multi-modal abilities.
The training paradigm involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of the LLM.
Experimental results show that our model outperforms existing multi-modal models.
arXiv Detail & Related papers (2023-04-27T13:27:01Z)
- mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video [89.19867891570945]
mPLUG-2 is a new unified paradigm with modularized design for multi-modal pretraining.
It shares common universal modules for modality collaboration and disentangles different modality modules to deal with modality entanglement.
Different modules can be flexibly selected for different understanding and generation tasks across all modalities, including text, image, and video.
arXiv Detail & Related papers (2023-02-01T12:40:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.