ChatBridge: Bridging Modalities with Large Language Model as a Language
Catalyst
- URL: http://arxiv.org/abs/2305.16103v1
- Date: Thu, 25 May 2023 14:34:08 GMT
- Title: ChatBridge: Bridging Modalities with Large Language Model as a Language
Catalyst
- Authors: Zijia Zhao, Longteng Guo, Tongtian Yue, Sihan Chen, Shuai Shao, Xinxin
Zhu, Zehuan Yuan, Jing Liu
- Abstract summary: ChatBridge is a novel multimodal language model that leverages the expressive capabilities of language to bridge the gap between various modalities.
All code, data, and models of ChatBridge will be open-sourced.
- Score: 24.517389691825667
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building general-purpose models that can perceive diverse real-world
modalities and solve various tasks is an appealing target in artificial
intelligence. In this paper, we present ChatBridge, a novel multimodal language
model that leverages the expressive capabilities of language as the catalyst to
bridge the gap between various modalities. We show that only language-paired
two-modality data is sufficient to connect all modalities. ChatBridge leverages
recent large language models (LLMs) and extends their zero-shot capabilities to
incorporate diverse multimodal inputs. ChatBridge undergoes a two-stage
training. The first stage aligns each modality with language, which brings
emergent multimodal correlation and collaboration abilities. The second stage
instruction-finetunes ChatBridge to align it with user intent using our newly
proposed multimodal instruction-tuning dataset, MULTIS, which covers 16
multimodal tasks across the text, image, video, and audio modalities.
We show strong quantitative and qualitative results on zero-shot multimodal
tasks covering text, image, video, and audio modalities. All code, data, and
models of ChatBridge will be open-sourced.
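To make the two-stage recipe in the abstract concrete (stage 1: align each modality encoder with the LLM's language space on language-paired two-modality data; stage 2: instruction-finetune on multimodal instructions), here is a minimal sketch. This is not the official ChatBridge implementation: the frozen encoder and frozen LLM below are toy stand-ins, only a small projector is trained, and all module names, dimensions, and the training step are illustrative assumptions.

```python
# Minimal sketch (not the ChatBridge code): a frozen per-modality encoder is
# bridged to a frozen LLM by a small trainable projector, the only part
# updated during stage-1 alignment. All names/dimensions are illustrative.
import torch
import torch.nn as nn

class FrozenEncoder(nn.Module):
    """Stand-in for a pretrained modality encoder (image/video/audio)."""
    def __init__(self, in_dim=1024, feat_dim=256):
        super().__init__()
        self.backbone = nn.Linear(in_dim, feat_dim)  # placeholder backbone
        for p in self.parameters():
            p.requires_grad = False                  # frozen in both stages

    def forward(self, x):                            # (B, T, in_dim)
        return self.backbone(x)                      # -> (B, T, feat_dim)

class Projector(nn.Module):
    """Trainable bridge mapping modality features into the LLM embedding space."""
    def __init__(self, feat_dim=256, llm_dim=512):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(feat_dim, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))

    def forward(self, feats):                        # (B, T, feat_dim)
        return self.proj(feats)                      # -> (B, T, llm_dim)

class FrozenLLM(nn.Module):
    """Toy stand-in for a pretrained LLM that accepts input embeddings
    (not causal; a real LLM would be a decoder-only transformer)."""
    def __init__(self, vocab=32000, llm_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, llm_dim)
        self.blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True), 2)
        self.head = nn.Linear(llm_dim, vocab)
        for p in self.parameters():
            p.requires_grad = False                  # LLM stays frozen

    def forward(self, prefix_embeds, text_ids):
        # Prepend projected modality tokens to the text token embeddings.
        x = torch.cat([prefix_embeds, self.embed(text_ids)], dim=1)
        return self.head(self.blocks(x))             # (B, T_prefix + T_text, vocab)

encoder, projector, llm = FrozenEncoder(), Projector(), FrozenLLM()
opt = torch.optim.AdamW(projector.parameters(), lr=1e-4)

def alignment_step(raw_modality, text_ids):
    """Stage 1: train only the projector with a captioning-style loss on
    modality-text pairs. Stage 2 would reuse the same forward pass on
    instruction-response data (a MULTIS-style dataset)."""
    prefix = projector(encoder(raw_modality))
    logits = llm(prefix, text_ids)[:, prefix.size(1):-1]  # next-token predictions
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), text_ids[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

dummy_modality = torch.randn(2, 16, 1024)            # e.g. pooled video/audio features
dummy_text = torch.randint(0, 32000, (2, 12))        # tokenized caption
print(alignment_step(dummy_modality, dummy_text))
```

The design point the sketch illustrates is the one the abstract makes: because each modality is aligned to language independently, language-paired two-modality data is enough for the bridged modalities to be combined at inference time.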
Related papers
- AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling [115.89786751297348]
We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities.
We build a multimodal text-centric dataset for multimodal alignment pre-training.
We show that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities.
arXiv Detail & Related papers (2024-02-19T15:33:10Z)
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action [46.76487873983082]
Unified-IO 2 is the first autoregressive multimodal model capable of understanding and generating image, text, audio, and action.
We train our model from scratch on a large multimodal pre-training corpus from diverse sources.
With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark.
arXiv Detail & Related papers (2023-12-28T17:57:06Z)
- CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation [88.33780780220091]
CoDi-2 is a versatile and interactive Multimodal Large Language Model (MLLM)
It can follow complex multimodal interleaved instructions, conduct in-context learning (ICL), reason, chat, edit, etc.
arXiv Detail & Related papers (2023-11-30T18:21:25Z)
- TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved multimodal instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z)
- Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration [50.94902442781148]
We propose a novel multi-modal large language model (LLM) that seamlessly integrates visual, audio, and textual information.
Macaw-LLM consists of three main components: a modality module for encoding multi-modal data, a cognitive module for harnessing pretrained LLMs, and an alignment module for harmonizing diverse representations.
We construct a large-scale multi-modal instruction dataset of multi-turn dialogues, including 69K image instances and 50K video instances.
arXiv Detail & Related papers (2023-06-15T12:45:25Z)
- TextMI: Textualize Multimodal Information for Integrating Non-verbal Cues in Pre-trained Language Models [5.668457303716451]
We propose TextMI as a general, competitive baseline for multimodal behavioral analysis tasks.
Our approach significantly reduces model complexity, adds interpretability to the model's decision, and can be applied for a diverse set of tasks.
arXiv Detail & Related papers (2023-03-27T17:54:32Z)
- PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
- Language Is Not All You Need: Aligning Perception with Language Models [110.51362453720458]
We introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context, and follow instructions.
We train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data.
Experimental results show that Kosmos-1 achieves impressive performance on language understanding, generation, and even OCR-free NLP.
We also show that MLLMs can benefit from cross-modal transfer, i.e., transferring knowledge from language to multimodal tasks and from multimodal tasks to language.
arXiv Detail & Related papers (2023-02-27T18:55:27Z)