Related papers: X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages

X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages

URL: http://arxiv.org/abs/2305.04160v3
Date: Mon, 22 May 2023 02:37:02 GMT
Title: X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages
Authors: Feilong Chen, Minglun Han, Haozhi Zhao, Qingyang Zhang, Jing Shi, Shuang Xu, Bo Xu
Abstract summary: We propose X-LLM, which converts Multi-modalities into foreign languages using X2L interfaces and inputs them into a large Language model (ChatGLM) X-LLM demonstrates impressive multimodel chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions.
Score: 20.274614342856978
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Large language models (LLMs) have demonstrated remarkable language abilities. GPT-4, based on advanced LLMs, exhibits extraordinary multimodal capabilities beyond previous visual language models. We attribute this to the use of more advanced LLMs compared with previous multimodal models. Unfortunately, the model architecture and training strategies of GPT-4 are unknown. To endow LLMs with multimodal capabilities, we propose X-LLM, which converts Multi-modalities (images, speech, videos) into foreign languages using X2L interfaces and inputs them into a large Language model (ChatGLM). Specifically, X-LLM aligns multiple frozen single-modal encoders and a frozen LLM using X2L interfaces, where ``X'' denotes multi-modalities such as image, speech, and videos, and ``L'' denotes languages. X-LLM's training consists of three stages: (1) Converting Multimodal Information: The first stage trains each X2L interface to align with its respective single-modal encoder separately to convert multimodal information into languages. (2) Aligning X2L representations with the LLM: single-modal encoders are aligned with the LLM through X2L interfaces independently. (3) Integrating multiple modalities: all single-modal encoders are aligned with the LLM through X2L interfaces to integrate multimodal capabilities into the LLM. Our experiments show that X-LLM demonstrates impressive multimodel chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields a 84.5\% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. And we also conduct quantitative tests on using LLM for ASR and multimodal ASR, hoping to promote the era of LLM-based speech recognition.

Related papers

Liquid: Language Models are Scalable and Unified Multi-modal Generators [112.71734051183726]
Liquid is an auto-regressive generation paradigm that seamlessly integrates visual comprehension and generation. Unlike previous multimodal large language model (MLLM), Liquid achieves this integration using a single large language model. For the first time, Liquid uncovers a scaling law that performance drop unavoidably brought by the unified training of visual and language tasks.
arXiv Detail & Related papers (2024-12-05T16:48:16Z)
Push the Limit of Multi-modal Emotion Recognition by Prompting LLMs with Receptive-Field-Aware Attention Weighting [11.926100290196828]
We present a framework, Lantern, that can improve the performance of a certain vanilla model by prompting large language models with receptive-field-aware attention weighting. In the experiments, vanilla models CORECT and SDT are deployed in Lantern with GPT-4 or Llama-3.1-405B.
arXiv Detail & Related papers (2024-11-26T18:35:24Z)
LLaVA-KD: A Framework of Distilling Multimodal Large Language Models [70.19607283302712]
We propose a novel framework to transfer knowledge from l-MLLM to s-MLLM. Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of l-MLLM and s-MLLM. We also propose a three-stage training scheme to fully exploit the potential of s-MLLM.
arXiv Detail & Related papers (2024-10-21T17:41:28Z)
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models [97.40590590880144]
We develop an extensive Multimodality Large Language Model (MLLM) series. We assemble a comprehensive dataset covering publicly available resources in language, vision, and vision-language tasks. We obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities.
arXiv Detail & Related papers (2024-02-08T18:59:48Z)
OneLLM: One Framework to Align All Modalities with Language [90.14915575477197]
We present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering and reasoning.
arXiv Detail & Related papers (2023-12-06T18:59:19Z)
Towards Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage and Sharing in LLMs [72.49064988035126]
We propose an approach called MKS2, aimed at enhancing multimodal large language models (MLLMs) Specifically, we introduce the Modular Visual Memory, a component integrated into the internal blocks of LLMs, designed to store open-world visual information efficiently. Our experiments demonstrate that MKS2 substantially augments the reasoning capabilities of LLMs in contexts necessitating physical or commonsense knowledge.
arXiv Detail & Related papers (2023-11-27T12:29:20Z)
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration [50.94902442781148]
We propose a novel multi-modal large language model (LLM) that seamlessly integrates visual, audio, and textual information. Macaw-LLM consists of three main components: a modality module for encoding multi-modal data, a cognitive module for harnessing pretrained LLMs, and an alignment module for harmonizing diverse representations. We construct a large-scale multi-modal instruction dataset in terms of multi-turn dialogue, including 69K image instances and 50K video instances.
arXiv Detail & Related papers (2023-06-15T12:45:25Z)
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [95.76661165594884]
mPLUG-Owl is a training paradigm that equips large language models (LLMs) with multi-modal abilities. The training paradigm involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of LLM. Experimental results show that our model outperforms existing multi-modal models.
arXiv Detail & Related papers (2023-04-27T13:27:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.