SwitchGPT: Adapting Large Language Models for Non-Text Outputs
- URL: http://arxiv.org/abs/2309.07623v1
- Date: Thu, 14 Sep 2023 11:38:23 GMT
- Title: SwitchGPT: Adapting Large Language Models for Non-Text Outputs
- Authors: Xinyu Wang, Bohan Zhuang, Qi Wu
- Abstract summary: Large Language Models (LLMs) are primarily trained on text-based datasets.
LLMs exhibit exceptional proficiencies in understanding and executing complex linguistic instructions via text outputs.
We propose a novel approach that evolves a text-based LLM into a multi-modal one.
- Score: 28.656227306028743
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs), primarily trained on text-based datasets,
exhibit exceptional proficiencies in understanding and executing complex
linguistic instructions via text outputs. However, they falter when asked to
generate non-text outputs. Concurrently, modality conversion models, such as
text-to-image, despite generating high-quality images, suffer from a lack of
extensive textual pretraining. As a result, these models are only capable of
accommodating specific image descriptions rather than comprehending more
complex instructions. To bridge this gap, we propose a novel approach,
SwitchGPT, from a modality conversion perspective that evolves a text-based
LLM into a multi-modal one. We specifically employ a minimal dataset to
instruct LLMs to recognize the intended output modality as directed by the
instructions. Consequently, the adapted LLM can effectively summon various
off-the-shelf modality conversion models from the model zoos to generate
non-text responses. This circumvents the necessity for complicated pretraining
that typically requires immense quantities of paired multi-modal data, while
simultaneously inheriting the extensive knowledge of LLMs and the capabilities
of high-quality generative models. To evaluate and compare the adapted multi-modal
LLM with its traditional counterparts, we have constructed a multi-modal
instruction benchmark that solicits diverse modality outputs. The experimental
results reveal that, with minimal training, LLMs can be conveniently adapted to
comprehend requests for non-text responses, thus achieving higher flexibility
in multi-modal scenarios. Code and data will be made available at
https://github.com/xinke-wang/SwitchGPT.
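To make the routing idea in the abstract concrete, below is a minimal sketch, not the authors' released code: the prompt format, function names, and stubbed generators are assumptions. It shows the two steps the abstract describes, namely having the adapted LLM name the intended output modality, and then delegating non-text requests to an off-the-shelf converter picked from a small model registry.

```python
# Minimal sketch of the modality-routing idea described in the abstract.
# NOTE: illustrative wiring only; prompt format, model names, and helper
# functions are assumptions, not the paper's implementation.

from typing import Callable, Dict

# Prompt asking the adapted LLM which modality the reply should be in.
# In the paper this behaviour comes from instruction tuning on a minimal
# dataset; here we only sketch the interface.
ROUTER_PROMPT = (
    "Decide the output modality (text, image, or audio) that best answers "
    "the following request, then give the content or a caption for it.\n"
    "Request: {instruction}\nModality:"
)


def query_llm(prompt: str) -> str:
    """Stand-in for the adapted LLM. Replace with any chat/completion API."""
    # Hard-coded so the sketch runs without downloading a model.
    return "image | a watercolor painting of a lighthouse at dusk"


def generate_image(caption: str) -> str:
    """Stand-in for an off-the-shelf text-to-image model from a model zoo."""
    return f"<image generated from caption: {caption!r}>"


def generate_audio(caption: str) -> str:
    """Stand-in for a text-to-audio / TTS model."""
    return f"<audio generated from caption: {caption!r}>"


# Registry of modality conversion models the LLM can "summon".
MODEL_ZOO: Dict[str, Callable[[str], str]] = {
    "image": generate_image,
    "audio": generate_audio,
}


def respond(instruction: str) -> str:
    """Route a user instruction to a text or non-text response."""
    raw = query_llm(ROUTER_PROMPT.format(instruction=instruction))
    modality, _, payload = (part.strip() for part in raw.partition("|"))
    if modality in MODEL_ZOO:
        return MODEL_ZOO[modality](payload)  # non-text branch
    return payload or raw                    # plain text branch


if __name__ == "__main__":
    print(respond("Paint me a calming picture of a lighthouse at dusk."))
```

In practice the stand-in generators would be replaced by real zoo models (for example, a Stable Diffusion pipeline from the `diffusers` library for the image branch); those generators are the pieces the approach inherits rather than retrains.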
Related papers
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast, high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
- An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation [21.154973705998945]
Existing methods leverage the text encoder of the CLIP model to represent input prompts.
Large Language Models (LLMs) offer multilingual input, accommodate longer context, and achieve superior text representation.
We propose a lightweight adapter that enables fast training of the text-to-image model using the textual representations from LLMs.
arXiv Detail & Related papers (2024-05-21T16:35:02Z)
- Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID [44.372336186832584]
We study the transferable text-to-image ReID problem, where we train a model on our proposed large-scale database.
We obtain substantial training data via Multi-modal Large Language Models (MLLMs).
We introduce a novel method that automatically identifies words in a description that do not correspond with the image.
arXiv Detail & Related papers (2024-05-08T10:15:04Z)
- ModaVerse: Efficiently Transforming Modalities with LLMs [25.49713745405194]
We introduce ModaVerse, a Multi-modal Large Language Model capable of comprehending and transforming content across various modalities.
We propose a novel Input/Output (I/O) alignment mechanism that operates directly at the level of natural language.
arXiv Detail & Related papers (2024-01-12T06:28:54Z)
- Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels? [158.96530466189986]
Multimodal large language models (MLLMs) have shown promising instruction-following capabilities on vision-language tasks.
We investigate how well multimodal models can understand textual instructions provided in pixels, despite not being explicitly trained on such data during pretraining or fine-tuning.
We train v-MLLM, a generalizable model capable of robust instruction following for both text-modality and visual-modality instructions.
arXiv Detail & Related papers (2023-11-29T14:08:53Z)
- TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models [69.49978333446538]
TEAL is an approach to treat the input from any modality as a token sequence.
It embeds the token sequence into a joint embedding space with a learnable embedding matrix.
Experiments show that TEAL achieves substantial improvements in multi-modal understanding.
arXiv Detail & Related papers (2023-11-08T10:34:16Z)
- Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration [50.94902442781148]
We propose a novel multi-modal large language model (LLM) that seamlessly integrates visual, audio, and textual information.
Macaw-LLM consists of three main components: a modality module for encoding multi-modal data, a cognitive module for harnessing pretrained LLMs, and an alignment module for harmonizing diverse representations.
We construct a large-scale multi-modal instruction dataset in terms of multi-turn dialogue, including 69K image instances and 50K video instances.
arXiv Detail & Related papers (2023-06-15T12:45:25Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [95.76661165594884]
mPLUG-Owl is a training paradigm that equips large language models (LLMs) with multi-modal abilities.
The training paradigm involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of LLM.
Experimental results show that our model outperforms existing multi-modal models.
arXiv Detail & Related papers (2023-04-27T13:27:01Z)