SwitchGPT: Adapting Large Language Models for Non-Text Outputs
- URL: http://arxiv.org/abs/2309.07623v1
- Date: Thu, 14 Sep 2023 11:38:23 GMT
- Title: SwitchGPT: Adapting Large Language Models for Non-Text Outputs
- Authors: Xinyu Wang, Bohan Zhuang, Qi Wu
- Abstract summary: Large Language Models (LLMs) are primarily trained on text-based datasets.
LLMs exhibit exceptional proficiencies in understanding and executing complex linguistic instructions via text outputs.
We propose a novel approach that evolves a text-based LLM into a multi-modal one.
- Score: 28.656227306028743
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs), primarily trained on text-based datasets,
exhibit exceptional proficiencies in understanding and executing complex
linguistic instructions via text outputs. However, they falter when requested to
generate non-text outputs. Concurrently, modality conversion models, such as
text-to-image, despite generating high-quality images, suffer from a lack of
extensive textual pretraining. As a result, these models are only capable of
accommodating specific image descriptions rather than comprehending more
complex instructions. To bridge this gap, we propose a novel approach,
SwitchGPT, from a modality conversion perspective that evolves a text-based
LLM into a multi-modal one. We specifically employ a minimal dataset to
instruct LLMs to recognize the intended output modality as directed by the
instructions. Consequently, the adapted LLM can effectively summon various
off-the-shelf modality conversion models from the model zoos to generate
non-text responses. This circumvents the necessity for complicated pretraining
that typically requires immense quantities of paired multi-modal data, while
simultaneously inheriting the extensive knowledge of LLMs and the ability of
high-quality generative models. To evaluate and compare the adapted multi-modal
LLM with its traditional counterparts, we have constructed a multi-modal
instruction benchmark that solicits diverse modality outputs. The experimental
results reveal that, with minimal training, LLMs can be conveniently adapted to
comprehend requests for non-text responses, thus achieving higher flexibility
in multi-modal scenarios. Code and data will be made available at
https://github.com/xinke-wang/SwitchGPT.
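The routing idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `<modality>: <content>` output format, the names `parse_llm_output`, `dispatch`, and `RoutedResponse`, and the stand-in converters are all assumptions made for the example. In practice the converters would be off-the-shelf models such as a text-to-image or text-to-speech system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical set of output modalities the adapted LLM may announce.
KNOWN_MODALITIES = {"text", "image", "speech"}


@dataclass
class RoutedResponse:
    modality: str  # modality of the final response
    payload: str   # text answer, or a handle to the generated artifact


def parse_llm_output(raw: str) -> Tuple[str, str]:
    """Split an assumed '<modality>: <content>' string into its two parts.

    Output without a recognized modality tag is treated as plain text.
    """
    tag, sep, content = raw.partition(":")
    tag = tag.strip().lower()
    if not sep or tag not in KNOWN_MODALITIES:
        return "text", raw.strip()
    return tag, content.strip()


# A "model zoo" here is just a mapping from modality name to a converter.
ModelZoo = Dict[str, Callable[[str], str]]


def dispatch(raw: str, zoo: ModelZoo) -> RoutedResponse:
    """Route the LLM output to the converter registered for its modality."""
    modality, content = parse_llm_output(raw)
    converter = zoo.get(modality)
    if converter is None:
        # No converter registered (or plain text requested): answer in text.
        return RoutedResponse("text", content)
    return RoutedResponse(modality, converter(content))


# Stand-in converters; real ones would call generative models.
demo_zoo: ModelZoo = {
    "image": lambda desc: f"<image generated from: {desc}>",
    "speech": lambda desc: f"<audio generated from: {desc}>",
}

if __name__ == "__main__":
    print(dispatch("image: a red fox in snow", demo_zoo))
    print(dispatch("The capital of France is Paris.", demo_zoo))
```

Keeping the zoo as a plain mapping means a new modality can be supported by registering one more converter, without retraining the language model in this sketch.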
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- LLMs can see and hear without any training [63.964888082106974]
MILS is a simple, training-free approach to imbue multimodal capabilities into your favorite LLM.
We establish a new state-of-the-art on emergent zero-shot image, video and audio captioning.
Being a gradient-free optimization approach, MILS can invert multimodal embeddings into text.
arXiv Detail & Related papers (2025-01-30T02:16:35Z)
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
- An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation [21.154973705998945]
Existing methods leverage the text encoder of the CLIP model to represent input prompts.
Large Language Models (LLMs) offer multilingual input, accommodate longer context, and achieve superior text representation.
We propose a lightweight adapter that enables fast training of the text-to-image model using the textual representations from LLMs.
arXiv Detail & Related papers (2024-05-21T16:35:02Z)
- Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID [44.372336186832584]
We study the transferable text-to-image ReID problem, where we train a model on our proposed large-scale database.
We obtain substantial training data via Multi-modal Large Language Models (MLLMs).
We introduce a novel method that automatically identifies words in a description that do not correspond with the image.
arXiv Detail & Related papers (2024-05-08T10:15:04Z)
- ModaVerse: Efficiently Transforming Modalities with LLMs [25.49713745405194]
We introduce ModaVerse, a Multi-modal Large Language Model capable of comprehending and transforming content across various modalities.
We propose a novel Input/Output (I/O) alignment mechanism that operates directly at the level of natural language.
arXiv Detail & Related papers (2024-01-12T06:28:54Z)
- TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models [69.49978333446538]
TEAL is an approach to treat the input from any modality as a token sequence.
It embeds the token sequence into a joint embedding space with a learnable embedding matrix.
Experiments show that TEAL achieves substantial improvements in multi-modal understanding.
arXiv Detail & Related papers (2023-11-08T10:34:16Z)
- Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration [50.94902442781148]
We propose a novel multi-modal large language model (LLM) that seamlessly integrates visual, audio, and textual information.
Macaw-LLM consists of three main components: a modality module for encoding multi-modal data, a cognitive module for harnessing pretrained LLMs, and an alignment module for harmonizing diverse representations.
We construct a large-scale multi-modal instruction dataset in terms of multi-turn dialogue, including 69K image instances and 50K video instances.
arXiv Detail & Related papers (2023-06-15T12:45:25Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.