Language Is Not All You Need: Aligning Perception with Language Models
- URL: http://arxiv.org/abs/2302.14045v2
- Date: Wed, 1 Mar 2023 11:04:51 GMT
- Title: Language Is Not All You Need: Aligning Perception with Language Models
- Authors: Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal,
Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra, Qiang
Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit
Som, Xia Song, Furu Wei
- Abstract summary: We introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context, and follow instructions.
We train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data.
Experimental results show that Kosmos-1 achieves impressive performance on language understanding, generation, and even OCR-free NLP.
We also show that MLLMs can benefit from cross-modal transfer, i.e., transferring knowledge from language to multimodal tasks and from multimodal tasks to language.
- Score: 110.51362453720458
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A big convergence of language, multimodal perception, action, and world
modeling is a key step toward artificial general intelligence. In this work, we
introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive
general modalities, learn in context (i.e., few-shot), and follow instructions
(i.e., zero-shot). Specifically, we train Kosmos-1 from scratch on web-scale
multimodal corpora, including arbitrarily interleaved text and images,
image-caption pairs, and text data. We evaluate various settings, including
zero-shot, few-shot, and multimodal chain-of-thought prompting, on a wide range
of tasks without any gradient updates or finetuning. Experimental results show
that Kosmos-1 achieves impressive performance on (i) language understanding,
generation, and even OCR-free NLP (directly fed with document images), (ii)
perception-language tasks, including multimodal dialogue, image captioning,
visual question answering, and (iii) vision tasks, such as image recognition
with descriptions (specifying classification via text instructions). We also
show that MLLMs can benefit from cross-modal transfer, i.e., transfer knowledge
from language to multimodal, and from multimodal to language. In addition, we
introduce a Raven IQ test dataset, which diagnoses the nonverbal reasoning
capability of MLLMs.
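To make the setup above concrete, here is a minimal sketch, assuming a toy decoder-only model, of how arbitrarily interleaved text and images can be flattened into a single embedding sequence that one causal language model attends over. ToyVisionEncoder, the dimensions, and the decoder layer are illustrative assumptions, not the Kosmos-1 implementation.

```python
# Minimal sketch (assumed toy setup, not the authors' code): interleave text
# token embeddings and image patch embeddings into one sequence for a causal LM.
import torch
import torch.nn as nn

D_MODEL, VOCAB_SIZE = 64, 1000

class ToyVisionEncoder(nn.Module):
    """Stand-in image encoder that emits a few embedding vectors per image."""
    def __init__(self, num_patches=4, d_model=D_MODEL):
        super().__init__()
        self.num_patches = num_patches
        self.proj = nn.Linear(3 * 16 * 16, d_model)

    def forward(self, image):                      # image: (3, 64, 64)
        patches = image.reshape(self.num_patches, -1)[:, : 3 * 16 * 16]
        return self.proj(patches)                  # (num_patches, d_model)

text_embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
vision_enc = ToyVisionEncoder()

# One interleaved document: text span, image, text span (token ids are arbitrary).
interleaved = [
    ("text", torch.tensor([5, 17, 42])),
    ("image", torch.randn(3, 64, 64)),
    ("text", torch.tensor([7, 99])),
]

# Flatten into a single embedding sequence so one language model can attend
# across both modalities; few-shot prompting without gradient updates relies
# on feeding such mixed sequences directly as context.
chunks = [text_embed(x) if kind == "text" else vision_enc(x) for kind, x in interleaved]
sequence = torch.cat(chunks, dim=0)                # (total_len, D_MODEL)

# Toy causal layer: boolean mask forbids attention to future positions.
decoder = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4)
total_len = sequence.size(0)
causal_mask = torch.triu(torch.ones(total_len, total_len, dtype=torch.bool), diagonal=1)
hidden = decoder(sequence.unsqueeze(1), src_mask=causal_mask)
print(hidden.shape)                                # torch.Size([9, 1, 64])
```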
Related papers
- VLIS: Unimodal Language Models Guide Multimodal Language Generation [23.094728230459125]
We introduce Visual-Language models as Importance Sampling weights (VLIS).
It combines the visual conditioning capability of vision-language models with the language understanding of unimodal text-only language models without further training.
VLIS improves vision-language models on diverse tasks, including commonsense understanding and complex text generation (a rough sketch of this kind of decoding-time combination appears after this list).
arXiv Detail & Related papers (2023-10-15T07:58:52Z)
- Ziya-Visual: Bilingual Large Vision-Language Model via Multi-Task Instruction Tuning [27.544311403607786]
We introduce the Ziya-Visual series, a set of bilingual large-scale vision-language models (LVLMs).
Our models adopt the Querying Transformer from BLIP-2 and further explore supporting optimization schemes.
In addition, we leverage the understanding ability of GPT-4 in multi-modal scenarios to translate our gathered English image-text datasets into Chinese.
arXiv Detail & Related papers (2023-10-12T09:39:17Z)
- Kosmos-G: Generating Images in Context with Multimodal Large Language Models [117.0259361818715]
Current subject-driven image generation methods require test-time tuning and cannot accept interleaved multi-image and text input.
This paper presents Kosmos-G, a model that leverages the advanced multimodal perception capabilities of Multimodal Large Language Models.
Kosmos-G demonstrates an impressive capability of zero-shot subject-driven generation with interleaved multi-image and text input.
arXiv Detail & Related papers (2023-10-04T17:28:44Z)
- TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved multimodal instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z)
- Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer that translates non-linguistic images into sequences of discrete tokens, treating them like a foreign language.
The resulting visual tokens carry high-level semantics comparable to words and support a dynamic sequence length that varies with the image.
This unification empowers the resulting model, LaVIT, to serve as an impressive generalist interface that understands and generates multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z)
- Kosmos-2: Grounding Multimodal Large Language Models to the World [107.27280175398089]
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM).
It enables new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world.
Code and pretrained models are available at https://aka.ms/kosmos-2.
arXiv Detail & Related papers (2023-06-26T16:32:47Z)
- PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
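As noted in the VLIS entry above, a text-only language model and a vision-language model can be combined at decoding time without any further training. The sketch below shows one plausible reading of using a vision-language model as importance sampling weights: reweight the text-only model's next-token distribution by how much the image raises or lowers each token's probability under the vision-language model. The vocabulary, probability vectors, and the alpha knob are toy assumptions, not the paper's actual numbers or procedure.

```python
# Hedged sketch of a decoding-time combination in the spirit of VLIS; the
# distributions below are hand-made placeholders, not real model outputs.
import numpy as np

vocab = ["a", "dog", "cat", "sleeps", "."]

p_text = np.array([0.30, 0.25, 0.25, 0.15, 0.05])            # text-only LM: p(token | context)
p_vlm_with_image = np.array([0.20, 0.55, 0.05, 0.15, 0.05])  # VLM: p(token | image, context)
p_vlm_no_image = np.array([0.25, 0.25, 0.30, 0.15, 0.05])    # VLM: p(token | context only)

alpha = 1.0  # hypothetical knob controlling how strongly visual evidence counts

# Importance weight per token: how much the image shifts the VLM's probability.
# Tokens supported by the image get boosted inside the text LM's distribution.
weights = (p_vlm_with_image / p_vlm_no_image) ** alpha
combined = p_text * weights
combined /= combined.sum()

for token, p in zip(vocab, combined):
    print(f"{token:>8}: {p:.3f}")
print("next token:", vocab[int(np.argmax(combined))])  # "dog" once image evidence is added
```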
This list is automatically generated from the titles and abstracts of the papers on this site.