Kosmos-2: Grounding Multimodal Large Language Models to the World
- URL: http://arxiv.org/abs/2306.14824v3
- Date: Thu, 13 Jul 2023 05:41:34 GMT
- Title: Kosmos-2: Grounding Multimodal Large Language Models to the World
- Authors: Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming
Ma, Furu Wei
- Abstract summary: We introduce Kosmos-2, a Multimodal Large Language Model (MLLM).
It enables new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world.
Code and pretrained models are available at https://aka.ms/kosmos-2.
- Score: 107.27280175398089
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new
capabilities of perceiving object descriptions (e.g., bounding boxes) and
grounding text to the visual world. Specifically, we represent referring
expressions as links in Markdown, i.e., ``[text span](bounding boxes)'', where
object descriptions are sequences of location tokens. Together with multimodal
corpora, we construct large-scale data of grounded image-text pairs (called
GrIT) to train the model. In addition to the existing capabilities of MLLMs
(e.g., perceiving general modalities, following instructions, and performing
in-context learning), Kosmos-2 integrates the grounding capability into
downstream applications. We evaluate Kosmos-2 on a wide range of tasks,
including (i) multimodal grounding, such as referring expression comprehension
and phrase grounding, (ii) multimodal referring, such as referring expression
generation, (iii) perception-language tasks, and (iv) language understanding
and generation. This work lays out the foundation for the development of
Embodied AI and sheds light on the big convergence of language, multimodal
perception, action, and world modeling, which is a key step toward artificial
general intelligence. Code and pretrained models are available at
https://aka.ms/kosmos-2.
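To make the grounding format concrete, the sketch below shows how a referring expression and its bounding box might be serialized into the Markdown-link form described above. The 32x32 grid, the `<loc_i>` token names, and the two-corner encoding are illustrative assumptions, not the paper's exact specification; the actual location-token vocabulary is defined by the released code at https://aka.ms/kosmos-2.

```python
# Minimal sketch (not the official implementation): serialize a referring
# expression as a Markdown-style link whose "URL" is a sequence of location
# tokens. Grid size and token naming below are assumptions for illustration.

GRID = 32  # assumed number of bins per image side

def box_to_location_tokens(box, image_w, image_h, grid=GRID):
    """Discretize a pixel box (x1, y1, x2, y2) into two location tokens:
    one for the top-left grid cell and one for the bottom-right cell."""
    x1, y1, x2, y2 = box
    col1 = min(int(x1 / image_w * grid), grid - 1)
    row1 = min(int(y1 / image_h * grid), grid - 1)
    col2 = min(int(x2 / image_w * grid), grid - 1)
    row2 = min(int(y2 / image_h * grid), grid - 1)
    return [f"<loc_{row1 * grid + col1}>", f"<loc_{row2 * grid + col2}>"]

def ground_span(text_span, box, image_w, image_h):
    """Format a grounded span as '[text span](location tokens)'."""
    tokens = "".join(box_to_location_tokens(box, image_w, image_h))
    return f"[{text_span}]({tokens})"

# Example: a 224x224 image with "a snowman" at pixel box (30, 40, 180, 200).
print(ground_span("a snowman", (30, 40, 180, 200), 224, 224))
# -> [a snowman](<loc_164><loc_921>)
```

Discretizing coordinates into a fixed set of location tokens lets the same next-token decoder read and emit boxes exactly like ordinary text, which is what allows the grounding capability to be folded into downstream generation.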
Related papers
- GenRL: Multimodal-foundation world models for generalization in embodied agents [12.263162194821787]
Reinforcement learning (RL) is hard to scale up as it requires a complex reward design for each task.
Current foundation vision-language models (VLMs) require fine-tuning or other adaptations to be adopted in embodied contexts.
The lack of multimodal data in such domains is an obstacle to developing foundation models for embodied applications.
arXiv Detail & Related papers (2024-06-26T03:41:48Z) - Probing Multimodal Large Language Models for Global and Local Semantic Representations [57.25949445963422]
We study which layers of Multimodal Large Language Models contribute most to encoding global image information.
In this study, we find that the intermediate layers of models can encode more global semantic information.
We find that the topmost layers may excessively focus on local information, leading to a diminished ability to encode global information.
arXiv Detail & Related papers (2024-02-27T08:27:15Z) - GROUNDHOG: Grounding Large Language Models to Holistic Segmentation [22.347590874621865]
We introduce GROUNDHOG, an MLLM developed by grounding Large Language Models to holistic segmentation.
GROUNDHOG incorporates a masked feature extractor and converts extracted features into visual entity tokens for the MLLM backbone (see the sketch after this list).
Our experimental results show that GROUNDHOG achieves superior performance on various language grounding tasks without task-specific fine-tuning.
arXiv Detail & Related papers (2024-02-26T18:59:33Z) - CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation [88.33780780220091]
CoDi-2 is a versatile and interactive Multimodal Large Language Model (MLLM).
It can follow complex multimodal interleaved instructions, conduct in-context learning (ICL), reason, chat, edit, etc.
arXiv Detail & Related papers (2023-11-30T18:21:25Z) - PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train the multimodal input encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z) - Language Is Not All You Need: Aligning Perception with Language Models [110.51362453720458]
We introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context, and follow instructions.
We train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data.
Experimental results show that Kosmos-1 achieves impressive performance on language understanding, generation, and even OCR-free NLP.
We also show that MLLMs can benefit from cross-modal transfer, i.e., transferring knowledge from language to the multimodal setting, and from the multimodal setting back to language.
arXiv Detail & Related papers (2023-02-27T18:55:27Z) - Exploiting BERT For Multimodal Target Sentiment Classification Through
Input Space Translation [75.82110684355979]
We introduce a two-stream model that translates images in input space using an object-aware transformer.
We then leverage the translation to construct an auxiliary sentence that provides multimodal information to a language model.
We achieve state-of-the-art performance on two multimodal Twitter datasets.
arXiv Detail & Related papers (2021-08-03T18:02:38Z)
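As referenced in the GROUNDHOG entry above, the sketch below illustrates one plausible way a masked feature extractor's output could be converted into visual entity tokens for an MLLM backbone. It is an illustrative assumption (masked average pooling plus a linear projection), not GROUNDHOG's actual architecture, and the class and parameter names are hypothetical.

```python
# Hypothetical sketch: pool image features under each entity mask into one
# token embedding per entity, then project into the LLM's embedding space.
import torch
import torch.nn as nn

class VisualEntityTokenizer(nn.Module):
    """Turns per-entity binary masks over a feature map into entity tokens."""

    def __init__(self, feat_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, llm_dim)  # map into the LLM embedding space

    def forward(self, feats: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        # feats: [C, H, W] image features; masks: [N, H, W] binary entity masks.
        masks = masks.float()
        # Sum features under each mask, then normalize by mask area.
        pooled = torch.einsum("chw,nhw->nc", feats, masks)
        area = masks.sum(dim=(1, 2)).clamp(min=1.0).unsqueeze(-1)
        return self.proj(pooled / area)  # [N, llm_dim] visual entity tokens

# Usage with dummy shapes: 256-d features on a 24x24 grid, 3 entity masks.
tokens = VisualEntityTokenizer(256, 4096)(torch.randn(256, 24, 24),
                                          torch.randint(0, 2, (3, 24, 24)))
print(tokens.shape)  # torch.Size([3, 4096])
```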