Related papers: Kosmos-2: Grounding Multimodal Large Language Models to the World

Kosmos-2: Grounding Multimodal Large Language Models to the World

URL: http://arxiv.org/abs/2306.14824v3
Date: Thu, 13 Jul 2023 05:41:34 GMT
Title: Kosmos-2: Grounding Multimodal Large Language Models to the World
Authors: Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei
Abstract summary: We introduce Kosmos-2, a Multimodal Large Language Model (MLLM) It enables new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world. Code and pretrained models are available at https://aka.ms/kosmos-2.
Score: 107.27280175398089
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world. Specifically, we represent refer expressions as links in Markdown, i.e., ``[text span](bounding boxes)'', where object descriptions are sequences of location tokens. Together with multimodal corpora, we construct large-scale data of grounded image-text pairs (called GrIT) to train the model. In addition to the existing capabilities of MLLMs (e.g., perceiving general modalities, following instructions, and performing in-context learning), Kosmos-2 integrates the grounding capability into downstream applications. We evaluate Kosmos-2 on a wide range of tasks, including (i) multimodal grounding, such as referring expression comprehension, and phrase grounding, (ii) multimodal referring, such as referring expression generation, (iii) perception-language tasks, and (iv) language understanding and generation. This work lays out the foundation for the development of Embodiment AI and sheds light on the big convergence of language, multimodal perception, action, and world modeling, which is a key step toward artificial general intelligence. Code and pretrained models are available at https://aka.ms/kosmos-2.

Related papers

Large Concept Models: Language Modeling in a Sentence Representation Space [62.73366944266477]
We present an attempt at an architecture which operates on an explicit higher-level semantic representation, which we name a concept. Concepts are language- and modality-agnostic and represent a higher level idea or action in a flow. We show that our model exhibits impressive zero-shot generalization performance to many languages.
arXiv Detail & Related papers (2024-12-11T23:36:20Z)
GenRL: Multimodal-foundation world models for generalization in embodied agents [12.263162194821787]
Reinforcement learning (RL) is hard to scale up as it requires a complex reward design for each task. Current foundation vision-language models (VLMs) require fine-tuning or other adaptations to be adopted in embodied contexts. Lack of multimodal data in such domains represents an obstacle to developing foundation models for embodied applications.
arXiv Detail & Related papers (2024-06-26T03:41:48Z)
ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension [71.03445074045092]
We propose ClawMachine, offering a new methodology that explicitly notates each entity using token collectives groups of visual tokens. Our method unifies the prompt and answer of visual referential tasks without using additional syntax. ClawMachine achieves superior performance on scene-level and referential understanding tasks with higher efficiency.
arXiv Detail & Related papers (2024-06-17T08:39:16Z)
Probing Multimodal Large Language Models for Global and Local Semantic Representations [57.25949445963422]
We study which layers of Multimodal Large Language Models make the most effort to the global image information. In this study, we find that the intermediate layers of models can encode more global semantic information. We find that the topmost layers may excessively focus on local information, leading to a diminished ability to encode global information.
arXiv Detail & Related papers (2024-02-27T08:27:15Z)
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation [22.347590874621865]
We introduce GROUNDHOG, an MLLM developed by grounding Large Language Models to holistic segmentation. GROUNDHOG incorporates a masked feature extractor and converts extracted features into visual entity tokens for the MLLM backbone. Our experimental results show that GROUNDHOG achieves superior performance on various language grounding tasks without task-specific fine-tuning.
arXiv Detail & Related papers (2024-02-26T18:59:33Z)
CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation [88.33780780220091]
CoDi-2 is a versatile and interactive Multimodal Large Language Model (MLLM) It can follow complex multimodal interleaved instructions, conduct in-context learning (ICL), reason, chat, edit, etc.
arXiv Detail & Related papers (2023-11-30T18:21:25Z)
PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models. We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks. Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
Language Is Not All You Need: Aligning Perception with Language Models [110.51362453720458]
We introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context, and follow instructions. We train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. Experimental results show that Kosmos-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP. We also show that MLLMs can benefit from cross-modal transfer, i.e., transfer knowledge from language to multimodal, and from multimodal to language
arXiv Detail & Related papers (2023-02-27T18:55:27Z)
Exploiting BERT For Multimodal Target SentimentClassification Through Input Space Translation [75.82110684355979]
We introduce a two-stream model that translates images in input space using an object-aware transformer. We then leverage the translation to construct an auxiliary sentence that provides multimodal information to a language model. We achieve state-of-the-art performance on two multimodal Twitter datasets.
arXiv Detail & Related papers (2021-08-03T18:02:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.