CoVLM: Composing Visual Entities and Relationships in Large Language
Models Via Communicative Decoding
- URL: http://arxiv.org/abs/2311.03354v1
- Date: Mon, 6 Nov 2023 18:59:44 GMT
- Title: CoVLM: Composing Visual Entities and Relationships in Large Language
Models Via Communicative Decoding
- Authors: Junyan Li, Delin Chen, Yining Hong, Zhenfang Chen, Peihao Chen, Yikang
Shen, Chuang Gan
- Abstract summary: CoVLM can guide the LLM to explicitly compose visual entities and relationships among the text.
We propose CoVLM, which can guide the LLM to explicitly compose visual entities and relationships among the text.
- Score: 66.52659447360104
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A remarkable ability of human beings resides in compositional reasoning,
i.e., the capacity to make "infinite use of finite means". However, current
large vision-language foundation models (VLMs) fall short of such compositional
abilities due to their "bag-of-words" behaviors and inability to construct
words that correctly represent visual entities and the relations among the
entities. To this end, we propose CoVLM, which can guide the LLM to explicitly
compose visual entities and relationships among the text and dynamically
communicate with the vision encoder and detection network to achieve
vision-language communicative decoding. Specifically, we first devise a set of
novel communication tokens for the LLM, for dynamic communication between the
visual detection system and the language system. A communication token is
generated by the LLM following a visual entity or a relation, to inform the
detection network to propose regions that are relevant to the sentence
generated so far. The proposed regions-of-interests (ROIs) are then fed back
into the LLM for better language generation contingent on the relevant regions.
The LLM is thus able to compose the visual entities and relationships through
the communication tokens. The vision-to-language and language-to-vision
communication are iteratively performed until the entire sentence is generated.
Our framework seamlessly bridges the gap between visual perception and LLMs and
outperforms previous VLMs by a large margin on compositional reasoning
benchmarks (e.g., ~20% in HICO-DET mAP, ~14% in Cola top-1 accuracy, and ~3% on
ARO top-1 accuracy). We also achieve state-of-the-art performances on
traditional vision-language tasks such as referring expression comprehension
and visual question answering.
Related papers
- RelationVLM: Making Large Vision-Language Models Understand Visual Relations [66.70252936043688]
We present RelationVLM, a large vision-language model capable of comprehending various levels and types of relations whether across multiple images or within a video.
Specifically, we devise a multi-stage relation-aware training scheme and a series of corresponding data configuration strategies to bestow RelationVLM with the capabilities of understanding semantic relations.
arXiv Detail & Related papers (2024-03-19T15:01:19Z) - Large Language Models are Visual Reasoning Coordinators [144.67558375045755]
We propose a novel paradigm that coordinates multiple vision-language models for visual reasoning.
We show that our instruction tuning variant, Cola-FT, achieves state-of-the-art performance on visual question answering.
We also show that our in-context learning variant, Cola-Zero, exhibits competitive performance in zero and few-shot settings.
arXiv Detail & Related papers (2023-10-23T17:59:31Z) - Let Models Speak Ciphers: Multiagent Debate through Embeddings [84.20336971784495]
We introduce CIPHER (Communicative Inter-Model Protocol Through Embedding Representation) to address this issue.
By deviating from natural language, CIPHER offers an advantage of encoding a broader spectrum of information without any modification to the model weights.
This showcases the superiority and robustness of embeddings as an alternative "language" for communication among LLMs.
arXiv Detail & Related papers (2023-10-10T03:06:38Z) - Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and also support dynamic sequence length varying from the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z) - TouchStone: Evaluating Vision-Language Models by Language Models [91.69776377214814]
We propose an evaluation method that uses strong large language models as judges to comprehensively evaluate the various abilities of LVLMs.
We construct a comprehensive visual dialogue dataset TouchStone, consisting of open-world images and questions, covering five major categories of abilities and 27 subtasks.
We demonstrate that powerful LVLMs, such as GPT-4, can effectively score dialogue quality by leveraging their textual capabilities alone.
arXiv Detail & Related papers (2023-08-31T17:52:04Z) - Visually-Situated Natural Language Understanding with Contrastive
Reading Model and Frozen Large Language Models [24.456117679941816]
Contrastive Reading Model (Cream) is a novel neural architecture designed to enhance the language-image understanding capability of Large Language Models (LLMs)
Our approach bridges the gap between vision and language understanding, paving the way for the development of more sophisticated Document Intelligence Assistants.
arXiv Detail & Related papers (2023-05-24T11:59:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.