Reminding Multimodal Large Language Models of Object-aware Knowledge with Retrieved Tags
- URL: http://arxiv.org/abs/2406.10839v1
- Date: Sun, 16 Jun 2024 08:20:12 GMT
- Title: Reminding Multimodal Large Language Models of Object-aware Knowledge with Retrieved Tags
- Authors: Daiqing Qi, Handong Zhao, Zijun Wei, Sheng Li,
- Abstract summary: Multimodal Large Language Models (MLLMs) struggle with critical problems when required to provide a precise and detailed response to a visual instruction.
We show effectiveness in mitigating these issues, but at an expensive cost of collecting a vast amount of new data.
We propose to enhance the mapping with retrieval-augmented tag tokens, which contain rich object-aware information.
- Score: 28.368960723666458
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite recent advances in the general visual instruction-following ability of Multimodal Large Language Models (MLLMs), they still struggle with critical problems when required to provide a precise and detailed response to a visual instruction: (1) failure to identify novel objects or entities, (2) mention of non-existent objects, and (3) neglect of object's attributed details. Intuitive solutions include improving the size and quality of data or using larger foundation models. They show effectiveness in mitigating these issues, but at an expensive cost of collecting a vast amount of new data and introducing a significantly larger model. Standing at the intersection of these approaches, we examine the three object-oriented problems from the perspective of the image-to-text mapping process by the multimodal connector. In this paper, we first identify the limitations of multimodal connectors stemming from insufficient training data. Driven by this, we propose to enhance the mapping with retrieval-augmented tag tokens, which contain rich object-aware information such as object names and attributes. With our Tag-grounded visual instruction tuning with retrieval Augmentation (TUNA), we outperform baselines that share the same language model and training data on 12 benchmarks. Furthermore, we show the zero-shot capability of TUNA when provided with specific datastores.
Related papers
- Chat-3D v2: Bridging 3D Scene and Large Language Models with Object
Identifiers [62.232809030044116]
We introduce the use of object identifiers to freely reference objects during a conversation.
We propose a two-stage alignment method, which involves learning an attribute-aware token and a relation-aware token for each object.
Experiments conducted on traditional datasets like ScanQA, ScanRefer, and Nr3D/Sr3D showcase the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-12-13T14:27:45Z) - EXMODD: An EXplanatory Multimodal Open-Domain Dialogue dataset [20.445453185198186]
We propose a Multimodal Data Construction Framework (MDCF) to alleviate the significant human and resource expenditure in data collection.
MDCF automatically provides explanation for a given image and its corresponding dialogue, which can provide a certain degree of interpretability.
Experiments indicate a positive correlation between the model's ability to generate accurate understandings and high-quality responses.
arXiv Detail & Related papers (2023-10-17T03:28:29Z) - Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
Multimodal entity linking task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network $textbf(MIMIC)$ framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z) - DesCo: Learning Object Recognition with Rich Language Descriptions [93.8177229428617]
Recent development in vision-language approaches has instigated a paradigm shift in learning visual recognition models from language supervision.
We propose a new description-conditioned (DesCo) paradigm of learning object recognition models with rich language descriptions.
arXiv Detail & Related papers (2023-06-24T21:05:02Z) - Contextual Object Detection with Multimodal Large Language Models [78.30374204127418]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z) - AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z) - Representing Knowledge by Spans: A Knowledge-Enhanced Model for
Information Extraction [7.077412533545456]
We propose a new pre-trained model that learns representations of both entities and relationships simultaneously.
By encoding spans efficiently with span modules, our model can represent both entities and their relationships but requires fewer parameters than existing models.
arXiv Detail & Related papers (2022-08-20T07:32:25Z) - MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs are within two feature spaces.
We propose Multi-Granularity Alignment architecture for Visual Question Answering task (MGA-VQA)
Our model splits alignment into different levels to achieve learning better correlations without needing additional data and annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.