Modality-Aware Integration with Large Language Models for
Knowledge-based Visual Question Answering
- URL: http://arxiv.org/abs/2402.12728v2
- Date: Sun, 3 Mar 2024 04:51:28 GMT
- Title: Modality-Aware Integration with Large Language Models for
Knowledge-based Visual Question Answering
- Authors: Junnan Dong, Qinggang Zhang, Huachi Zhou, Daochen Zha, Pai Zheng, Xiao
Huang
- Abstract summary: We present a novel modality-aware integration with large language models (LLMs) for KVQA (MAIL).
MAIL carefully leverages multimodal knowledge for both image understanding and knowledge reasoning.
Experiments on two benchmark datasets show the superiority of MAIL with 24x fewer resources.
- Score: 28.48844388792774
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge-based visual question answering (KVQA) has been extensively studied
to answer visual questions with external knowledge, e.g., knowledge graphs
(KGs). While several attempts have been made to leverage large language
models (LLMs) as an implicit knowledge source, it remains challenging since
LLMs may generate hallucinations. Moreover, multiple knowledge sources, e.g.,
images, KGs and LLMs, cannot be readily aligned for complex scenarios. To
tackle these challenges, we present a novel modality-aware integration with LLMs for KVQA
(MAIL). It carefully leverages multimodal knowledge for both image
understanding and knowledge reasoning. Specifically, (i) we propose a two-stage
prompting strategy with LLMs to densely embody the image into a scene graph
with detailed visual features; (ii) we construct a coupled concept graph by
linking the mentioned entities with external facts. (iii) A tailored
pseudo-siamese graph medium fusion is designed for sufficient multimodal
fusion. We utilize the shared mentioned entities in two graphs as mediums to
bridge a tight inter-modal exchange, while maximally preserving insightful
intra-modal learning by constraining the fusion within mediums. Extensive
experiments on two benchmark datasets show the superiority of MAIL with 24x
fewer resources.
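To make the fusion idea above more concrete, below is a minimal sketch (not the authors' released implementation; all class names, shapes, and the additive update are assumptions) of restricting cross-graph message passing to the shared "medium" entities, so inter-modal exchange happens only where the scene graph and the concept graph mention the same entity while all other nodes keep purely intra-modal embeddings:

```python
# Hypothetical sketch of medium-based graph fusion (assumed names and shapes,
# not the MAIL code): two modality-specific graphs exchange information only
# through shared "medium" entities that appear in both graphs.
import torch
import torch.nn as nn


class MediumFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.concept_to_scene = nn.Linear(dim, dim)  # message into the scene graph
        self.scene_to_concept = nn.Linear(dim, dim)  # message into the concept graph

    def forward(self, scene_nodes, concept_nodes, medium_pairs):
        """scene_nodes: (Ns, dim), concept_nodes: (Nc, dim),
        medium_pairs: list of (scene_idx, concept_idx) for shared entities."""
        scene_out = scene_nodes.clone()
        concept_out = concept_nodes.clone()
        for s, c in medium_pairs:
            # Fuse only at medium nodes; non-medium nodes stay intra-modal.
            scene_out[s] = scene_nodes[s] + self.concept_to_scene(concept_nodes[c])
            concept_out[c] = concept_nodes[c] + self.scene_to_concept(scene_nodes[s])
        return scene_out, concept_out


# Toy usage: 5 scene-graph nodes, 4 concept-graph nodes, 2 shared entities.
fusion = MediumFusion(dim=16)
scene = torch.randn(5, 16)
concept = torch.randn(4, 16)
fused_scene, fused_concept = fusion(scene, concept, [(0, 2), (3, 1)])
```

Constraining the exchange to the mediums is one way to read the abstract's claim of preserving intra-modal learning: nodes without a counterpart in the other graph are never directly mixed across modalities.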
Related papers
- VAT-KG: Knowledge-Intensive Multimodal Knowledge Graph Dataset for Retrieval-Augmented Generation [3.1033038923749774]
We propose the first concept-centric and knowledge-intensive multimodal knowledge graph that covers visual, audio, and text information.
Our construction pipeline ensures cross-modal knowledge alignment between multimodal data and fine-grained semantics.
We introduce a novel multimodal RAG framework that retrieves detailed concept-level knowledge in response to queries from arbitrary modalities.
arXiv Detail & Related papers (2025-06-11T07:22:57Z)
- Aligning Vision to Language: Text-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning [10.761218096540976]
Multimodal reasoning in Large Language Models (LLMs) struggles with incomplete knowledge and hallucination artifacts.
We propose Vision-align-to-Language integrated Knowledge Graph (VaLiK), a novel approach for constructing Multimodal Knowledge Graphs.
arXiv Detail & Related papers (2025-03-17T09:31:14Z)
- Harnessing Large Language Models for Knowledge Graph Question Answering via Adaptive Multi-Aspect Retrieval-Augmentation [81.18701211912779]
We introduce an Adaptive Multi-Aspect Retrieval-augmented over KGs (Amar) framework.
This method retrieves knowledge including entities, relations, and subgraphs, and converts each piece of retrieved text into prompt embeddings.
Our method has achieved state-of-the-art performance on two common datasets.
arXiv Detail & Related papers (2024-12-24T16:38:04Z)
- Distill Visual Chart Reasoning Ability from LLMs to MLLMs [38.62832112530892]
Solving complex chart Q&A tasks requires advanced visual reasoning abilities in multimodal large language models (MLLMs).
We propose Code-as-Intermediary Translation (CIT), a cost-effective, efficient and easily scalable data synthesis method for distilling visual reasoning abilities from LLMs to MLLMs.
We employ text-based synthesizing techniques to construct chart-plotting code and produce ReachQA, a dataset containing 3k reasoning-intensive charts and 20k Q&A pairs.
arXiv Detail & Related papers (2024-10-24T14:50:42Z)
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, reflecting their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
- Fine-tuning Multimodal Large Language Models for Product Bundling [53.01642741096356]
We introduce Bundle-MLLM, a novel framework that fine-tunes large language models (LLMs) through a hybrid item tokenization approach.
Specifically, we integrate textual, media, and relational data into a unified tokenization, introducing a soft separation token to distinguish between textual and non-textual tokens.
We propose a progressive optimization strategy that fine-tunes LLMs for disentangled objectives: 1) learning bundle patterns and 2) enhancing multimodal semantic understanding specific to product bundling.
arXiv Detail & Related papers (2024-07-16T13:30:14Z)
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
- Multimodal Reasoning with Multimodal Knowledge Graph [19.899398342533722]
Multimodal reasoning with large language models (LLMs) often suffers from hallucinations and the presence of deficient or outdated knowledge.
We propose the Multimodal Reasoning with Multimodal Knowledge Graph (MR-MKG) method to learn rich and semantic knowledge across modalities.
arXiv Detail & Related papers (2024-06-04T07:13:23Z)
- NoteLLM-2: Multimodal Large Representation Models for Recommendation [60.17448025069594]
We investigate the potential of Large Language Models to enhance multimodal representation in multimodal item-to-item recommendations.
One feasible method is the transfer of Multimodal Large Language Models (MLLMs) for representation tasks.
We propose a novel training framework, NoteLLM-2, specifically designed for multimodal representation.
arXiv Detail & Related papers (2024-05-27T03:24:01Z)
- Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion [70.9767518332692]
Multimodal Large Language Models (MLLMs) that incorporate LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks.
However, they fall short in comprehending context involving multiple images.
We propose a two-phase paradigm, browse-and-concentrate, to enable in-depth multimodal context fusion.
arXiv Detail & Related papers (2024-02-19T14:59:07Z)
- LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [58.82222646803248]
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals.
Most of the existing MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge.
We propose a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge at two levels.
arXiv Detail & Related papers (2023-11-20T15:56:44Z)