Related papers: Generative Multi-Modal Knowledge Retrieval with Large Language Models

Generative Multi-Modal Knowledge Retrieval with Large Language Models

URL: http://arxiv.org/abs/2401.08206v1
Date: Tue, 16 Jan 2024 08:44:29 GMT
Title: Generative Multi-Modal Knowledge Retrieval with Large Language Models
Authors: Xinwei Long, Jiali Zeng, Fandong Meng, Zhiyuan Ma, Kaiyan Zhang, Bowen Zhou, Jie Zhou
Abstract summary: We propose an innovative end-to-end generative framework for multi-modal knowledge retrieval. Our framework takes advantage of the fact that large language models (LLMs) can effectively serve as virtual knowledge bases. We demonstrate significant improvements ranging from 3.0% to 14.6% across all evaluation metrics when compared to strong baselines.
Score: 75.70313858231833
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Knowledge retrieval with multi-modal queries plays a crucial role in supporting knowledge-intensive multi-modal applications. However, existing methods face challenges in terms of their effectiveness and training efficiency, especially when it comes to training and integrating multiple retrievers to handle multi-modal queries. In this paper, we propose an innovative end-to-end generative framework for multi-modal knowledge retrieval. Our framework takes advantage of the fact that large language models (LLMs) can effectively serve as virtual knowledge bases, even when trained with limited data. We retrieve knowledge via a two-step process: 1) generating knowledge clues related to the queries, and 2) obtaining the relevant document by searching databases using the knowledge clue. In particular, we first introduce an object-aware prefix-tuning technique to guide multi-grained visual learning. Then, we align multi-grained visual features into the textual feature space of the LLM, employing the LLM to capture cross-modal interactions. Subsequently, we construct instruction data with a unified format for model training. Finally, we propose the knowledge-guided generation strategy to impose prior constraints in the decoding steps, thereby promoting the generation of distinctive knowledge clues. Through experiments conducted on three benchmarks, we demonstrate significant improvements ranging from 3.0% to 14.6% across all evaluation metrics when compared to strong baselines.

Related papers

RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks. Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs. In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
RoRA-VLM: Robust Retrieval-Augmented Vision Language Models [41.09545760534495]
RORA-VLM is a novel and robust retrieval augmentation framework specifically tailored for vision-language models. We conduct extensive experiments to validate the effectiveness and robustness of our proposed methods on three widely adopted benchmark datasets.
arXiv Detail & Related papers (2024-10-11T14:51:00Z)
ModalPrompt:Dual-Modality Guided Prompt for Continual Learning of Large Multimodal Models [40.7613157799378]
Large Multimodal Models (LMMs) exhibit remarkable multi-tasking ability by learning mixed datasets jointly. Existing methods leverage data replay or model expansion, both of which are not specially developed for LMMs. We propose a novel dual-modality guided prompt learning framework (ModalPrompt) tailored for multimodal continual learning.
arXiv Detail & Related papers (2024-10-08T09:35:37Z)
Fine-tuning Multimodal Large Language Models for Product Bundling [53.01642741096356]
We introduce Bundle-MLLM, a novel framework that fine-tunes large language models (LLMs) through a hybrid item tokenization approach. Specifically, we integrate textual, media, and relational data into a unified tokenization, introducing a soft separation token to distinguish between textual and non-textual tokens. We propose a progressive optimization strategy that fine-tunes LLMs for disentangled objectives: 1) learning bundle patterns and 2) enhancing multimodal semantic understanding specific to product bundling.
arXiv Detail & Related papers (2024-07-16T13:30:14Z)
NoteLLM-2: Multimodal Large Representation Models for Recommendation [60.17448025069594]
We investigate the potential of Large Language Models to enhance multimodal representation in multimodal item-to-item recommendations. One feasible method is the transfer of Multimodal Large Language Models (MLLMs) for representation tasks. We propose a novel training framework, NoteLLM-2, specifically designed for multimodal representation.
arXiv Detail & Related papers (2024-05-27T03:24:01Z)
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present Language-Assisted Multi-Modal instruction tuning dataset, framework, and benchmark. Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs. We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)
Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications [47.501121601856795]
Multimodality Representation Learning is a technique of learning to embed information from different modalities and their correlations. Cross-modal interaction and complementary information from different modalities are crucial for advanced models to perform any multimodal task. This survey presents the literature on the evolution and enhancement of deep learning multimodal architectures.
arXiv Detail & Related papers (2023-02-01T11:48:34Z)
A Unified Continuous Learning Framework for Multi-modal Knowledge Discovery and Pre-training [73.7507857547549]
We propose to unify knowledge discovery and multi-modal pre-training in a continuous learning framework. For knowledge discovery, a pre-trained model is used to identify cross-modal links on a graph. For model pre-training, the knowledge graph is used as the external knowledge to guide the model updating.
arXiv Detail & Related papers (2022-06-11T16:05:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.