Generative Multi-Modal Knowledge Retrieval with Large Language Models
- URL: http://arxiv.org/abs/2401.08206v1
- Date: Tue, 16 Jan 2024 08:44:29 GMT
- Title: Generative Multi-Modal Knowledge Retrieval with Large Language Models
- Authors: Xinwei Long, Jiali Zeng, Fandong Meng, Zhiyuan Ma, Kaiyan Zhang, Bowen Zhou, Jie Zhou
- Abstract summary: We propose an innovative end-to-end generative framework for multi-modal knowledge retrieval.
Our framework takes advantage of the fact that large language models (LLMs) can effectively serve as virtual knowledge bases.
We demonstrate significant improvements ranging from 3.0% to 14.6% across all evaluation metrics when compared to strong baselines.
- Score: 75.70313858231833
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge retrieval with multi-modal queries plays a crucial role in
supporting knowledge-intensive multi-modal applications. However, existing
methods face challenges in terms of their effectiveness and training
efficiency, especially when it comes to training and integrating multiple
retrievers to handle multi-modal queries. In this paper, we propose an
innovative end-to-end generative framework for multi-modal knowledge retrieval.
Our framework takes advantage of the fact that large language models (LLMs) can
effectively serve as virtual knowledge bases, even when trained with limited
data. We retrieve knowledge via a two-step process: 1) generating knowledge
clues related to the queries, and 2) obtaining the relevant document by
searching databases using the knowledge clue. In particular, we first introduce
an object-aware prefix-tuning technique to guide multi-grained visual learning.
Then, we align multi-grained visual features into the textual feature space of
the LLM, employing the LLM to capture cross-modal interactions. Subsequently,
we construct instruction data with a unified format for model training.
Finally, we propose the knowledge-guided generation strategy to impose prior
constraints in the decoding steps, thereby promoting the generation of
distinctive knowledge clues. Through experiments conducted on three benchmarks,
we demonstrate significant improvements ranging from 3.0% to 14.6% across all
evaluation metrics when compared to strong baselines.
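The object-aware prefix-tuning and feature-alignment steps above are described only at a high level. Below is a minimal PyTorch sketch of the general pattern, assuming a frozen vision encoder that supplies one global image vector plus several object-level region vectors and an LLM with a 4096-dimensional token-embedding space; the module names, dimensions, and prefix length are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch (not the paper's released code) of mapping multi-grained visual
# features -- one global image vector plus per-object region vectors -- into an
# LLM's textual embedding space and prepending them as a soft prefix.
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    def __init__(self, vis_dim=768, llm_dim=4096, n_prefix=8):
        super().__init__()
        # Separate projections for global and object-level features (multi-grained).
        self.global_proj = nn.Linear(vis_dim, llm_dim)
        self.object_proj = nn.Linear(vis_dim, llm_dim)
        # Learnable prefix vectors, in the spirit of prefix-tuning.
        self.prefix = nn.Parameter(torch.randn(n_prefix, llm_dim) * 0.02)

    def forward(self, global_feat, object_feats, text_embeds):
        # global_feat: (B, vis_dim); object_feats: (B, K, vis_dim); text_embeds: (B, T, llm_dim)
        g = self.global_proj(global_feat).unsqueeze(1)   # (B, 1, llm_dim)
        o = self.object_proj(object_feats)               # (B, K, llm_dim)
        p = self.prefix.unsqueeze(0).expand(text_embeds.size(0), -1, -1)
        # [prefix | global | objects | text] is then fed to the (frozen) LLM.
        return torch.cat([p, g, o, text_embeds], dim=1)

# Shape check with random tensors; real feature extractors and the LLM are omitted.
mod = VisualPrefix()
fused = mod(torch.randn(2, 768), torch.randn(2, 5, 768), torch.randn(2, 32, 4096))
print(fused.shape)  # torch.Size([2, 46, 4096])
```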
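The knowledge-guided generation strategy is likewise only sketched in the abstract. The snippet below shows one common way to impose such prior constraints at decoding time: generation is restricted to a prefix trie built over valid knowledge clues, and the generated clue is then mapped back to its source document. The GPT-2 backbone, the toy clue-to-document table, and the helper names are assumptions for illustration only; the paper's actual clue vocabulary and constraint mechanism may differ.

```python
# Minimal sketch of trie-constrained clue generation followed by a document lookup.
# Not the authors' code: the backbone, clue table, and helpers are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")      # placeholder backbone
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Toy knowledge base: clue string -> document id.
clue_to_doc = {
    "Eiffel Tower height 330 metres": "doc_001",
    "Great Wall length 21196 km": "doc_002",
}

def build_trie(clues):
    """Prefix trie over tokenized clues; decoding may only follow stored paths."""
    trie = {}
    for clue in clues:
        node = trie
        for tok in tokenizer.encode(clue):
            node = node.setdefault(tok, {})
    return trie

trie = build_trie(clue_to_doc)

prompt = "Query: how tall is the landmark in the picture?\nKnowledge clue:"
inputs = tokenizer(prompt, return_tensors="pt")
prompt_len = inputs["input_ids"].shape[1]

def allowed_tokens(batch_id, input_ids):
    # Walk the trie with the tokens generated so far (after the prompt).
    node = trie
    for tok in input_ids[prompt_len:].tolist():
        node = node.get(tok, {})
    return list(node.keys()) or [tokenizer.eos_token_id]

out = model.generate(**inputs, max_new_tokens=16,
                     prefix_allowed_tokens_fn=allowed_tokens)
clue = tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True)
print(clue, "->", clue_to_doc.get(clue, "<no match>"))
```

This relies on the `prefix_allowed_tokens_fn` hook of Hugging Face `generate()`, which filters candidate tokens at every decoding step; any equivalent constrained-decoding mechanism would serve the same purpose.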
Related papers
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, reflecting their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
- RoRA-VLM: Robust Retrieval-Augmented Vision Language Models [41.09545760534495]
RoRA-VLM is a novel and robust retrieval-augmentation framework specifically tailored for vision-language models.
We conduct extensive experiments to validate the effectiveness and robustness of our proposed methods on three widely adopted benchmark datasets.
arXiv Detail & Related papers (2024-10-11T14:51:00Z)
- Fine-tuning Multimodal Large Language Models for Product Bundling [53.01642741096356]
We introduce Bundle-MLLM, a novel framework that fine-tunes large language models (LLMs) through a hybrid item tokenization approach.
Specifically, we integrate textual, media, and relational data into a unified tokenization, introducing a soft separation token to distinguish between textual and non-textual tokens.
We propose a progressive optimization strategy that fine-tunes LLMs for disentangled objectives: 1) learning bundle patterns and 2) enhancing multimodal semantic understanding specific to product bundling.
arXiv Detail & Related papers (2024-07-16T13:30:14Z)
- NoteLLM-2: Multimodal Large Representation Models for Recommendation [60.17448025069594]
We investigate the potential of Large Language Models to enhance multimodal representation in multimodal item-to-item recommendations.
One feasible method is the transfer of Multimodal Large Language Models (MLLMs) for representation tasks.
We propose a novel training framework, NoteLLM-2, specifically designed for multimodal representation.
arXiv Detail & Related papers (2024-05-27T03:24:01Z)
- LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present the Language-Assisted Multi-Modal (LAMM) instruction-tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)
- Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications [47.501121601856795]
Multimodality Representation Learning is the technique of embedding information from different modalities together with their correlations.
Cross-modal interaction and complementary information from different modalities are crucial for advanced models to perform any multimodal task.
This survey presents the literature on the evolution and enhancement of deep learning multimodal architectures.
arXiv Detail & Related papers (2023-02-01T11:48:34Z)
- A Unified Continuous Learning Framework for Multi-modal Knowledge Discovery and Pre-training [73.7507857547549]
We propose to unify knowledge discovery and multi-modal pre-training in a continuous learning framework.
For knowledge discovery, a pre-trained model is used to identify cross-modal links on a graph.
For model pre-training, the knowledge graph is used as the external knowledge to guide the model updating.
arXiv Detail & Related papers (2022-06-11T16:05:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated list (including all information) and is not responsible for any consequences arising from its use.