CUE-M: Contextual Understanding and Enhanced Search with Multimodal Large Language Model
- URL: http://arxiv.org/abs/2411.12287v1
- Date: Tue, 19 Nov 2024 07:16:48 GMT
- Title: CUE-M: Contextual Understanding and Enhanced Search with Multimodal Large Language Model
- Authors: Dongyoung Go, Taesun Whang, Chanhee Lee, Hwayeon Kim, Sunghoon Park, Seunghwan Ji, Dongchan Kim, Young-Bum Kim,
- Abstract summary: This paper introduces Con Understanding and Enhanced Search with MLLM (CUE-M), a novel multimodaltextual search pipeline.
Evaluations on a multimodal Q&A dataset and a public safety benchmark demonstrate that CUE-M outperforms baselines in accuracy, knowledge integration, and safety.
- Score: 9.451422869974557
- License:
- Abstract: The integration of Retrieval-Augmented Generation (RAG) with Multimodal Large Language Models (MLLMs) has expanded the scope of multimodal query resolution. However, current systems struggle with intent understanding, information retrieval, and safety filtering, limiting their effectiveness. This paper introduces Contextual Understanding and Enhanced Search with MLLM (CUE-M), a novel multimodal search pipeline that addresses these challenges through a multi-stage framework comprising image context enrichment, intent refinement, contextual query generation, external API integration, and relevance-based filtering. CUE-M incorporates a robust safety framework combining image-based, text-based, and multimodal classifiers, dynamically adapting to instance- and category-specific risks. Evaluations on a multimodal Q&A dataset and a public safety benchmark demonstrate that CUE-M outperforms baselines in accuracy, knowledge integration, and safety, advancing the capabilities of multimodal retrieval systems.
Related papers
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization [49.08348604716746]
Multimodal Summarization with Multimodal Output (MSMO) aims to produce a multimodal summary that integrates both text and relevant images.
In this paper, we propose an Entity-Guided Multimodal Summarization model (EGMS)
Our model, building on BART, utilizes dual multimodal encoders with shared weights to process text-image and entity-image information concurrently.
arXiv Detail & Related papers (2024-08-06T12:45:56Z) - A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks [74.52259252807191]
Multimodal Large Language Models (MLLMs) address the complexities of real-world applications far beyond the capabilities of single-modality systems.
This paper systematically sorts out the applications of MLLM in multimodal tasks such as natural language, vision, and audio.
arXiv Detail & Related papers (2024-08-02T15:14:53Z) - UniMEL: A Unified Framework for Multimodal Entity Linking with Large Language Models [0.42832989850721054]
Multimodal Entities Linking (MEL) is a crucial task that aims at linking ambiguous mentions within multimodal contexts to referent entities in a multimodal knowledge base, such as Wikipedia.
Existing methods overcomplicate the MEL task and overlook the visual semantic information, which makes them costly and hard to scale.
We propose UniMEL, a unified framework which establishes a new paradigm to process multimodal entity linking tasks using Large Language Models.
arXiv Detail & Related papers (2024-07-23T03:58:08Z) - Fine-tuning Multimodal Large Language Models for Product Bundling [53.01642741096356]
We introduce Bundle-MLLM, a novel framework that fine-tunes large language models (LLMs) through a hybrid item tokenization approach.
Specifically, we integrate textual, media, and relational data into a unified tokenization, introducing a soft separation token to distinguish between textual and non-textual tokens.
We propose a progressive optimization strategy that fine-tunes LLMs for disentangled objectives: 1) learning bundle patterns and 2) enhancing multimodal semantic understanding specific to product bundling.
arXiv Detail & Related papers (2024-07-16T13:30:14Z) - An Interactive Multi-modal Query Answering System with Retrieval-Augmented Large Language Models [21.892975397847316]
We present an interactive Multi-modal Query Answering (MQA) system, empowered by our newly developed multi-modal retrieval framework and navigation graph index.
One notable aspect of MQA is its utilization of contrastive learning to assess the significance of different modalities.
The system achieves efficient retrieval through our advanced navigation graph index, refined using computational pruning techniques.
arXiv Detail & Related papers (2024-07-05T02:01:49Z) - Large AI Model Empowered Multimodal Semantic Communications [48.73159237649128]
We propose a Large AI Model-based Multimodal SC (LAMMSC) framework.
We first present the Conditional-based Multimodal Alignment (MMA) that enables the transformation between multimodal and unimodal data.
Then, a personalized LLM-based Knowledge Base (LKB) is proposed, which allows users to perform personalized semantic extraction or recovery.
Finally, we apply the Generative adversarial network-based channel Estimation (CGE) for estimating the wireless channel state information.
arXiv Detail & Related papers (2023-09-03T19:24:34Z) - MHMS: Multimodal Hierarchical Multimedia Summarization [80.18786847090522]
We propose a multimodal hierarchical multimedia summarization (MHMS) framework by interacting visual and language domains.
Our method contains video and textual segmentation and summarization module, respectively.
It formulates a cross-domain alignment objective with optimal transport distance to generate the representative and textual summary.
arXiv Detail & Related papers (2022-04-07T21:00:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.