Generative Cross-Modal Retrieval: Memorizing Images in Multimodal Language Models for Retrieval and Beyond
- URL: http://arxiv.org/abs/2402.10805v1
- Date: Fri, 16 Feb 2024 16:31:46 GMT
- Title: Generative Cross-Modal Retrieval: Memorizing Images in Multimodal Language Models for Retrieval and Beyond
- Authors: Yongqi Li, Wenjie Wang, Leigang Qu, Liqiang Nie, Wenjie Li, Tat-Seng Chua
- Abstract summary: We introduce a generative cross-modal retrieval framework, which assigns unique identifier strings to represent images.
By memorizing images in MLLMs, we introduce a new paradigm to cross-modal retrieval, distinct from previous discriminative approaches.
- Score: 99.73306923465424
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent advancements in generative language models have demonstrated their
ability to memorize knowledge from documents and recall knowledge to respond to
user queries effectively. Building upon this capability, we propose to enable
multimodal large language models (MLLMs) to memorize and recall images within
their parameters. Given a user query for visual content, the MLLM is
anticipated to "recall" the relevant image from its parameters as the response.
Achieving this target presents notable challenges, including inbuilt visual
memory and visual recall schemes within MLLMs. To address these challenges, we
introduce a generative cross-modal retrieval framework, which assigns unique
identifier strings to represent images and involves two training steps:
learning to memorize and learning to retrieve. The first step focuses on
training the MLLM to memorize the association between images and their
respective identifiers. The latter step teaches the MLLM to generate the
corresponding identifier of the target image, given the textual query input. By
memorizing images in MLLMs, we introduce a new paradigm to cross-modal
retrieval, distinct from previous discriminative approaches. The experiments
demonstrate that the generative paradigm performs effectively and efficiently
even with large-scale image candidate sets.
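To make the framework concrete, below is a minimal Python sketch of the two training steps and the identifier-based recall described in the abstract. Everything here is an illustrative assumption rather than the paper's implementation: the `mllm.generate` call is a hypothetical API, and the zero-padded identifier scheme merely stands in for whatever identifier design the paper actually uses.

```python
from dataclasses import dataclass

@dataclass
class Example:
    image: str       # path or handle to an image
    identifier: str  # unique identifier string assigned to this image

def assign_identifiers(image_paths):
    """Assign a unique identifier string to each image in the candidate set."""
    return [Example(image=path, identifier=f"img-{i:06d}")
            for i, path in enumerate(image_paths)]

def memorize_pairs(examples):
    """Step 1 -- learning to memorize: (image -> identifier) training pairs,
    so the MLLM stores the association between an image and its identifier."""
    return [(ex.image, ex.identifier) for ex in examples]

def retrieve_pairs(examples, captions):
    """Step 2 -- learning to retrieve: (textual query -> identifier) pairs,
    so the MLLM learns to emit the target image's identifier from text."""
    return [(caption, ex.identifier) for ex, caption in zip(examples, captions)]

def recall_image(mllm, query, examples):
    """Inference: 'recall' an image by generating its identifier, then map
    the identifier back to the image with a plain dictionary lookup."""
    id_to_image = {ex.identifier: ex.image for ex in examples}
    predicted = mllm.generate(query)  # hypothetical generation API
    return id_to_image.get(predicted)
```

One way to read the abstract's efficiency claim is visible in this sketch: query-time cost is dominated by decoding a short identifier string plus a hash lookup, rather than by scoring the query against every image embedding in the candidate set.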
Related papers
- IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model [52.697180472760635]
This paper explores the potential of LVLMs to memorize and recognize character identities across multiple visual scenarios.
We propose visual instruction tuning with ID reference and develop an ID-Aware Large Vision-Language Model, IDA-VLM.
Our research introduces a novel benchmark, MM-ID, to examine LVLMs' memory and recognition of instance IDs across four dimensions.
arXiv Detail & Related papers (2024-07-10T12:11:59Z)
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast, high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
- Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models [17.171715290673678]
We propose an interactive image retrieval system capable of refining queries based on user relevance feedback.
This system incorporates a vision-language model (VLM)-based image captioner to enhance the quality of text-based queries.
To evaluate our system, we curate a new dataset by adapting the MSR-VTT video retrieval dataset to the image retrieval task.
arXiv Detail & Related papers (2024-04-29T14:46:35Z)
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) still have limitations in classifying fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
arXiv Detail & Related papers (2024-03-20T17:59:55Z)
- Déjà Vu Memorization in Vision-Language Models [44.40740575667872]
We propose a new method for measuring memorization in Vision-Language Models (VLMs).
We show that the model indeed retains information about individual objects in the training images beyond what can be inferred from correlations or the image caption.
We evaluate déjà vu memorization at both the sample and population levels, and show that it is significant for OpenCLIP trained on as many as 50M image-caption pairs.
arXiv Detail & Related papers (2024-02-03T09:55:35Z)
- MLLMs-Augmented Visual-Language Representation Learning [70.5293060238008]
We demonstrate that Multi-modal Large Language Models (MLLMs) can enhance visual-language representation learning.
Our approach is simple, utilizing MLLMs to extend multiple diverse captions for each image.
We propose "text shearing" to maintain the quality and availability of extended captions.
arXiv Detail & Related papers (2023-11-30T18:05:52Z)
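As a rough companion to the last entry above, here is a hedged Python sketch of the caption-extension recipe. The `mllm.caption` call is a hypothetical API, and interpreting "text shearing" as trimming each generated caption to roughly the original caption's length is an assumption based only on the one-line summary, not on the paper itself.

```python
def extend_captions(mllm, image, n_rewrites=4):
    """Ask an MLLM for several diverse captions of the same image."""
    prompts = [f"Describe this image (variant {i})." for i in range(n_rewrites)]
    return [mllm.caption(image, prompt) for prompt in prompts]  # hypothetical API

def text_shear(generated, original):
    """Assumed reading of 'text shearing': trim a generated caption to
    roughly the original caption's word count to curb rambling tails."""
    budget = len(original.split())
    return " ".join(generated.split()[:budget])

def augmented_pairs(mllm, image, original_caption):
    """Yield (image, caption) pairs: the original plus sheared MLLM rewrites."""
    yield image, original_caption
    for caption in extend_captions(mllm, image):
        yield image, text_shear(caption, original_caption)
```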
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.