Generative Cross-Modal Retrieval: Memorizing Images in Multimodal
  Language Models for Retrieval and Beyond
- URL: http://arxiv.org/abs/2402.10805v1
- Date: Fri, 16 Feb 2024 16:31:46 GMT
- Title: Generative Cross-Modal Retrieval: Memorizing Images in Multimodal
  Language Models for Retrieval and Beyond
- Authors: Yongqi Li, Wenjie Wang, Leigang Qu, Liqiang Nie, Wenjie Li, Tat-Seng
  Chua
- Abstract summary: We introduce a generative cross-modal retrieval framework, which assigns unique identifier strings to represent images.
By memorizing images in MLLMs, we introduce a new paradigm to cross-modal retrieval, distinct from previous discriminative approaches.
- Score: 99.73306923465424
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:   The recent advancements in generative language models have demonstrated their
ability to memorize knowledge from documents and recall knowledge to respond to
user queries effectively. Building upon this capability, we propose to enable
multimodal large language models (MLLMs) to memorize and recall images within
their parameters. Given a user query for visual content, the MLLM is
anticipated to "recall" the relevant image from its parameters as the response.
Achieving this target presents notable challenges, including inbuilt visual
memory and visual recall schemes within MLLMs. To address these challenges, we
introduce a generative cross-modal retrieval framework, which assigns unique
identifier strings to represent images and involves two training steps:
learning to memorize and learning to retrieve. The first step focuses on
training the MLLM to memorize the association between images and their
respective identifiers. The latter step teaches the MLLM to generate the
corresponding identifier of the target image, given the textual query input. By
memorizing images in MLLMs, we introduce a new paradigm to cross-modal
retrieval, distinct from previous discriminative approaches. The experiments
demonstrate that the generative paradigm performs effectively and efficiently
even with large-scale image candidate sets.
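
The two training steps described in the abstract can be illustrated with a minimal data-preparation sketch. This is not the authors' implementation: the zero-padded numeric identifiers, the `<image:...>` placeholder, and all function and class names below are hypothetical, chosen only to show how "learning to memorize" (image to identifier) and "learning to retrieve" (text query to identifier) could share one set of identifier strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    source: str  # model input: an image placeholder or a textual query
    target: str  # model output: the identifier string of the image


def assign_identifiers(image_paths):
    """Give every image a unique identifier string.

    Assumption: a zero-padded running index is used as the identifier;
    the abstract only states that identifiers are unique strings.
    """
    width = len(str(max(len(image_paths) - 1, 0)))
    return {path: str(i).zfill(width) for i, path in enumerate(image_paths)}


def memorize_examples(identifiers):
    """Step 1 (learning to memorize): image -> identifier pairs."""
    return [
        Example(source=f"<image:{path}>", target=ident)
        for path, ident in identifiers.items()
    ]


def retrieve_examples(query_image_pairs, identifiers):
    """Step 2 (learning to retrieve): textual query -> identifier pairs."""
    return [
        Example(source=query, target=identifiers[path])
        for query, path in query_image_pairs
    ]


if __name__ == "__main__":
    ids = assign_identifiers(["a.jpg", "b.jpg", "c.jpg"])
    for example in memorize_examples(ids):
        print(example)
    for example in retrieve_examples([("a dog on grass", "b.jpg")], ids):
        print(example)
```

In a sketch like this, an MLLM would be fine-tuned first on the memorization pairs and then on the retrieval pairs; at inference time the generated identifier string is mapped back to its image through the same dictionary.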
 
      
Related papers
- MLLM-Guided VLM Fine-Tuning with Joint Inference for Zero-Shot Composed Image Retrieval [50.062817677022586]
 Zero-Shot Composed Image Retrieval (ZS-CIR) methods typically train adapters that convert reference images into pseudo-text tokens.
We propose MLLM-Guided VLM Fine-Tuning with Joint Inference (MVFT-JI) to construct two complementary training tasks using only unlabeled images.
 arXiv  Detail & Related papers  (2025-05-26T08:56:59Z)
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
 We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
 arXiv  Detail & Related papers  (2025-02-18T12:00:47Z)
- When Large Vision-Language Models Meet Person Re-Identification [44.604485649167216]
 We propose LVLM-ReID, a novel framework that harnesses the strengths of LVLMs to promote ReID.
Our framework integrates the semantic understanding and generation capabilities of LVLMs into end-to-end ReID training.
Our method achieves competitive results on multiple benchmarks without additional image-text annotations.
 arXiv  Detail & Related papers  (2024-11-27T07:45:25Z)
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
 Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
 arXiv  Detail & Related papers  (2024-10-18T03:45:19Z)
- CLIP-SCGI: Synthesized Caption-Guided Inversion for Person Re-Identification [9.996589403019675]
 Person re-identification (ReID) has recently benefited from large pretrained vision-language models such as Contrastive Language-Image Pre-Training (CLIP).
We propose one straightforward solution by leveraging existing image captioning models to generate pseudo captions for person images.
We introduce CLIP-SCGI, a framework that leverages synthesized captions to guide the learning of discriminative and robust representations.
 arXiv  Detail & Related papers  (2024-10-12T06:24:33Z)
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
 Multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
 arXiv  Detail & Related papers  (2024-07-05T17:43:30Z)
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
 We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
 arXiv  Detail & Related papers  (2024-03-29T16:26:20Z)
- RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
 Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
 arXiv  Detail & Related papers  (2024-03-20T17:59:55Z)
- Déjà Vu Memorization in Vision-Language Models [39.51189095703773]
 We propose a new method for measuring memorization in Vision-Language Models (VLMs).
We show that the model indeed retains information about individual objects in the training images beyond what can be inferred from correlations or the image caption.
We evaluate déjà vu memorization at both sample and population level, and show that it is significant for OpenCLIP trained on as many as 50M image-caption pairs.
 arXiv  Detail & Related papers  (2024-02-03T09:55:35Z)
- MLLMs-Augmented Visual-Language Representation Learning [70.5293060238008]
 We demonstrate that Multi-modal Large Language Models (MLLMs) can enhance visual-language representation learning.
Our approach is simple, utilizing MLLMs to extend multiple diverse captions for each image.
We propose "text shearing" to maintain the quality and availability of extended captions.
 arXiv  Detail & Related papers  (2023-11-30T18:05:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     