Composed Multi-modal Retrieval: A Survey of Approaches and Applications
- URL: http://arxiv.org/abs/2503.01334v1
- Date: Mon, 03 Mar 2025 09:18:43 GMT
- Title: Composed Multi-modal Retrieval: A Survey of Approaches and Applications
- Authors: Kun Zhang, Jingyu Li, Zhe Li, Jingjing Zhang,
- Abstract summary: Composed Multi-modal Retrieval (CMR) enables users to retrieve images or videos by integrating a reference visual input with textual modifications.<n>CMR is poised to become a pivotal technology in next-generation retrieval systems.
- Score: 17.316062338546544
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid growth of multi-modal data from social media, short video platforms, and e-commerce, content-based retrieval has become essential for efficiently searching and utilizing heterogeneous information. Over time, retrieval techniques have evolved from Unimodal Retrieval (UR) to Cross-modal Retrieval (CR) and, more recently, to Composed Multi-modal Retrieval (CMR). CMR enables users to retrieve images or videos by integrating a reference visual input with textual modifications, enhancing search flexibility and precision. This paper provides a comprehensive review of CMR, covering its fundamental challenges, technical advancements, and categorization into supervised, zero-shot, and semi-supervised learning paradigms. We discuss key research directions, including data augmentation, model architecture, and loss optimization in supervised CMR, as well as transformation frameworks and external knowledge integration in zero-shot CMR. Additionally, we highlight the application potential of CMR in composed image retrieval, video retrieval, and person retrieval, which have significant implications for e-commerce, online search, and public security. Given its ability to refine and personalize search experiences, CMR is poised to become a pivotal technology in next-generation retrieval systems. A curated list of related works and resources is available at: https://github.com/kkzhang95/Awesome-Composed-Multi-modal-Retrieval
Related papers
- MultiConIR: Towards multi-condition Information Retrieval [57.6405602406446]
We introduce MultiConIR, the first benchmark designed to evaluate retrieval models in multi-condition scenarios.
We propose three tasks to assess retrieval and reranking models on multi-condition robustness, monotonic relevance ranking, and query format sensitivity.
arXiv Detail & Related papers (2025-03-11T05:02:03Z) - Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation [2.549112678136113]
Retrieval-Augmented Generation (RAG) mitigates issues by integrating external dynamic information enhancing factual and updated grounding.<n>Cross-modal alignment and reasoning introduce unique challenges to Multimodal RAG, distinguishing it from traditional unimodal RAG.<n>This survey lays the foundation for developing more capable and reliable AI systems.
arXiv Detail & Related papers (2025-02-12T22:33:41Z) - A Survey on Multimodal Recommender Systems: Recent Advances and Future Directions [16.652996189513658]
This paper comprehensively reviews recent research advancements in Multimodal Recommender Systems.
We introduce the existing MRS models by categorizing them into four key areas: Feature Extraction, Multimodal Fusion, and Loss Function.
We hope to contribute to developing a more sophisticated and effective multimodal recommender system.
arXiv Detail & Related papers (2025-01-22T12:00:35Z) - Retrieval-Enhanced Machine Learning: Synthesis and Opportunities [60.34182805429511]
Retrieval-enhancement can be extended to a broader spectrum of machine learning (ML)
This work introduces a formal framework of this paradigm, Retrieval-Enhanced Machine Learning (REML), by synthesizing the literature in various domains in ML with consistent notations which is missing from the current literature.
The goal of this work is to equip researchers across various disciplines with a comprehensive, formally structured framework of retrieval-enhanced models, thereby fostering interdisciplinary future research.
arXiv Detail & Related papers (2024-07-17T20:01:21Z) - An Interactive Multi-modal Query Answering System with Retrieval-Augmented Large Language Models [21.892975397847316]
We present an interactive Multi-modal Query Answering (MQA) system, empowered by our newly developed multi-modal retrieval framework and navigation graph index.
One notable aspect of MQA is its utilization of contrastive learning to assess the significance of different modalities.
The system achieves efficient retrieval through our advanced navigation graph index, refined using computational pruning techniques.
arXiv Detail & Related papers (2024-07-05T02:01:49Z) - Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits Multimodal Reasoning [49.3242278912771]
We introduce a novel multimodal RAG framework named RMR (Retrieval Meets Reasoning)
The RMR framework employs a bi-modal retrieval module to identify the most relevant question-answer pairs.
It significantly boosts the performance of various vision-language models across a spectrum of benchmark datasets.
arXiv Detail & Related papers (2024-05-31T14:23:49Z) - End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries.
We introduce a retriever model ReViz'' that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z) - Synergistic Interplay between Search and Large Language Models for
Information Retrieval [141.18083677333848]
InteR allows RMs to expand knowledge in queries using LLM-generated knowledge collections.
InteR achieves overall superior zero-shot retrieval performance compared to state-of-the-art methods.
arXiv Detail & Related papers (2023-05-12T11:58:15Z) - Multimodal Recommender Systems: A Survey [50.23505070348051]
Multimodal Recommender System (MRS) has attracted much attention from both academia and industry recently.
In this paper, we will give a comprehensive survey of the MRS models, mainly from technical views.
To access more details of the surveyed papers, such as implementation code, we open source a repository.
arXiv Detail & Related papers (2023-02-08T05:12:54Z) - A Comprehensive Empirical Study of Vision-Language Pre-trained Model for
Supervised Cross-Modal Retrieval [19.2650103482509]
Cross-Modal Retrieval (CMR) is an important research topic across multimodal computing and information retrieval.
We take CLIP as the current representative vision-language pre-trained model to conduct a comprehensive empirical study.
We propose a novel model CLIP4CMR that employs pre-trained CLIP as backbone network to perform supervised CMR.
arXiv Detail & Related papers (2022-01-08T06:00:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.