A Survey of Multimodal Composite Editing and Retrieval
- URL: http://arxiv.org/abs/2409.05405v2
- Date: Wed, 11 Sep 2024 02:44:52 GMT
- Title: A Survey of Multimodal Composite Editing and Retrieval
- Authors: Suyan Li, Fuxiang Huang, Lei Zhang
- Abstract summary: This survey is the first comprehensive review of the literature on multimodal composite retrieval.
It covers image-text composite editing, image-text composite retrieval, and other multimodal composite retrieval.
We systematically organize the application scenarios, methods, benchmarks, experiments, and future directions.
- Score: 7.966265020507201
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In the real world, where information is abundant and diverse across different modalities, understanding and utilizing various data types to improve retrieval systems is a key focus of research. Multimodal composite retrieval integrates diverse modalities such as text, image, and audio to provide more accurate, personalized, and contextually relevant results. To facilitate a deeper understanding of this promising direction, this survey explores multimodal composite editing and retrieval in depth, covering image-text composite editing, image-text composite retrieval, and other multimodal composite retrieval. In this survey, we systematically organize the application scenarios, methods, benchmarks, experiments, and future directions. Multimodal learning is a hot topic in the large model era, and several surveys on multimodal learning and on vision-language models with transformers have been published in the PAMI journal. To the best of our knowledge, this survey is the first comprehensive review of the literature on multimodal composite retrieval, which makes it a timely complement to existing reviews on multimodal fusion. To help readers quickly track this field, we have built a project page for this survey, which can be found at https://github.com/fuxianghuang1/Multimodal-Composite-Editing-and-Retrieval.
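As a rough illustration of the image-text composite retrieval setting named in the abstract, the sketch below composes a reference-image embedding with a modification-text embedding and ranks a small gallery by cosine similarity. Everything in it is an assumption made for illustration: the weighted-sum fusion, the `text_weight` parameter, the function names, and the three-dimensional toy embeddings are hypothetical and are not taken from the survey, which reviews many different fusion methods.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (raises on an all-zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def compose_query(image_emb, text_emb, text_weight=0.5):
    """Hypothetical fusion: a weighted sum of the two embeddings.

    This is only the simplest possible stand-in for a learned
    composition module.
    """
    return [
        (1.0 - text_weight) * i + text_weight * t
        for i, t in zip(image_emb, text_emb)
    ]


def rank(query, gallery):
    """Return gallery keys sorted from most to least similar to the query."""
    return sorted(gallery, key=lambda name: cosine(query, gallery[name]), reverse=True)


# Toy embeddings (assumed values, not produced by any real encoder).
reference_image = [1.0, 0.0, 0.0]    # e.g. a photo of a red dress
modification_text = [0.0, 1.0, 0.0]  # e.g. "make it long-sleeved"
gallery = {
    "red_dress": [1.0, 0.0, 0.0],
    "red_long_sleeve_dress": [1.0, 1.0, 0.0],
    "blue_jeans": [0.0, 0.0, 1.0],
}

ranking = rank(compose_query(reference_image, modification_text), gallery)
print(ranking)
```

With these toy values, the candidate that matches both the reference image and the requested modification ranks first, while the unrelated item ranks last. In a real system the embeddings would come from trained image and text encoders.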
Related papers
- MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs [78.5013630951288]
This paper introduces techniques for advancing information retrieval with multimodal large language models (MLLMs).
We first study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16 retrieval tasks.
We propose modality-aware hard negative mining to mitigate the modality bias exhibited by MLLM retrievers.
arXiv Detail & Related papers (2024-11-04T20:06:34Z) - Smart Multi-Modal Search: Contextual Sparse and Dense Embedding Integration in Adobe Express [3.8973445113342433]
Building a scalable multi-modal search system requires fine-tuning several components.
We address considerations such as embedding model selection, the roles of embeddings in matching and ranking, and the balance between dense and sparse embeddings.
arXiv Detail & Related papers (2024-08-26T23:52:27Z) - Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection [72.36017150922504]
We propose a multi-modal contextual knowledge distillation framework, MMC-Det, to transfer the learned contextual knowledge from a teacher fusion transformer to a student detector.
The diverse multi-modal masked language modeling is realized by imposing an object divergence constraint on traditional multi-modal masked language modeling (MLM).
arXiv Detail & Related papers (2023-08-30T08:33:13Z) - Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications [47.501121601856795]
Multimodality Representation Learning is a technique of learning to embed information from different modalities and their correlations.
Cross-modal interaction and complementary information from different modalities are crucial for advanced models to perform any multimodal task.
This survey presents the literature on the evolution and enhancement of deep learning multimodal architectures.
arXiv Detail & Related papers (2023-02-01T11:48:34Z) - Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose a novel Multi-modal Retrieval based framework (MoRe).
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve knowledge related to the input text and image, respectively, from the knowledge corpus.
arXiv Detail & Related papers (2022-12-03T13:11:32Z) - Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion [112.27103169303184]
Multimodal Knowledge Graphs (MKGs) organize visual-text factual knowledge.
MKGformer can obtain SOTA performance on four datasets of multimodal link prediction, multimodal RE, and multimodal NER.
arXiv Detail & Related papers (2022-05-04T23:40:04Z) - Multimodal Image Synthesis and Editing: The Generative AI Era [131.9569600472503]
Multimodal image synthesis and editing has become a hot research topic in recent years.
We comprehensively contextualize recent advances in multimodal image synthesis and editing.
We describe benchmark datasets and evaluation metrics as well as corresponding experimental results.
arXiv Detail & Related papers (2021-12-27T10:00:16Z) - Multi-modal Summarization for Video-containing Documents [23.750585762568665]
We propose a novel multi-modal summarization task to summarize from a document and its associated video.
Comprehensive experiments show that the proposed model is beneficial for multi-modal summarization and superior to existing methods.
arXiv Detail & Related papers (2020-09-17T02:13:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.