Probabilistic Compositional Embeddings for Multimodal Image Retrieval
- URL: http://arxiv.org/abs/2204.05845v1
- Date: Tue, 12 Apr 2022 14:45:37 GMT
- Title: Probabilistic Compositional Embeddings for Multimodal Image Retrieval
- Authors: Andrei Neculai, Yanbei Chen, Zeynep Akata
- Abstract summary: We investigate a more challenging scenario for composing multiple multimodal queries in image retrieval.
Given an arbitrary number of query images and/or texts, our goal is to retrieve target images containing the semantic concepts specified in multiple multimodal queries.
We propose a novel multimodal probabilistic composer (MPC) to learn an informative embedding that can flexibly encode the semantics of various queries.
- Score: 48.450232527041436
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing works in image retrieval often consider retrieving images with one
or two query inputs, which do not generalize to multiple queries. In this work,
we investigate a more challenging scenario for composing multiple multimodal
queries in image retrieval. Given an arbitrary number of query images and/or
texts, our goal is to retrieve target images containing the semantic concepts
specified in multiple multimodal queries. To learn an informative embedding
that can flexibly encode the semantics of various queries, we propose a novel
multimodal probabilistic composer (MPC). Specifically, we model input images
and texts as probabilistic embeddings, which can be further composed by a
probabilistic composition rule to facilitate image retrieval with multiple
multimodal queries. We propose a new benchmark based on the MS-COCO dataset and
evaluate our model on various setups that compose multiple image and/or text
queries for multimodal image retrieval. Without bells and whistles, we show
that our probabilistic model formulation significantly outperforms existing
related methods on multimodal image retrieval while generalizing well to queries
with different numbers of inputs given in arbitrary visual and/or textual
modalities. Code is available here: https://github.com/andreineculai/MPC.
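The abstract does not spell out the composition rule, so the sketch below is only a rough, hedged illustration of how probabilistic query embeddings can be composed: it assumes each image or text query is already encoded as a diagonal-covariance Gaussian and fuses an arbitrary number of them with a precision-weighted product of Gaussians. The function names (compose_gaussians, rank_targets) and the product rule are illustrative assumptions, not the exact MPC formulation from the paper.
```python
# Illustrative sketch (not the exact MPC formulation): compose an arbitrary
# number of query embeddings, each modeled as a diagonal Gaussian, via a
# precision-weighted product of Gaussians, then rank target images by the
# similarity between the composed mean and their (deterministic) features.
import torch
import torch.nn.functional as F

def compose_gaussians(mus, sigmas, eps=1e-8):
    """Product-of-Gaussians composition of N diagonal Gaussians.

    mus, sigmas: (N, D) tensors holding the mean and standard deviation of
    each query embedding (image or text). Returns the composed mean and
    standard deviation, both of shape (D,).
    """
    precisions = 1.0 / (sigmas ** 2 + eps)        # (N, D)
    combined_var = 1.0 / precisions.sum(dim=0)    # (D,)
    combined_mu = combined_var * (precisions * mus).sum(dim=0)
    return combined_mu, combined_var.sqrt()

def rank_targets(composed_mu, target_feats):
    """Rank target images by cosine similarity to the composed query mean."""
    q = F.normalize(composed_mu, dim=-1)          # (D,)
    t = F.normalize(target_feats, dim=-1)         # (num_targets, D)
    return (t @ q).argsort(descending=True)

# Toy usage: two image queries and one text query, each a 128-d Gaussian.
mus, sigmas = torch.randn(3, 128), torch.rand(3, 128) + 0.1
composed_mu, _ = compose_gaussians(mus, sigmas)
ranking = rank_targets(composed_mu, torch.randn(1000, 128))
```
A product-style rule is a natural fit for this setting because every additional query can only tighten the composed distribution, mirroring the intuition that each extra image or text adds a constraint the retrieved image must satisfy.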
Related papers
- MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs [78.5013630951288]
This paper introduces techniques for advancing information retrieval with multimodal large language models (MLLMs).
We first study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16 retrieval tasks.
We propose modality-aware hard negative mining to mitigate the modality bias exhibited by MLLM retrievers.
arXiv Detail & Related papers (2024-11-04T20:06:34Z) - Localizing Events in Videos with Multimodal Queries [71.40602125623668]
We introduce a new benchmark, ICQ, for localizing events in videos with multimodal queries.
We include 4 styles of reference images and 5 types of refinement texts, allowing us to explore model performance across different domains.
arXiv Detail & Related papers (2024-06-14T14:35:58Z) - Learning Comprehensive Representations with Richer Self for
Text-to-Image Person Re-Identification [34.289949134802086]
Text-to-image person re-identification (TIReID) retrieves pedestrian images of the same identity based on a query text.
Existing methods for TIReID typically treat it as a one-to-one image-text matching problem, only focusing on the relationship between image-text pairs within a view.
We propose a framework, called LCR$^2$S, for modeling many-to-many correspondences of the same identity by learning representations for both modalities from a novel perspective.
arXiv Detail & Related papers (2023-10-17T12:39:16Z) - End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries.
We introduce a retriever model, ReViz, that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z) - Self-supervised Multi-view Disentanglement for Expansion of Visual
Collections [6.944742823561]
We consider the setting where a query for similar images is derived from a collection of images.
For visual search, the similarity measurements may be made along multiple axes, or views, such as style and color.
Our objective is to design a retrieval algorithm that effectively combines similarities computed over representations from multiple views.
arXiv Detail & Related papers (2023-02-04T22:09:17Z) - Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose a novel Multi-modal Retrieval based framework (MoRe)
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve related knowledge of the input text and image in the knowledge corpus respectively.
arXiv Detail & Related papers (2022-12-03T13:11:32Z) - Progressive Learning for Image Retrieval with Hybrid-Modality Queries [48.79599320198615]
Image retrieval with hybrid-modality queries is also known as composing text and image for image retrieval (CTI-IR).
We decompose the CTI-IR task into a three-stage learning problem to progressively learn the complex knowledge for image retrieval with hybrid-modality queries.
Our proposed model significantly outperforms state-of-the-art methods in mean Recall@K by 24.9% and 9.5% on the Fashion-IQ and Shoes benchmark datasets, respectively.
arXiv Detail & Related papers (2022-04-24T08:10:06Z) - ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and
Implicit Similarity [16.550790981646276]
Current approaches combine the features of each of the two elements of the query into a single representation.
Our work aims at shedding new light on the task by looking at it through the prism of two familiar and related frameworks: text-to-image and image-to-image retrieval.
arXiv Detail & Related papers (2022-03-15T17:29:20Z) - Probabilistic Embeddings for Cross-Modal Retrieval [38.04859099157609]
Cross-modal retrieval methods build a common representation space for samples from multiple modalities.
In this paper, we argue that deterministic functions are not sufficiently powerful to capture such one-to-many correspondences.
Instead, we propose to use Probabilistic Cross-Modal Embedding (PCME), where samples from the different modalities are represented as probabilistic distributions in the common embedding space.
arXiv Detail & Related papers (2021-01-13T13:58:00Z)
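To make the probabilistic-embedding idea behind PCME (and, by extension, the probabilistic formulation used in MPC) concrete, here is a minimal sketch in which each image or caption is a diagonal Gaussian in a shared space and cross-modal similarity is a Monte Carlo estimate over sampled embeddings. The sigmoid-of-cosine match score and the sampling scheme are simplified assumptions, not PCME's exact soft contrastive loss.
```python
# Simplified illustration of probabilistic cross-modal embeddings in the
# spirit of PCME: each image/text is a diagonal Gaussian in a shared space,
# and cross-modal similarity is a Monte Carlo estimate over sampled pairs.
import torch
import torch.nn.functional as F

def sample_embeddings(mu, sigma, num_samples=8):
    """Draw reparameterized samples from N(mu, diag(sigma^2)); shape (num_samples, D)."""
    return mu + sigma * torch.randn(num_samples, *mu.shape)

def match_probability(mu_a, sigma_a, mu_b, sigma_b, scale=10.0, num_samples=8):
    """Average sigmoid of scaled cosine similarity over all sampled pairs.

    A simplified stand-in for PCME's soft contrastive match probability.
    """
    za = F.normalize(sample_embeddings(mu_a, sigma_a, num_samples), dim=-1)
    zb = F.normalize(sample_embeddings(mu_b, sigma_b, num_samples), dim=-1)
    return torch.sigmoid(scale * (za @ zb.T)).mean()

# Toy usage: one 128-d image embedding and one 128-d caption embedding.
p = match_probability(torch.randn(128), torch.rand(128) + 0.1,
                      torch.randn(128), torch.rand(128) + 0.1)
```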
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.