Probabilistic Embeddings for Cross-Modal Retrieval
- URL: http://arxiv.org/abs/2101.05068v1
- Date: Wed, 13 Jan 2021 13:58:00 GMT
- Title: Probabilistic Embeddings for Cross-Modal Retrieval
- Authors: Sanghyuk Chun, Seong Joon Oh, Rafael Sampaio de Rezende, Yannis Kalantidis, Diane Larlus
- Abstract summary: Cross-modal retrieval methods build a common representation space for samples from multiple modalities.
In this paper, we argue that deterministic functions are not sufficiently powerful to capture such one-to-many correspondences.
Instead, we propose to use Probabilistic Cross-Modal Embedding (PCME), where samples from the different modalities are represented as probabilistic distributions in the common embedding space.
- Score: 38.04859099157609
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-modal retrieval methods build a common representation space for samples
from multiple modalities, typically from the vision and the language domains.
For images and their captions, the multiplicity of the correspondences makes
the task particularly challenging. Given an image (respectively a caption),
there are multiple captions (respectively images) that equally make sense. In
this paper, we argue that deterministic functions are not sufficiently powerful
to capture such one-to-many correspondences. Instead, we propose to use
Probabilistic Cross-Modal Embedding (PCME), where samples from the different
modalities are represented as probabilistic distributions in the common
embedding space. Since common benchmarks such as COCO suffer from
non-exhaustive annotations for cross-modal matches, we propose to additionally
evaluate retrieval on the CUB dataset, a smaller yet clean database where all
possible image-caption pairs are annotated. We extensively ablate PCME and
demonstrate that it not only improves the retrieval performance over its
deterministic counterpart, but also provides uncertainty estimates that render
the embeddings more interpretable.
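As a concrete illustration of the abstract's central idea, the sketch below shows one common way to realise probabilistic cross-modal embeddings: each encoder gets a mean head and a log-variance head, embedding samples are drawn with the reparameterisation trick, and the match probability is a sigmoid of a scaled distance averaged over sampled pairs. This is a minimal reading of the abstract, not the authors' released code; the class name, the sample count, and the constants a and b are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ProbabilisticHead(nn.Module):
    """Map a backbone feature to a Gaussian in the shared embedding space."""
    def __init__(self, feat_dim, embed_dim):
        super().__init__()
        self.mu = nn.Linear(feat_dim, embed_dim)       # mean head
        self.logvar = nn.Linear(feat_dim, embed_dim)   # log-variance head

    def forward(self, feat, n_samples=8):
        mu, logvar = self.mu(feat), self.logvar(feat)
        std = (0.5 * logvar).exp()
        eps = torch.randn(n_samples, *mu.shape)        # reparameterisation trick
        return mu.unsqueeze(0) + std.unsqueeze(0) * eps  # (n_samples, B, D)

def match_probability(z_img, z_txt, a=10.0, b=0.0):
    """Monte-Carlo match probability: a sigmoid of scaled, negated distance,
    averaged over all (image sample, caption sample) pairs."""
    d = torch.cdist(z_img, z_txt)                      # (n, m) pairwise distances
    return torch.sigmoid(-a * d + b).mean()

# toy usage: one image feature against one caption feature
img_head, txt_head = ProbabilisticHead(2048, 512), ProbabilisticHead(768, 512)
z_i = img_head(torch.randn(1, 2048))[:, 0]             # (8, 512) image samples
z_t = txt_head(torch.randn(1, 768))[:, 0]              # (8, 512) caption samples
print(match_probability(z_i, z_t))                     # scalar in (0, 1)
```

Averaging over sampled pairs, rather than comparing two fixed points, is what lets a single image match several distinct captions with high probability at once, which is the one-to-many behaviour the abstract argues for.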
Related papers
- FlowSDF: Flow Matching for Medical Image Segmentation Using Distance Transforms [60.195642571004804]
We propose FlowSDF, an image-guided conditional flow matching framework to represent the signed distance function (SDF).
By learning a vector field that is directly related to the probability path of a conditional distribution of SDFs, we can accurately sample from the distribution of segmentation masks.
arXiv Detail & Related papers (2024-05-28T11:47:12Z)
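The flow-matching objective the FlowSDF summary alludes to can be sketched in a few lines: sample a point on a probability path between noise and a target SDF map, then regress the network's vector field onto that path's velocity. The linear path and the v_net and cond arguments are assumptions made for illustration, not FlowSDF's actual implementation.

```python
import torch

def conditional_flow_matching_loss(v_net, x0, x1, cond):
    """One training step of generic conditional flow matching (illustrative).
    x0: noise, x1: target sample (e.g. an SDF map), cond: the guiding image."""
    t = torch.rand(x0.size(0), 1, 1, 1)    # per-sample time t in [0, 1)
    xt = (1 - t) * x0 + t * x1             # point on a linear probability path
    velocity = x1 - x0                     # that path's (constant) velocity
    return ((v_net(xt, t, cond) - velocity) ** 2).mean()
```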
- DEMO: A Statistical Perspective for Efficient Image-Text Matching [32.256725860652914]
We introduce Distribution-based Structure Mining with Consistency Learning (DEMO) for efficient image-text matching.
DEMO characterizes each image by multiple augmented views, treated as samples drawn from its intrinsic semantic distribution.
In addition, we introduce collaborative consistency learning, which not only preserves the similarity structure in the Hamming space but also encourages consistency between retrieval distributions from different directions.
arXiv Detail & Related papers (2024-05-19T09:38:56Z)
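A minimal sketch of the view-sampling idea in the DEMO summary, assuming tanh-relaxed hash codes: several augmented views of an image stand in for samples from its semantic distribution, and their Hamming-space similarity to a text code is averaged. The function and shapes are illustrative, not the paper's code.

```python
import torch

def expected_hamming_similarity(view_codes, txt_code):
    """Average relaxed Hamming-space similarity between the codes of several
    augmented views of one image and a single text code (illustrative).
    view_codes: (V, L) tanh-relaxed codes in [-1, 1], txt_code: (L,)."""
    # for +-1 codes, the inner product is an affine function of Hamming distance
    return (view_codes * txt_code).sum(-1).mean() / view_codes.size(-1)

# toy usage with random relaxed codes
print(expected_hamming_similarity(torch.randn(4, 64).tanh(), torch.randn(64).tanh()))
```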
- ProTA: Probabilistic Token Aggregation for Text-Video Retrieval [15.891020334480826]
We propose Probabilistic Token Aggregation (ProTA) to handle cross-modal interaction with content asymmetry.
ProTA achieves significant improvements on MSR-VTT (50.9%), LSMDC (25.8%), and DiDeMo (47.2%).
arXiv Detail & Related papers (2024-04-18T14:20:30Z)
- ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models [69.50316788263433]
We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained vision-language models.
We quantify the calibration of embedding uncertainties in retrieval tasks and show that ProbVLM outperforms other methods.
We present a novel technique for visualizing the embedding distributions using a large-scale pre-trained latent diffusion model.
arXiv Detail & Related papers (2023-07-01T18:16:06Z)
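The adapter idea in the ProbVLM summary can be sketched as a small head over a frozen embedding that predicts a mean and a scale, trained with a likelihood objective. A simple Gaussian stands in here for whatever output distribution the paper actually uses, and all names are illustrative.

```python
import torch
import torch.nn as nn

class ProbAdapter(nn.Module):
    """A small MLP over a frozen encoder's embedding that predicts a mean and
    a scale, in the spirit of ProbVLM (the paper's exact output distribution
    may differ; this Gaussian version is a sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, 2 * dim))

    def forward(self, frozen_embed):
        mu, log_scale = self.net(frozen_embed).chunk(2, dim=-1)
        return mu, log_scale.exp()

def gaussian_nll(mu, scale, target):
    # negative log-likelihood of the paired modality's (frozen) embedding under
    # the predicted distribution; large predicted scales flag uncertain inputs
    return (0.5 * ((target - mu) / scale) ** 2 + scale.log()).mean()
```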
- Uncertainty-based Cross-Modal Retrieval with Probabilistic Representations [18.560958487332265]
Probabilistic embeddings have proven useful for capturing polysemous word meanings, as well as ambiguity in image matching.
We propose a simple approach that replaces the standard point-vector embeddings in existing image-text matching models with parametrically learned probabilistic distributions.
arXiv Detail & Related papers (2022-04-20T07:24:20Z)
- Probabilistic Compositional Embeddings for Multimodal Image Retrieval [48.450232527041436]
We investigate the more challenging scenario of composing multiple multimodal queries in image retrieval.
Given an arbitrary number of query images and/or texts, our goal is to retrieve target images containing the semantic concepts specified in the multiple multimodal queries.
We propose a novel multimodal probabilistic composer (MPC) to learn an informative embedding that can flexibly encode the semantics of various queries.
arXiv Detail & Related papers (2022-04-12T14:45:37Z)
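One simple way to compose several probabilistic query embeddings into a single distribution, in the spirit of the summary above, is a product of Gaussian experts, i.e. precision-weighted averaging. MPC's learned composer is more elaborate, so treat this purely as a sketch.

```python
import torch

def compose_gaussians(mus, logvars):
    """Fuse per-query Gaussian embeddings into one query distribution via a
    product of experts (illustrative composer, not MPC's exact operator).
    mus, logvars: lists of (D,) tensors, one pair per image/text query."""
    precisions = (-torch.stack(logvars)).exp()        # 1 / sigma^2 per query
    total_precision = precisions.sum(0)
    mu = (precisions * torch.stack(mus)).sum(0) / total_precision
    return mu, -total_precision.log()                 # fused mean, log-variance
```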
- Cross-Modality Sub-Image Retrieval using Contrastive Multimodal Image Representations [3.3754780158324564]
Cross-modality image retrieval is challenging, since images of similar (or even the same) content captured by different modalities might share few common structures.
We propose a new application-independent content-based image retrieval system for reverse (sub-)image search across modalities.
arXiv Detail & Related papers (2022-01-10T19:04:28Z)
- Exploring Set Similarity for Dense Self-supervised Representation Learning [96.35286140203407]
We propose to explore set similarity (SetSim) for dense self-supervised representation learning.
We generalize pixel-wise similarity learning to a set-wise formulation, which improves robustness because sets carry more semantic and structural information.
Specifically, by resorting to attentional features of views, we establish corresponding sets, thus filtering out noisy backgrounds that may cause incorrect correspondences.
arXiv Detail & Related papers (2021-07-19T09:38:27Z)
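A hedged sketch of set-wise similarity as the summary describes it: keep each view's most-attended spatial features, discarding likely background, and score the two resulting sets with a symmetric best-match cosine similarity. The top-k selection and the Chamfer-style score are illustrative choices, not necessarily SetSim's.

```python
import torch
import torch.nn.functional as F

def set_similarity(feat_a, feat_b, attn_a, attn_b, k=16):
    """Symmetric best-match cosine similarity between two attended feature
    sets (illustrative). feat_*: (N, D) spatial features with N >= k,
    attn_*: (N,) attention weights used to filter noisy background."""
    a = F.normalize(feat_a[attn_a.topk(k).indices], dim=-1)
    b = F.normalize(feat_b[attn_b.topk(k).indices], dim=-1)
    sim = a @ b.t()                                   # (k, k) cosine matrix
    return 0.5 * (sim.max(dim=1).values.mean() + sim.max(dim=0).values.mean())
```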
- Prototype Mixture Models for Few-shot Semantic Segmentation [50.866870384596446]
Few-shot segmentation is challenging because objects within the support and query images could significantly differ in appearance and pose.
We propose prototype mixture models (PMMs), which correlate diverse image regions with multiple prototypes to enforce the prototype-based semantic representation.
PMMs improve 5-shot segmentation performance on MS-COCO by up to 5.82% with only a moderate cost for model size and inference speed.
arXiv Detail & Related papers (2020-08-10T04:33:17Z)
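The prototype-mixture idea above can be sketched as a few soft k-means (EM-style) iterations that summarise the support features with several prototypes instead of a single averaged one. The hyper-parameters and the cosine E-step are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def mixture_prototypes(support_feats, n_proto=3, n_iter=10, temp=20.0):
    """Summarise support-region features with several prototypes via a few
    EM-style (soft k-means) iterations (illustrative sketch of the PMM idea).
    support_feats: (N, D) foreground features from the support image."""
    protos = support_feats[torch.randperm(support_feats.size(0))[:n_proto]]
    for _ in range(n_iter):
        # E-step: soft-assign features to prototypes by cosine similarity
        logits = F.normalize(support_feats, dim=-1) @ F.normalize(protos, dim=-1).t()
        resp = (temp * logits).softmax(dim=-1)        # (N, n_proto)
        # M-step: prototypes become responsibility-weighted feature means
        protos = (resp.t() @ support_feats) / resp.sum(0, keepdim=True).t()
    return protos                                     # (n_proto, D)
```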