Cross-Modal Coordination Across a Diverse Set of Input Modalities
- URL: http://arxiv.org/abs/2401.16347v1
- Date: Mon, 29 Jan 2024 17:53:25 GMT
- Title: Cross-Modal Coordination Across a Diverse Set of Input Modalities
- Authors: Jorge Sánchez and Rodrigo Laguna
- Abstract summary: Cross-modal retrieval is the task of retrieving samples of a given modality by using queries of a different one.
This paper proposes two approaches to the problem: the first is based on an extension of the CLIP contrastive objective to an arbitrary number of input modalities.
The second departs from the contrastive formulation and tackles the coordination problem by regressing the cross-modal similarities towards a target.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-modal retrieval is the task of retrieving samples of a given modality by using queries of a different one. Due to the wide range of practical applications, research has focused mainly on the vision and language case, e.g. text-to-image retrieval, where models like CLIP have proven effective. The dominant approach to learning such coordinated representations consists of projecting the different modalities onto a common space where matching views stay close and non-matching ones are pushed away from each other. Although this cross-modal coordination has also been applied to other pairwise combinations, extending it to an arbitrary number of diverse modalities has not been fully explored in the literature. In this paper, we propose two different approaches to the problem. The first is based on an extension of the CLIP contrastive objective to an arbitrary number of input modalities, while the second departs from the contrastive formulation and tackles the coordination problem by regressing the cross-modal similarities towards a target that reflects two simple and intuitive constraints of the cross-modal retrieval task. We run experiments on two different datasets, over different combinations of input modalities, and show that the approaches are not only simple and effective but also allow the retrieval problem to be tackled in novel ways. Besides capturing a more diverse set of pair-wise interactions, we show that the learned representations can be used to improve retrieval performance by combining the embeddings from two or more such modalities.
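The abstract describes the two objectives only at a high level, so the following is a minimal sketch of what they might look like, assuming CLIP-style L2-normalized embeddings, a uniform average over modality pairs, and a 0/1 similarity target for the regression variant; none of these choices are confirmed by the paper.

```python
# Hedged sketch, not the paper's implementation. Assumed ingredients:
# L2-normalized embeddings, a uniform average over modality pairs, and a
# 0/1 similarity target for the regression variant.
import itertools
import torch
import torch.nn.functional as F

def multimodal_clip_loss(embeddings, temperature=0.07):
    """CLIP-style InfoNCE loss averaged over every ordered pair of modalities.

    embeddings: list of [batch, dim] tensors, one per modality; row i of every
    tensor is assumed to describe the same underlying sample.
    """
    labels = torch.arange(embeddings[0].size(0))
    losses = []
    for a, b in itertools.permutations(range(len(embeddings)), 2):
        za = F.normalize(embeddings[a], dim=-1)
        zb = F.normalize(embeddings[b], dim=-1)
        logits = za @ zb.t() / temperature            # [batch, batch] similarities
        losses.append(F.cross_entropy(logits, labels))
    return torch.stack(losses).mean()

def similarity_regression_loss(embeddings):
    """Regress cross-modal cosine similarities towards a fixed target:
    1 for matching views (the diagonal), 0 for non-matching ones."""
    target = torch.eye(embeddings[0].size(0))
    losses = []
    for a, b in itertools.combinations(range(len(embeddings)), 2):
        za = F.normalize(embeddings[a], dim=-1)
        zb = F.normalize(embeddings[b], dim=-1)
        losses.append(F.mse_loss(za @ zb.t(), target))
    return torch.stack(losses).mean()
```

Under the same assumptions, the embedding combination mentioned at the end of the abstract could be as simple as averaging the L2-normalized embeddings of two or more query modalities before scoring them against the gallery.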
Related papers
- GSSF: Generalized Structural Sparse Function for Deep Cross-modal Metric Learning
We propose a Generalized Structural Sparse Function to capture powerful relationships across modalities for pair-wise similarity learning.
The distance metric combines two kinds of terms, diagonal and block-diagonal (see the sketch below).
Experiments on cross-modal and two extra uni-modal retrieval tasks have validated its superiority and flexibility.
arXiv Detail & Related papers (2024-10-20T03:45:50Z)
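The entry above only hints at how the diagonal and block-diagonal terms enter the metric; the following is a loose sketch under the assumption that each term simply re-weights squared feature differences, per dimension and per contiguous block respectively. The class name and block partition are made up for illustration, not taken from the paper.

```python
# Hedged sketch of a distance combining diagonal and block-diagonal terms;
# the parameterization below is an illustrative assumption.
import torch
import torch.nn as nn

class StructuredSparseDistance(nn.Module):
    """Distance between two embeddings combining a per-dimension (diagonal)
    weight with per-block weights over contiguous channel groups."""

    def __init__(self, dim, block_size=64):
        super().__init__()
        assert dim % block_size == 0
        self.block_size = block_size
        self.diag = nn.Parameter(torch.ones(dim))                  # diagonal term
        self.block = nn.Parameter(torch.ones(dim // block_size))   # block-diagonal term

    def forward(self, x, y):
        d = (x - y) ** 2                                           # [batch, dim]
        diag_term = (self.diag * d).sum(dim=-1)
        blocks = d.view(d.size(0), -1, self.block_size).sum(dim=-1)  # [batch, n_blocks]
        block_term = (self.block * blocks).sum(dim=-1)
        return diag_term + block_term

dist = StructuredSparseDistance(dim=512, block_size=64)
print(dist(torch.randn(8, 512), torch.randn(8, 512)).shape)       # torch.Size([8])
```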
- Similarity-based Memory Enhanced Joint Entity and Relation Extraction
Document-level joint entity and relation extraction is a challenging information extraction problem.
We present a multi-task learning framework with bidirectional memory-like dependency between tasks.
Our empirical studies show that the proposed approach outperforms the existing methods.
arXiv Detail & Related papers (2023-07-14T12:26:56Z)
- Learning Unseen Modality Interaction
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space while preserving rich information (see the sketch below).
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
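The summary above mentions a module that maps features of different modalities into one common space; below is a minimal sketch of such a projection module, assuming one linear head per modality. The class name, dimensions, and modality names are illustrative only.

```python
# Hedged sketch of a per-modality projection into a shared space;
# the architecture (one linear head per modality) is an assumption.
import torch
import torch.nn as nn

class CommonSpaceProjector(nn.Module):
    def __init__(self, input_dims, common_dim=512):
        super().__init__()
        # one projection head per modality, keyed by modality name
        self.heads = nn.ModuleDict({
            name: nn.Linear(dim, common_dim) for name, dim in input_dims.items()
        })

    def forward(self, name, features):
        # features: [batch, dim_of_that_modality] -> [batch, common_dim]
        return self.heads[name](features)

proj = CommonSpaceProjector({"video": 1024, "audio": 128, "text": 768})
z = proj("audio", torch.randn(4, 128))   # [4, 512]
```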
- Efficient Bilateral Cross-Modality Cluster Matching for Unsupervised Visible-Infrared Person ReID
We propose a novel bilateral cluster matching-based learning framework that reduces the modality gap by matching cross-modality clusters.
Under such a supervisory signal, a Modality-Specific and Modality-Agnostic (MSMA) contrastive learning framework is proposed to align features jointly at the cluster level (see the sketch below).
Experiments on the public SYSU-MM01 and RegDB datasets demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2023-05-22T03:27:46Z)
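The entry above describes contrastive alignment at the cluster level; the sketch below assumes the cross-modality cluster matching has already been done (that matching is the core contribution of the paper and is not reproduced here) and shows only a symmetric InfoNCE loss over matched cluster centroids.

```python
# Hedged sketch of cluster-level contrastive alignment; cluster
# correspondences across the two modalities are assumed given.
import torch
import torch.nn.functional as F

def cluster_contrastive_loss(centroids_a, centroids_b, temperature=0.1):
    """centroids_a, centroids_b: [n_clusters, dim]; row k of each tensor is
    assumed to describe the same matched cluster in the two modalities."""
    za = F.normalize(centroids_a, dim=-1)
    zb = F.normalize(centroids_b, dim=-1)
    logits = za @ zb.t() / temperature
    labels = torch.arange(za.size(0))
    # symmetric InfoNCE over matched cluster pairs
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```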
- Improving Cross-Modal Retrieval with Set of Diverse Embeddings
Cross-modal retrieval across image and text modalities is a challenging task due to its inherent ambiguity.
Set-based embedding has been studied as a solution to this problem.
We present a novel set-based embedding method, which is distinct from previous work in two aspects (see the sketch below).
arXiv Detail & Related papers (2022-11-30T05:59:23Z)
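The entry above motivates set-based embeddings but does not detail the scoring; as a rough illustration, the sketch below scores a pair of samples by a smooth maximum over the cosine similarities of their embedding-set elements. The smooth maximum and the temperature are assumptions, not the paper's formulation.

```python
# Hedged sketch of set-to-set cross-modal scoring; the smooth maximum is
# an illustrative choice.
import torch
import torch.nn.functional as F

def set_to_set_similarity(set_a, set_b, tau=0.1):
    """set_a: [k_a, dim], set_b: [k_b, dim]; returns a scalar score that is a
    smooth maximum over all element-wise cosine similarities."""
    sims = F.normalize(set_a, dim=-1) @ F.normalize(set_b, dim=-1).t()  # [k_a, k_b]
    return tau * torch.logsumexp(sims.flatten() / tau, dim=0)
```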
- Probing Visual-Audio Representation for Video Highlight Detection via Hard-Pairs Guided Contrastive Learning
Key to effective video representations is cross-modal representation learning and fine-grained feature discrimination.
In this paper, we enrich intra-modality and cross-modality relations for representation modeling.
We enlarge the discriminative power of the feature embedding with a hard-pairs guided contrastive learning scheme (see the sketch below).
arXiv Detail & Related papers (2022-06-21T07:29:37Z)
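The hard-pairs guided scheme is not spelled out in the summary above; the sketch below shows one plausible reading, an InfoNCE loss whose negatives are restricted to the hardest in-batch non-matching pairs. The top-k selection and temperature are illustrative assumptions, not the paper's exact scheme.

```python
# Hedged sketch of a hard-pairs guided contrastive loss: ordinary InfoNCE,
# but negatives are limited to the most similar non-matching samples.
import torch
import torch.nn.functional as F

def hard_pair_contrastive_loss(z_visual, z_audio, k=8, temperature=0.07):
    za = F.normalize(z_visual, dim=-1)
    zb = F.normalize(z_audio, dim=-1)
    sims = za @ zb.t() / temperature                      # [batch, batch]
    pos = sims.diagonal()                                 # matching pairs
    neg = sims.masked_fill(torch.eye(sims.size(0), dtype=torch.bool), float("-inf"))
    hard_neg, _ = neg.topk(k=min(k, sims.size(0) - 1), dim=-1)   # hardest negatives
    logits = torch.cat([pos.unsqueeze(1), hard_neg], dim=1)
    labels = torch.zeros(sims.size(0), dtype=torch.long)  # positive sits at index 0
    return F.cross_entropy(logits, labels)
```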
- Multi-Modal Mutual Information Maximization: A Novel Approach for Unsupervised Deep Cross-Modal Hashing
We propose a novel method, dubbed Cross-Modal Info-Max Hashing (CMIMH).
We learn informative representations that can preserve both intra- and inter-modal similarities.
The proposed method consistently outperforms other state-of-the-art cross-modal retrieval methods.
arXiv Detail & Related papers (2021-12-13T08:58:03Z)
- Multi-task Supervised Learning via Cross-learning
We consider a problem known as multi-task learning, consisting of fitting a set of regression functions intended for solving different tasks.
In our novel formulation, we couple the parameters of these functions so that they learn in their task-specific domains while staying close to each other (see the sketch below).
This facilitates cross-fertilization, in which data collected across the different domains helps improve the learning performance on each of the other tasks.
arXiv Detail & Related papers (2020-10-24T21:35:57Z)
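The coupling of task parameters described above can be illustrated with a simple quadratic proximity penalty that keeps every task's regression weights close to their mean; this is a generic sketch of the idea, not necessarily the paper's exact formulation.

```python
# Hedged sketch of coupled multi-task regression: per-task squared error
# plus a penalty keeping task weights close to their mean.
import torch

def coupled_multitask_loss(weights, features, targets, coupling=0.1):
    """weights: [n_tasks, dim]; features[t]: [n_t, dim]; targets[t]: [n_t]."""
    fit = sum(
        torch.mean((features[t] @ weights[t] - targets[t]) ** 2)
        for t in range(weights.size(0))
    )
    center = weights.mean(dim=0, keepdim=True)
    proximity = ((weights - center) ** 2).sum()    # keeps task weights close
    return fit + coupling * proximity
```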
- Universal Weighting Metric Learning for Cross-Modal Matching
Cross-modal matching has been a prominent research topic in both the vision and language areas.
We propose a simple and interpretable universal weighting framework for cross-modal matching.
arXiv Detail & Related papers (2020-10-07T13:16:45Z)
- COBRA: Contrastive Bi-Modal Representation Algorithm
We present a novel framework that trains two modalities jointly, inspired by the Contrastive Predictive Coding (CPC) and Noise Contrastive Estimation (NCE) paradigms.
We empirically show that this framework reduces the modality gap significantly and generates a robust and task agnostic joint-embedding space.
We outperform existing work on four diverse downstream tasks spanning seven benchmark cross-modal datasets.
arXiv Detail & Related papers (2020-05-07T18:20:12Z)