Universal Weighting Metric Learning for Cross-Modal Matching
- URL: http://arxiv.org/abs/2010.03403v1
- Date: Wed, 7 Oct 2020 13:16:45 GMT
- Title: Universal Weighting Metric Learning for Cross-Modal Matching
- Authors: Jiwei Wei, Xing Xu, Yang Yang, Yanli Ji, Zheng Wang, Heng Tao Shen
- Abstract summary: Cross-modal matching has been a highlighted research topic in both vision and language areas.
We propose a simple and interpretable universal weighting framework for cross-modal matching.
- Score: 79.32133554506122
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-modal matching has been a highlighted research topic in both vision and
language areas. Learning an appropriate mining strategy to sample and weight
informative pairs is crucial for cross-modal matching performance. However,
most existing metric learning methods are developed for unimodal matching,
which is unsuitable for cross-modal matching on multimodal data with
heterogeneous features. To address this problem, we propose a simple and
interpretable universal weighting framework for cross-modal matching, which
provides a tool to analyze the interpretability of various loss functions.
Furthermore, we introduce a new polynomial loss under the universal weighting
framework, which defines separate weight functions for positive and negative
informative pairs. Experimental results on two image-text matching
benchmarks and two video-text matching benchmarks validate the efficacy of the
proposed method.
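To make the idea of the polynomial loss concrete, below is a minimal PyTorch sketch of a polynomially weighted matching loss over an image-text similarity matrix. The function name, polynomial coefficients, and margin are illustrative assumptions, not the paper's exact formulation or hyperparameters.

```python
import torch

def polynomial_weighted_loss(scores, a=(1.0, 0.5), b=(1.0, 0.5), margin=0.2):
    """Hedged sketch of a polynomially weighted cross-modal matching loss.

    `scores` is an (N, N) image-text similarity matrix whose diagonal holds
    the positive (matched) pairs. Polynomial weight functions re-weight the
    informative positive and negative pairs; coefficients `a`, `b` and the
    margin are illustrative defaults, not values from the paper.
    """
    n = scores.size(0)
    pos = scores.diag()                               # s_p: matched pairs
    mask = ~torch.eye(n, dtype=torch.bool, device=scores.device)
    neg = scores[mask].view(n, n - 1)                 # s_n: mismatched pairs

    # Keep only "informative" negatives: those violating the margin.
    hard_neg = torch.clamp(neg - pos.unsqueeze(1) + margin, min=0)

    # Polynomial weights on similarities (larger weight for harder pairs).
    w_pos = sum(c * (1 - pos) ** i for i, c in enumerate(a, start=1))
    w_neg = sum(c * neg ** j for j, c in enumerate(b, start=1))

    loss_pos = (w_pos * (1 - pos)).mean()
    loss_neg = (w_neg * hard_neg).mean()
    return loss_pos + loss_neg

# Toy usage with random image/text embeddings.
img = torch.nn.functional.normalize(torch.randn(8, 64), dim=1)
txt = torch.nn.functional.normalize(torch.randn(8, 64), dim=1)
print(polynomial_weighted_loss(img @ txt.t()))
```

The weight polynomials assign larger weights to harder pairs (low-similarity positives and high-similarity negatives), which reflects the intuition behind sampling and weighting informative pairs.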
Related papers
- GSSF: Generalized Structural Sparse Function for Deep Cross-modal Metric Learning [51.677086019209554]
We propose a Generalized Structural Sparse function to capture powerful relationships across modalities for pair-wise similarity learning.
The distance metric encapsulates two formats: diagonal and block-diagonal terms.
Experiments on cross-modal and two extra uni-modal retrieval tasks have validated its superiority and flexibility.
arXiv Detail & Related papers (2024-10-20T03:45:50Z)
- Cross-Modal Coordination Across a Diverse Set of Input Modalities [0.0]
Cross-modal retrieval is the task of retrieving samples of a given modality by using queries of a different one.
This paper proposes two approaches to the problem: the first is based on an extension of the CLIP contrastive objective to an arbitrary number of input modalities.
The second departs from the contrastive formulation and tackles the coordination problem by regressing the cross-modal similarities towards a target.
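As a rough illustration of the first approach, the snippet below extends a two-modality, CLIP-style contrastive objective to several modalities by summing a symmetric InfoNCE term over every modality pair. This is a generic sketch under that assumption, not the cited paper's exact objective; all names and defaults are illustrative.

```python
import torch
import torch.nn.functional as F

def multimodal_contrastive_loss(embeddings, temperature=0.07):
    """Generic sketch: a symmetric InfoNCE loss summed over every pair of
    modalities. `embeddings` is a list of (N, D) tensors, one per modality,
    where row i of every tensor describes the same underlying sample i.
    """
    losses = []
    targets = None
    for a in range(len(embeddings)):
        for b in range(a + 1, len(embeddings)):
            za = F.normalize(embeddings[a], dim=1)
            zb = F.normalize(embeddings[b], dim=1)
            logits = za @ zb.t() / temperature        # (N, N) similarities
            if targets is None:
                targets = torch.arange(logits.size(0), device=logits.device)
            # Matched pairs sit on the diagonal; contrast both directions.
            losses.append(F.cross_entropy(logits, targets))
            losses.append(F.cross_entropy(logits.t(), targets))
    return torch.stack(losses).mean()

# Toy usage: three modalities (e.g. image, text, audio) for a batch of 8.
feats = [torch.randn(8, 32) for _ in range(3)]
print(multimodal_contrastive_loss(feats))
```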
arXiv Detail & Related papers (2024-01-29T17:53:25Z)
- Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
- Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning [53.68371566336254]
We argue that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment.
Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization.
arXiv Detail & Related papers (2023-03-10T14:38:49Z)
- Cross-modal Contrastive Learning for Multimodal Fake News Detection [10.760000041969139]
COOLANT is a cross-modal contrastive learning framework for multimodal fake news detection.
A cross-modal fusion module is developed to learn the cross-modality correlations.
An attention guidance module is implemented to help effectively and interpretably aggregate the aligned unimodal representations.
arXiv Detail & Related papers (2023-02-25T10:12:34Z)
- Probing Visual-Audio Representation for Video Highlight Detection via Hard-Pairs Guided Contrastive Learning [23.472951216815765]
Key to effective video representations is cross-modal representation learning and fine-grained feature discrimination.
In this paper, we enrich intra-modality and cross-modality relations for representation modeling.
We enlarge the discriminative power of feature embedding with a hard-pairs guided contrastive learning scheme.
arXiv Detail & Related papers (2022-06-21T07:29:37Z)
- S2-Net: Self-supervision Guided Feature Representation Learning for Cross-Modality Images [0.0]
Existing methods often fail to make the feature representations of corresponding cross-modality image pairs as close as possible.
In this letter, we design a cross-modality feature representation learning network, S2-Net, which is based on the recently successful detect-and-describe pipeline.
We introduce self-supervised learning with a well-designed loss function to guide the training without discarding the original advantages.
arXiv Detail & Related papers (2022-03-28T08:47:49Z)
- On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification.
We propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned.
Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z)
- Multi-Modal Mutual Information Maximization: A Novel Approach for Unsupervised Deep Cross-Modal Hashing [73.29587731448345]
We propose a novel method, dubbed Cross-Modal Info-Max Hashing (CMIMH).
We learn informative representations that can preserve both intra- and inter-modal similarities.
The proposed method consistently outperforms other state-of-the-art cross-modal retrieval methods.
arXiv Detail & Related papers (2021-12-13T08:58:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.