Multi-Modal Mutual Information Maximization: A Novel Approach for
Unsupervised Deep Cross-Modal Hashing
- URL: http://arxiv.org/abs/2112.06489v1
- Date: Mon, 13 Dec 2021 08:58:03 GMT
- Title: Multi-Modal Mutual Information Maximization: A Novel Approach for
Unsupervised Deep Cross-Modal Hashing
- Authors: Tuan Hoang, Thanh-Toan Do, Tam V. Nguyen, Ngai-Man Cheung
- Abstract summary: We propose a novel method, dubbed Cross-Modal Info-Max Hashing (CMIMH).
We learn informative representations that can preserve both intra- and inter-modal similarities.
The proposed method consistently outperforms other state-of-the-art cross-modal retrieval methods.
- Score: 73.29587731448345
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we adopt the maximizing mutual information (MI) approach to
tackle the problem of unsupervised learning of binary hash codes for efficient
cross-modal retrieval. We propose a novel method, dubbed Cross-Modal Info-Max
Hashing (CMIMH). First, to learn informative representations that can preserve
both intra- and inter-modal similarities, we leverage the recent advances in
estimating variational lower-bound of MI to maximize the MI between the binary
representations and input features and between binary representations of
different modalities. By jointly maximizing these MIs under the assumption that
the binary representations are modelled by multivariate Bernoulli
distributions, we can effectively learn, in a mini-batch manner with gradient
descent, binary representations that preserve both intra- and inter-modal
similarities. Furthermore, we find that trying to minimize the modality
gap by learning similar binary representations for the same instance from
different modalities could result in less informative representations. Hence,
balancing between reducing the modality gap and losing modality-private
information is important for the cross-modal retrieval tasks. Quantitative
evaluations on standard benchmark datasets demonstrate that the proposed method
consistently outperforms other state-of-the-art cross-modal retrieval methods.
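The abstract's core recipe, maximizing a variational lower bound on the MI between relaxed Bernoulli codes of paired modalities over a mini-batch, can be illustrated with a small sketch. This is not the authors' implementation: the InfoNCE-style estimator, the sigmoid relaxation, and all names (`relaxed_binary_codes`, `infonce_lower_bound`, the toy data) are illustrative assumptions standing in for the specific MI bound used in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relaxed_binary_codes(logits):
    # Continuous relaxation of multivariate Bernoulli hash codes into (-1, 1);
    # at retrieval time one would threshold, e.g. np.sign(logits).
    return 2.0 * sigmoid(logits) - 1.0

def infonce_lower_bound(z_a, z_b, temperature=0.5):
    # InfoNCE-style variational lower bound on the MI between the binary
    # representations of two modalities: paired samples are positives,
    # every other pair in the mini-batch serves as a negative.
    sim = z_a @ z_b.T / temperature                     # (B, B) similarities
    sim = sim - sim.max(axis=1, keepdims=True)          # numerical stability
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    # Bound: E[log p(positive | batch)] + log B  <=  I(z_a; z_b)
    return float(np.diag(log_softmax).mean() + np.log(len(z_a)))

rng = np.random.default_rng(0)
img_logits = rng.normal(size=(8, 16))                   # toy image-branch logits
txt_logits = img_logits + 0.1 * rng.normal(size=(8, 16))  # correlated text branch

z_img = relaxed_binary_codes(img_logits)
z_txt = relaxed_binary_codes(txt_logits)
mi_estimate = infonce_lower_bound(z_img, z_txt)
```

In a full model one would maximize several such bounds jointly (code vs. input features within each modality, plus code vs. code across modalities) by gradient ascent on the logits, which is what makes the mini-batch training mentioned in the abstract tractable.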
Related papers
- Enhancing Unimodal Latent Representations in Multimodal VAEs through Iterative Amortized Inference [20.761803725098005]
Multimodal variational autoencoders (VAEs) aim to capture shared latent representations by integrating information from different data modalities.
A significant challenge is accurately inferring representations from any subset of modalities without training an impractical number of inference networks for all possible modality combinations.
We introduce multimodal iterative amortized inference, an iterative refinement mechanism within the multimodal VAE framework.
arXiv Detail & Related papers (2024-10-15T08:49:38Z) - MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling [64.09238330331195]
We propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework.
Unlike discretization-based methods, MMAR takes in continuous-valued image tokens to avoid information loss.
We show that MMAR demonstrates superior performance over other joint multi-modal models.
arXiv Detail & Related papers (2024-10-14T17:57:18Z) - Masked Contrastive Reconstruction for Cross-modal Medical Image-Report
Retrieval [3.5314225883644945]
Cross-modal medical image-report retrieval task plays a significant role in clinical diagnosis and various medical generative tasks.
We propose an efficient framework named Masked Contrastive and Reconstruction (MCR), which takes masked data as the sole input for both tasks.
This enhances task connections, reducing information interference and competition between them, while also substantially decreasing the required GPU memory and training time.
arXiv Detail & Related papers (2023-12-26T01:14:10Z) - Unified Multi-modal Unsupervised Representation Learning for
Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - Multimodal Information Bottleneck: Learning Minimal Sufficient Unimodal
and Multimodal Representations [27.855467591358018]
We introduce the multimodal information bottleneck (MIB), aiming to learn a powerful and sufficient multimodal representation.
We develop three MIB variants, namely, early-fusion MIB, late-fusion MIB, and complete MIB, to focus on different perspectives of information constraints.
Experimental results suggest that the proposed method reaches state-of-the-art performance on the tasks of multimodal sentiment analysis and multimodal emotion recognition.
arXiv Detail & Related papers (2022-10-31T16:14:18Z) - Probing Visual-Audio Representation for Video Highlight Detection via
Hard-Pairs Guided Contrastive Learning [23.472951216815765]
Key to effective video representations is cross-modal representation learning and fine-grained feature discrimination.
In this paper, we enrich intra-modality and cross-modality relations for representation modeling.
We enlarge the discriminative power of feature embedding with a hard-pairs guided contrastive learning scheme.
arXiv Detail & Related papers (2022-06-21T07:29:37Z) - Variational Distillation for Multi-View Learning [104.17551354374821]
We design several variational information bottlenecks to exploit two key characteristics for multi-view representation learning.
Under rigorously theoretical guarantee, our approach enables IB to grasp the intrinsic correlation between observations and semantic labels.
arXiv Detail & Related papers (2022-06-20T03:09:46Z) - Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal
Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z) - Learning Multimodal VAEs through Mutual Supervision [72.77685889312889]
MEME combines information between modalities implicitly through mutual supervision.
We demonstrate that MEME outperforms baselines on standard metrics across both partial and complete observation schemes.
arXiv Detail & Related papers (2021-06-23T17:54:35Z) - Robust Latent Representations via Cross-Modal Translation and Alignment [36.67937514793215]
Most multi-modal machine learning methods require that all the modalities used for training are also available for testing.
To address this limitation, we aim to improve the testing performance of uni-modal systems using multiple modalities during training only.
The proposed multi-modal training framework uses cross-modal translation and correlation-based latent space alignment.
arXiv Detail & Related papers (2020-11-03T11:18:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.