Learning Multi-Modal Nonlinear Embeddings: Performance Bounds and an
Algorithm
- URL: http://arxiv.org/abs/2006.02330v2
- Date: Thu, 24 Dec 2020 22:01:04 GMT
- Title: Learning Multi-Modal Nonlinear Embeddings: Performance Bounds and an
Algorithm
- Authors: Semih Kaya and Elif Vural
- Abstract summary: We present a theoretical analysis of learning multi-modal nonlinear embeddings in a supervised setting.
We then propose a multi-modal nonlinear representation learning algorithm that is motivated by these theoretical findings.
Experimental comparison to recent multi-modal and single-modal learning algorithms suggests that the proposed method yields promising performance in multi-modal image classification and cross-modal image-text retrieval applications.
- Score: 8.528384027684192
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While many approaches exist in the literature to learn low-dimensional
representations for data collections in multiple modalities, the
generalizability of multi-modal nonlinear embeddings to previously unseen data
is a rather overlooked subject. In this work, we first present a theoretical
analysis of learning multi-modal nonlinear embeddings in a supervised setting.
Our performance bounds indicate that for successful generalization in
multi-modal classification and retrieval problems, the regularity of the
interpolation functions extending the embedding to the whole data space is as
important as the between-class separation and cross-modal alignment criteria.
We then propose a multi-modal nonlinear representation learning algorithm that
is motivated by these theoretical findings, where the embeddings of the
training samples are optimized jointly with the Lipschitz regularity of the
interpolators. Experimental comparison to recent multi-modal and single-modal
learning algorithms suggests that the proposed method yields promising
performance in multi-modal image classification and cross-modal image-text
retrieval applications.
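As a rough illustration of the idea described in the abstract (training-sample embeddings optimized jointly with the regularity of the interpolators), the following is a minimal sketch, not the authors' implementation. It assumes Gaussian RBF interpolators, uses the norm of the interpolator coefficients as a proxy for Lipschitz regularity, and combines it with illustrative cross-modal alignment and margin-based separation terms; all hyperparameter values and function names are assumptions.
```python
# Minimal sketch (not the authors' algorithm): jointly optimize the embeddings
# of paired training samples from two modalities together with a smoothness
# proxy for the RBF interpolators that would extend the embedding to new data.
import torch

def rbf_kernel(X, sigma=1.0):
    # Gaussian RBF kernel matrix over the training samples (illustrative sigma).
    d2 = torch.cdist(X, X) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

def embedding_loss(Y1, Y2, K1, K2, labels,
                   lam_lip=1.0, lam_align=1.0, lam_sep=1.0, margin=4.0):
    # Regularity proxy: interpolator coefficients C = K^{-1} Y; a small norm
    # corresponds to a smoother (more regular) interpolation function.
    C1 = torch.linalg.solve(K1, Y1)
    C2 = torch.linalg.solve(K2, Y2)
    lip = (C1 ** 2).sum() + (C2 ** 2).sum()

    # Cross-modal alignment: paired samples should embed nearby across modalities.
    align = ((Y1 - Y2) ** 2).sum()

    # Between-class separation: pull same-class pairs together and push
    # different-class pairs at least `margin` apart (hinge).
    same = (labels[:, None] == labels[None, :]).float()
    d2 = torch.cdist(Y1, Y1) ** 2
    sep = (same * d2).sum() + ((1.0 - same) * torch.clamp(margin - d2, min=0.0)).sum()

    return lam_lip * lip + lam_align * align + lam_sep * sep

# Toy usage: 20 paired samples in two 5-dim modalities, 2-dim embedding.
torch.manual_seed(0)
X1, X2 = torch.randn(20, 5), torch.randn(20, 5)
labels = torch.randint(0, 2, (20,))
K1 = rbf_kernel(X1) + 1e-3 * torch.eye(20)   # small ridge for invertibility
K2 = rbf_kernel(X2) + 1e-3 * torch.eye(20)
Y1 = torch.randn(20, 2, requires_grad=True)
Y2 = torch.randn(20, 2, requires_grad=True)
opt = torch.optim.Adam([Y1, Y2], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = embedding_loss(Y1, Y2, K1, K2, labels)
    loss.backward()
    opt.step()
```
In this sketch the embeddings themselves are the optimization variables, so the regularity of the out-of-sample interpolators enters the objective directly, mirroring the trade-off between separation, alignment, and interpolator regularity highlighted by the bounds.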
Related papers
- Mutual Information-based Representations Disentanglement for Unaligned Multimodal Language Sequences [25.73415065546444]
A key challenge in unaligned multimodal language sequences is to integrate information from various modalities to obtain a refined multimodal joint representation.
We propose a Mutual Information-based Representations Disentanglement (MIRD) method for unaligned multimodal language sequences.
arXiv Detail & Related papers (2024-09-19T02:12:26Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
- One-step Multi-view Clustering with Diverse Representation [47.41455937479201]
We propose a one-step multi-view clustering with diverse representation method, which incorporates multi-view learning and $k$-means into a unified framework.
We develop an efficient optimization algorithm with proven convergence to solve the resultant problem.
arXiv Detail & Related papers (2023-06-08T02:52:24Z)
- Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications [90.6849884683226]
We study the challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data.
Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds.
We show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
arXiv Detail & Related papers (2023-06-07T15:44:53Z)
- Synergies between Disentanglement and Sparsity: Generalization and Identifiability in Multi-Task Learning [79.83792914684985]
We prove a new identifiability result that provides conditions under which maximally sparse base-predictors yield disentangled representations.
Motivated by this theoretical result, we propose a practical approach to learn disentangled representations based on a sparsity-promoting bi-level optimization problem.
arXiv Detail & Related papers (2022-11-26T21:02:09Z)
- Unified Multi-View Orthonormal Non-Negative Graph Based Clustering Framework [74.25493157757943]
We formulate a novel clustering model, which exploits the non-negative feature property and incorporates the multi-view information into a unified joint learning framework.
We also explore, for the first time, the multi-model non-negative graph-based approach to clustering data based on deep features.
arXiv Detail & Related papers (2022-11-03T08:18:27Z)
- Hybrid Contrastive Learning of Tri-Modal Representation for Multimodal Sentiment Analysis [18.4364234071951]
We propose a novel framework HyCon for hybrid contrastive learning of tri-modal representation.
Specifically, we simultaneously perform intra-/inter-modal contrastive learning and semi-contrastive learning.
Our proposed method outperforms existing works.
arXiv Detail & Related papers (2021-09-04T06:04:21Z)
- Deep Class-Specific Affinity-Guided Convolutional Network for Multimodal Unpaired Image Segmentation [7.021001169318551]
Multi-modal medical image segmentation plays an essential role in clinical diagnosis.
It remains challenging as the input modalities are often not well-aligned spatially.
We propose an affinity-guided fully convolutional network for multimodal image segmentation.
arXiv Detail & Related papers (2021-01-05T13:56:51Z)
- Robust Latent Representations via Cross-Modal Translation and Alignment [36.67937514793215]
Most multi-modal machine learning methods require that all the modalities used for training are also available for testing.
To address this limitation, we aim to improve the testing performance of uni-modal systems using multiple modalities during training only.
The proposed multi-modal training framework uses cross-modal translation and correlation-based latent space alignment.
arXiv Detail & Related papers (2020-11-03T11:18:04Z)
- Unsupervised Multi-view Clustering by Squeezing Hybrid Knowledge from Cross View and Each View [68.88732535086338]
This paper proposes a new multi-view clustering method, low-rank subspace multi-view clustering based on adaptive graph regularization.
Experimental results for five widely used multi-view benchmarks show that our proposed algorithm surpasses other state-of-the-art methods by a clear margin.
arXiv Detail & Related papers (2020-08-23T08:25:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.