Semi-Supervised Multi-Modal Multi-Instance Multi-Label Deep Network with
Optimal Transport
- URL: http://arxiv.org/abs/2104.08489v1
- Date: Sat, 17 Apr 2021 09:18:28 GMT
- Title: Semi-Supervised Multi-Modal Multi-Instance Multi-Label Deep Network with
Optimal Transport
- Authors: Yang Yang, Zhao-Yang Fu, De-Chuan Zhan, Zhi-Bin Liu, and Yuan Jiang
- Abstract summary: We propose a novel Multi-modal Multi-instance Multi-label Deep Network (M3DN).
M3DN treats M3 learning in an end-to-end multi-modal deep network and exploits the consistency principle among the bag-level predictions of different modalities.
Its semi-supervised extension, M3DNS, can thereby better predict labels and exploit label correlations simultaneously.
- Score: 24.930976128926314
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Complex objects usually carry multiple labels and can be represented by
multiple modal representations; e.g., a complex article contains text and
image information as well as multiple annotations. Previous methods assume that
the homogeneous multi-modal data are consistent, while in real applications
the raw data are disordered, e.g., an article consists of a variable number
of inconsistent text and image instances. Multi-modal Multi-instance
Multi-label (M3) learning provides a framework for handling such tasks and has
exhibited excellent performance. However, M3 learning faces two main
challenges: 1) how to effectively utilize label correlation; 2) how to take
advantage of multi-modal learning to process unlabeled instances. To solve
these problems, we first propose a novel Multi-modal Multi-instance Multi-label
Deep Network (M3DN), which treats M3 learning in an end-to-end multi-modal
deep network and exploits the consistency principle among the bag-level
predictions of different modalities. Based on M3DN, we learn the latent ground
label metric with optimal transport. Moreover, we introduce extrinsic unlabeled
multi-modal multi-instance data and propose M3DNS, which employs an
instance-level auto-encoder for each single modality and a modified bag-level
optimal transport to strengthen the consistency among modalities. Thereby M3DNS
can better predict labels and exploit label correlations simultaneously.
Experiments on benchmark datasets and the real-world WKG Game-Hub dataset
validate the effectiveness of the proposed methods.
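The bag-level optimal-transport consistency idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes hypothetical bag-level label distributions from a text and an image modality, uses a placeholder 0/1 ground metric in place of the learned latent label metric, and computes the transport cost between the two predictions with entropy-regularized Sinkhorn iterations.

```python
import numpy as np

def sinkhorn_ot(p, q, M, reg=0.1, n_iters=200):
    """Entropy-regularized OT (Sinkhorn) cost between two label
    distributions p, q under ground cost matrix M."""
    K = np.exp(-M / reg)                 # Gibbs kernel
    u = np.ones_like(p)
    for _ in range(n_iters):
        u = p / (K @ (q / (K.T @ u)))    # alternating marginal scaling
    v = q / (K.T @ u)
    T = u[:, None] * K * v[None, :]      # transport plan
    return float(np.sum(T * M))          # transport cost <T, M>

# Toy example: bag-level predictions from two modalities over 3 labels
p_text  = np.array([0.6, 0.3, 0.1])
p_image = np.array([0.5, 0.4, 0.1])
M = 1.0 - np.eye(3)                      # 0/1 placeholder for the learned label metric
loss = sinkhorn_ot(p_text, p_image, M)   # cross-modal consistency penalty
```

In the paper's setting this cost would serve as a consistency regularizer pushing the modalities' bag-level predictions together, with the ground metric `M` itself learned to capture label correlation.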
Related papers
- OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces [67.07083389543799]
We present OmniBind, large-scale multimodal joint representation models ranging in scale from 7 billion to 30 billion parameters.
Due to the scarcity of data pairs across all modalities, instead of training large models from scratch, we propose remapping and binding the spaces of various pre-trained specialist models together.
Experiments demonstrate the versatility and superiority of OmniBind as an omni representation model, highlighting its great potential for diverse applications.
arXiv Detail & Related papers (2024-07-16T16:24:31Z)
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: an Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
- Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment [11.897888221717245]
This paper proposes a novel CLIP-guided contrastive-learning-based architecture to perform multi-modal feature alignment.
Our model is simple to implement without using task-specific external knowledge, and thus can easily migrate to other multi-modal tasks.
arXiv Detail & Related papers (2024-03-11T01:07:36Z)
- Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications [90.6849884683226]
We study the challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data.
Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds.
We show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
arXiv Detail & Related papers (2023-06-07T15:44:53Z)
- Missing Modality Robustness in Semi-Supervised Multi-Modal Semantic Segmentation [27.23513712371972]
We propose a simple yet efficient multi-modal fusion mechanism, Linear Fusion.
We also propose M3L: Multi-modal Teacher for Masked Modality Learning.
Our proposal shows an absolute improvement of up to 10% in robust mIoU over the most competitive baselines.
arXiv Detail & Related papers (2023-04-21T05:52:50Z)
- Align and Attend: Multimodal Summarization with Dual Contrastive Losses [57.83012574678091]
The goal of multimodal summarization is to extract the most important information from different modalities to form output summaries.
Existing methods fail to leverage the temporal correspondence between different modalities and ignore the intrinsic correlation between different samples.
We introduce Align and Attend Multimodal Summarization (A2Summ), a unified multimodal transformer-based model which can effectively align and attend to the multimodal input.
arXiv Detail & Related papers (2023-03-13T17:01:42Z)
- Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion [112.27103169303184]
Multimodal Knowledge Graphs (MKGs) organize visual and textual factual knowledge.
The proposed MKGformer obtains SOTA performance on four datasets spanning multimodal link prediction, multimodal RE, and multimodal NER.
arXiv Detail & Related papers (2022-05-04T23:40:04Z) - MM-TTA: Multi-Modal Test-Time Adaptation for 3D Semantic Segmentation [104.48766162008815]
We propose and explore a new multi-modal extension of test-time adaptation for 3D semantic segmentation.
To take full advantage of multi-modality, each modality provides regularized self-supervisory signals to the other modalities.
Our regularized pseudo labels produce stable self-learning signals in numerous multi-modal test-time adaptation scenarios.
arXiv Detail & Related papers (2022-04-27T02:28:12Z) - CLMLF:A Contrastive Learning and Multi-Layer Fusion Method for
Multimodal Sentiment Detection [24.243349217940274]
We propose a Contrastive Learning and Multi-Layer Fusion (CLMLF) method for multimodal sentiment detection.
Specifically, we first encode text and image to obtain hidden representations, and then use a multi-layer fusion module to align and fuse the token-level features of text and image.
In addition to the sentiment analysis task, we also design two contrastive learning tasks: label-based contrastive learning and data-based contrastive learning.
arXiv Detail & Related papers (2022-04-12T04:03:06Z)
- Unsupervised Multimodal Language Representations using Convolutional Autoencoders [5.464072883537924]
We propose extracting unsupervised Multimodal Language representations that are universal and can be applied to different tasks.
We map the word-level aligned multimodal sequences to 2-D matrices and then use Convolutional Autoencoders to learn embeddings by combining multiple datasets.
It is also shown that our method is extremely lightweight and easily generalizes to other tasks and unseen data with a small performance drop and almost the same number of parameters.
arXiv Detail & Related papers (2021-10-06T18:28:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.