Semi-Supervised Multi-Modal Multi-Instance Multi-Label Deep Network with
Optimal Transport
- URL: http://arxiv.org/abs/2104.08489v1
- Date: Sat, 17 Apr 2021 09:18:28 GMT
- Title: Semi-Supervised Multi-Modal Multi-Instance Multi-Label Deep Network with
Optimal Transport
- Authors: Yang Yang, Zhao-Yang Fu, De-Chuan Zhan, Zhi-Bin Liu, and Yuan Jiang
- Abstract summary: We propose a novel Multi-modal Multi-instance Multi-label Deep Network (M3DN).
M3DN formulates M3 learning as an end-to-end multi-modal deep network and enforces a consistency principle among the bag-level predictions of the different modalities.
Its semi-supervised extension, M3DNS, can thereby better predict labels and exploit label correlations simultaneously.
- Score: 24.930976128926314
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Complex objects usually carry multiple labels and can be represented by
multiple modalities; e.g., a complex article contains text and image
information as well as multiple annotations. Previous methods assume that
the homogeneous multi-modal data are consistent, while in real applications
the raw data are disordered; e.g., an article consists of a variable number
of inconsistent text and image instances. Multi-modal Multi-instance
Multi-label (M3) learning therefore provides a framework for handling such
tasks and has exhibited excellent performance. However, M3 learning faces
two main challenges: 1) how to effectively utilize label correlation; 2) how
to take advantage of multi-modal learning to process unlabeled instances. To
address these problems, we first propose a novel Multi-modal Multi-instance
Multi-label Deep Network (M3DN), which formulates M3 learning as an
end-to-end multi-modal deep network and enforces a consistency principle
among the bag-level predictions of the different modalities. Building on
M3DN, we learn the latent ground label metric via optimal transport.
Moreover, we introduce extrinsic unlabeled multi-modal multi-instance data
and propose M3DNS, which uses an instance-level auto-encoder for each single
modality and a modified bag-level optimal transport to strengthen the
consistency among modalities. M3DNS can thereby better predict labels and
exploit label correlations simultaneously. Experiments on benchmark datasets
and the real-world WKG Game-Hub dataset validate the effectiveness of the
proposed methods.
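To make the bag-level consistency idea above concrete, here is a minimal NumPy sketch, not the authors' implementation: each modality aggregates its instance scores into a bag-level label distribution, and an entropy-regularized optimal-transport (Sinkhorn) distance under a label-to-label ground metric acts as the cross-modal consistency penalty. The toy linear scorers, the random ground metric, and the Sinkhorn hyper-parameters are all illustrative assumptions; in M3DN the encoders are deep networks and the ground label metric is learned jointly.

```python
# Minimal NumPy sketch (not the authors' code): bag-level predictions from two
# modalities are compared with an entropy-regularized optimal-transport (Sinkhorn)
# distance under a label ground metric, mirroring the consistency idea in M3DN.
# All sizes, the ground metric, and hyper-parameters below are illustrative assumptions.
import numpy as np

def sinkhorn_distance(p, q, M, reg=0.1, n_iters=200):
    """Entropy-regularized OT distance between label distributions p and q
    with ground (label-to-label) cost matrix M."""
    K = np.exp(-M / reg)                      # Gibbs kernel
    u = np.ones_like(p)
    for _ in range(n_iters):                  # Sinkhorn fixed-point iterations
        v = q / (K.T @ u)
        u = p / (K @ v)
    T = np.diag(u) @ K @ np.diag(v)           # transport plan
    return float(np.sum(T * M))

def bag_prediction(instance_features, W):
    """Max-pool instance scores into a bag-level label distribution."""
    scores = instance_features @ W            # (n_instances, n_labels)
    bag_scores = scores.max(axis=0)           # multi-instance aggregation
    e = np.exp(bag_scores - bag_scores.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n_labels = 5
# Toy bag: 3 text instances (dim 8) and 4 image instances (dim 16)
text_bag, image_bag = rng.normal(size=(3, 8)), rng.normal(size=(4, 16))
W_text, W_image = rng.normal(size=(8, n_labels)), rng.normal(size=(16, n_labels))

p_text = bag_prediction(text_bag, W_text)
p_image = bag_prediction(image_bag, W_image)

# Latent ground label metric; M3DN learns this jointly, here it is fixed at random.
M = np.abs(rng.normal(size=(n_labels, n_labels)))
np.fill_diagonal(M, 0.0)
M = (M + M.T) / 2                             # symmetric, zero diagonal

consistency_loss = sinkhorn_distance(p_text, p_image, M)
print(f"bag-level OT consistency loss: {consistency_loss:.4f}")
```

Per the abstract, M3DNS additionally attaches an instance-level auto-encoder to each modality and modifies the bag-level optimal transport so that unlabeled multi-modal bags also contribute to the consistency term.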
Related papers
- MMBind: Unleashing the Potential of Distributed and Heterogeneous Data for Multimodal Learning in IoT [11.884646027921173]
We propose MMBind, a new framework for multimodal learning on distributed and heterogeneous IoT data.
We demonstrate that data of different modalities observing similar events, even captured at different times and locations, can be effectively used for multimodal training.
arXiv Detail & Related papers (2024-11-18T23:34:07Z)
- Multimodality Helps Few-Shot 3D Point Cloud Semantic Segmentation [61.91492500828508]
Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal support samples.
We introduce a cost-free multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality.
We propose a simple yet effective Test-time Adaptive Cross-modal Seg (TACC) technique to mitigate training bias.
arXiv Detail & Related papers (2024-10-29T19:28:41Z)
- Meta-Learn Unimodal Signals with Weak Supervision for Multimodal Sentiment Analysis [25.66434557076494]
We propose a novel meta uni-label generation (MUG) framework to address the above problem.
We first design a contrastive-based projection module to bridge the gap between unimodal and multimodal representations.
We then propose unimodal and multimodal denoising tasks to train MUCN with explicit supervision via a bi-level optimization strategy.
arXiv Detail & Related papers (2024-08-28T03:43:01Z)
- Adapting Segment Anything Model to Multi-modal Salient Object Detection with Semantic Feature Fusion Guidance [15.435695491233982]
We propose a novel framework to explore and exploit the powerful feature representation and zero-shot generalization ability of the Segment Anything Model (SAM) for multi-modal salient object detection (SOD).
We develop SAM with semantic feature fusion guidance (Sammese).
In the image encoder, a multi-modal adapter is proposed to adapt the single-modal SAM to multi-modal information. Specifically, in the mask decoder, a semantic-geometric
arXiv Detail & Related papers (2024-08-27T13:47:31Z)
- OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces [67.07083389543799]
We present OmniBind, large-scale multimodal joint representation models ranging in scale from 7 billion to 30 billion parameters.
Due to the scarcity of data pairs across all modalities, instead of training large models from scratch, we propose remapping and binding the spaces of various pre-trained specialist models together.
Experiments demonstrate the versatility and superiority of OmniBind as an omni representation model, highlighting its great potential for diverse applications.
arXiv Detail & Related papers (2024-07-16T16:24:31Z)
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
- Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications [90.6849884683226]
We study the challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data.
Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds.
We show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
arXiv Detail & Related papers (2023-06-07T15:44:53Z)
- Align and Attend: Multimodal Summarization with Dual Contrastive Losses [57.83012574678091]
The goal of multimodal summarization is to extract the most important information from different modalities to form output summaries.
Existing methods fail to leverage the temporal correspondence between different modalities and ignore the intrinsic correlation between different samples.
We introduce Align and Attend Multimodal Summarization (A2Summ), a unified multimodal transformer-based model which can effectively align and attend to the multimodal input.
arXiv Detail & Related papers (2023-03-13T17:01:42Z)
- MM-TTA: Multi-Modal Test-Time Adaptation for 3D Semantic Segmentation [104.48766162008815]
We propose and explore a new multi-modal extension of test-time adaptation for 3D semantic segmentation.
To design a framework that can take full advantage of multi-modality, each modality provides regularized self-supervisory signals to other modalities.
Our regularized pseudo labels produce stable self-learning signals in numerous multi-modal test-time adaptation scenarios.
arXiv Detail & Related papers (2022-04-27T02:28:12Z)
- CLMLF: A Contrastive Learning and Multi-Layer Fusion Method for Multimodal Sentiment Detection [24.243349217940274]
We propose a Contrastive Learning and Multi-Layer Fusion (CLMLF) method for multimodal sentiment detection.
Specifically, we first encode text and image to obtain hidden representations, and then use a multi-layer fusion module to align and fuse the token-level features of text and image.
In addition to the sentiment analysis task, we also design two contrastive learning tasks: label-based contrastive learning and data-based contrastive learning (a toy sketch of these two objectives follows this entry).
arXiv Detail & Related papers (2022-04-12T04:03:06Z)
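For the two auxiliary objectives in the CLMLF entry above, here is a small illustrative NumPy sketch; it is not the CLMLF implementation, and the feature shapes, temperature, and "augmented view" are assumptions. Label-based contrastive learning treats samples that share a sentiment label as positives, while data-based contrastive learning treats a second view of the same fused text-image feature as the positive.

```python
# Minimal NumPy sketch (illustrative only, not the CLMLF implementation): the two
# auxiliary objectives described above, computed on toy fused text-image features.
# "Label-based" pulls together samples sharing a sentiment label; "data-based"
# pulls together two views of the same sample (SimCLR-style).
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def label_based_contrastive(feats, labels, temp=0.1):
    """Supervised contrastive loss: positives are other samples with the same label."""
    z = l2_normalize(feats)
    sim = np.exp(z @ z.T / temp)
    np.fill_diagonal(sim, 0.0)                       # exclude self-similarity
    losses = []
    for i in range(len(labels)):
        pos = (labels == labels[i]) & (np.arange(len(labels)) != i)
        if pos.any():
            losses.append(-np.log(sim[i, pos].sum() / sim[i].sum()))
    return float(np.mean(losses))

def data_based_contrastive(view_a, view_b, temp=0.1):
    """Self-supervised contrastive loss: the positive for view_a[i] is view_b[i]."""
    za, zb = l2_normalize(view_a), l2_normalize(view_b)
    sim = np.exp(za @ zb.T / temp)
    return float(np.mean(-np.log(np.diag(sim) / sim.sum(axis=1))))

rng = np.random.default_rng(0)
fused = rng.normal(size=(8, 16))                     # 8 fused text-image features
labels = rng.integers(0, 3, size=8)                  # toy sentiment labels
aug = fused + 0.05 * rng.normal(size=fused.shape)    # toy second view of each sample

print("label-based loss:", round(label_based_contrastive(fused, labels), 4))
print("data-based loss:", round(data_based_contrastive(fused, aug), 4))
```

In practice both auxiliary losses would be combined with the main sentiment classification loss; the weighting is a design choice the summary above does not specify.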