One-stage Modality Distillation for Incomplete Multimodal Learning
- URL: http://arxiv.org/abs/2309.08204v1
- Date: Fri, 15 Sep 2023 07:12:27 GMT
- Title: One-stage Modality Distillation for Incomplete Multimodal Learning
- Authors: Shicai Wei, Yang Luo, Chunbo Luo
- Abstract summary: This paper presents a one-stage modality distillation framework that unifies the privileged knowledge transfer and modality information fusion.
The proposed framework can overcome the problem of incomplete modality input in various scenes and achieve state-of-the-art performance.
- Score: 7.791488931628906
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning based on multimodal data has attracted increasing interest recently.
While a variety of sensory modalities can be collected for training, not all of
them are always available in deployment scenarios, which raises the challenge
of inference with incomplete modalities. To address this issue, this paper presents a
one-stage modality distillation framework that unifies the privileged knowledge
transfer and modality information fusion into a single optimization procedure
via multi-task learning. Compared with conventional modality distillation,
which performs the two steps independently, this unified procedure helps capture
valuable representations that directly assist the final model's inference.
Specifically, we propose a joint adaptation network for the modality transfer
task to preserve the privileged information; it addresses the representation
heterogeneity caused by the input discrepancy via joint distribution adaptation.
Then, we introduce a cross translation network for the modality fusion task to
aggregate the restored and available modality features, leveraging a
parameter-sharing strategy to capture cross-modal cues explicitly. Extensive
experiments on RGB-D classification and segmentation tasks demonstrate that the
proposed multimodal inheritance framework can overcome the problem of incomplete
modality input in various scenes and achieve state-of-the-art performance.
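To make the one-stage idea concrete, below is a minimal, hypothetical PyTorch sketch of a single training step in which the privileged-knowledge transfer loss and the fusion/task loss are optimized jointly as a multi-task objective. It is not the authors' implementation: all module and function names are invented for illustration, a CORAL-style second-order alignment stands in for the paper's joint distribution adaptation, and a simple shared linear projection stands in for the cross translation network.

```python
# Sketch only: jointly optimize modality transfer and modality fusion in one stage.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HallucinationBranch(nn.Module):
    """Restores features of the missing modality (e.g. depth) from the available one (e.g. RGB)."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, rgb_feat):
        return self.net(rgb_feat)


class SharedFusionHead(nn.Module):
    """Aggregates restored and available modality features through a shared projection (a stand-in
    for the parameter-sharing cross translation network) before classification."""
    def __init__(self, dim=128, num_classes=10):
        super().__init__()
        self.shared_proj = nn.Linear(dim, dim)   # shared across both modalities
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, rgb_feat, restored_feat):
        fused = self.shared_proj(rgb_feat) + self.shared_proj(restored_feat)
        return self.classifier(fused)


def coral_loss(f_s, f_t):
    """Second-order statistics alignment (CORAL-style), used here as a simple proxy for
    distribution adaptation between restored and privileged (teacher) features."""
    d = f_s.size(1)
    cs = (f_s - f_s.mean(0)).t() @ (f_s - f_s.mean(0)) / (f_s.size(0) - 1)
    ct = (f_t - f_t.mean(0)).t() @ (f_t - f_t.mean(0)) / (f_t.size(0) - 1)
    return ((cs - ct) ** 2).sum() / (4 * d * d)


def train_step(rgb_feat, depth_feat_teacher, labels, hallucinator, fusion_head, optimizer, lam=1.0):
    """One-stage optimization: transfer loss and task loss in a single backward pass.
    depth_feat_teacher is assumed to come from a network trained with the privileged modality."""
    restored = hallucinator(rgb_feat)                     # modality transfer task
    transfer_loss = F.mse_loss(restored, depth_feat_teacher) + coral_loss(restored, depth_feat_teacher)
    logits = fusion_head(rgb_feat, restored)              # modality fusion task
    task_loss = F.cross_entropy(logits, labels)
    loss = task_loss + lam * transfer_loss                # multi-task objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Illustrative usage with random features.
hall, head = HallucinationBranch(), SharedFusionHead()
opt = torch.optim.Adam(list(hall.parameters()) + list(head.parameters()), lr=1e-3)
rgb, depth, y = torch.randn(16, 128), torch.randn(16, 128), torch.randint(0, 10, (16,))
train_step(rgb, depth, y, hall, head, opt)
```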
Related papers
- Enhancing Unimodal Latent Representations in Multimodal VAEs through Iterative Amortized Inference [20.761803725098005]
Multimodal variational autoencoders (VAEs) aim to capture shared latent representations by integrating information from different data modalities.
A significant challenge is accurately inferring representations from any subset of modalities without training an impractical number of inference networks for all possible modality combinations.
We introduce multimodal iterative amortized inference, an iterative refinement mechanism within the multimodal VAE framework.
arXiv Detail & Related papers (2024-10-15T08:49:38Z)
- Missing Modality Prediction for Unpaired Multimodal Learning via Joint Embedding of Unimodal Models [6.610033827647869]
In real-world scenarios, consistently acquiring complete multimodal data presents significant challenges.
This often leads to the issue of missing modalities, where data for certain modalities are absent.
We propose a novel framework integrating parameter-efficient fine-tuning of unimodal pretrained models with a self-supervised joint-embedding learning method.
arXiv Detail & Related papers (2024-07-17T14:44:25Z)
- All in One Framework for Multimodal Re-identification in the Wild [58.380708329455466]
A multimodal learning paradigm for ReID, referred to as All-in-One (AIO), is introduced.
AIO harnesses a frozen pre-trained big model as an encoder, enabling effective multimodal retrieval without additional fine-tuning.
Experiments on cross-modal and multimodal ReID reveal that AIO not only adeptly handles various modal data but also excels in challenging contexts.
arXiv Detail & Related papers (2024-05-08T01:04:36Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
- Incomplete Multimodal Learning for Remote Sensing Data Fusion [12.822457129596824]
The mechanism of connecting multimodal signals through self-attention operation is a key factor in the success of multimodal Transformer networks in remote sensing data fusion tasks.
Traditional approaches assume access to all modalities during both training and inference, which can lead to severe degradation when dealing with modal-incomplete inputs in downstream applications.
Our proposed approach introduces a novel model for incomplete multimodal learning in the context of remote sensing data fusion.
arXiv Detail & Related papers (2023-04-22T12:16:52Z)
- Exploiting modality-invariant feature for robust multimodal emotion recognition with missing modalities [76.08541852988536]
We propose to use invariant features for a missing modality imagination network (IF-MMIN).
We show that the proposed model outperforms all baselines and invariantly improves the overall emotion recognition performance under uncertain missing-modality conditions.
arXiv Detail & Related papers (2022-10-27T12:16:25Z)
- CLIP-Driven Fine-grained Text-Image Person Re-identification [50.94827165464813]
Text-Image person Re-identification (TIReID) aims to retrieve the image corresponding to a given text query from a pool of candidate images.
We propose a CLIP-driven Fine-grained information excavation framework (CFine) to fully utilize the powerful knowledge of CLIP for TIReID.
arXiv Detail & Related papers (2022-10-19T03:43:12Z)