SimMMDG: A Simple and Effective Framework for Multi-modal Domain
Generalization
- URL: http://arxiv.org/abs/2310.19795v1
- Date: Mon, 30 Oct 2023 17:58:09 GMT
- Title: SimMMDG: A Simple and Effective Framework for Multi-modal Domain
Generalization
- Authors: Hao Dong, Ismail Nejjar, Han Sun, Eleni Chatzi, Olga Fink
- Abstract summary: SimMMDG is a framework to overcome the challenges of achieving domain generalization in multi-modal scenarios.
We employ supervised contrastive learning on the modality-shared features to ensure they possess joint properties, and impose distance constraints on modality-specific features to promote diversity.
Our framework is theoretically well-supported and achieves strong performance in multi-modal DG on the EPIC-Kitchens dataset and the novel Human-Animal-Cartoon dataset.
- Score: 13.456240733175767
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In real-world scenarios, achieving domain generalization (DG) presents
significant challenges as models are required to generalize to unknown target
distributions. Generalizing to unseen multi-modal distributions poses even
greater difficulties due to the distinct properties exhibited by different
modalities. To overcome the challenges of achieving domain generalization in
multi-modal scenarios, we propose SimMMDG, a simple yet effective multi-modal
DG framework. We argue that mapping features from different modalities into the
same embedding space impedes model generalization. To address this, we propose
splitting the features within each modality into modality-specific and
modality-shared components. We employ supervised contrastive learning on the
modality-shared features to ensure they possess joint properties and impose
distance constraints on modality-specific features to promote diversity. In
addition, we introduce a cross-modal translation module to regularize the
learned features, which can also be used for missing-modality generalization.
We demonstrate that our framework is theoretically well-supported and achieves
strong performance in multi-modal DG on the EPIC-Kitchens dataset and the novel
Human-Animal-Cartoon (HAC) dataset introduced in this paper. Our source code
and HAC dataset are available at https://github.com/donghao51/SimMMDG.
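The abstract describes the framework at the level of its loss terms. The sketch below shows one way such a recipe could be wired up in PyTorch for a two-modality (e.g., video/audio) setup; the module and function names, the margin-based form of the distance constraint, and the MSE translation objective are illustrative assumptions rather than the authors' released implementation, which is available at the repository linked above.

```python
# Minimal sketch of a SimMMDG-style training objective (assumptions noted below).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityBranch(nn.Module):
    """Encode one modality and split the embedding into shared / specific halves."""

    def __init__(self, in_dim: int, feat_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, 2 * feat_dim))
        self.feat_dim = feat_dim

    def forward(self, x):
        z = self.encoder(x)
        return z[:, :self.feat_dim], z[:, self.feat_dim:]  # shared, specific


def supervised_contrastive_loss(features, labels, temperature=0.07):
    """Standard SupCon loss (single view per sample) on L2-normalised features."""
    features = F.normalize(features, dim=1)
    logits = features @ features.t() / temperature
    n = features.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=features.device)
    logits = logits.masked_fill(self_mask, float('-inf'))  # drop self-pairs
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_count = pos_mask.sum(1).clamp(min=1)
    mean_log_prob_pos = log_prob.masked_fill(~pos_mask, 0.0).sum(1) / pos_count
    return -mean_log_prob_pos.mean()


def simmmdg_style_losses(video_x, audio_x, labels,
                         video_branch, audio_branch, translator, margin=1.0):
    """translator: any nn.Module mapping concatenated video features to the
    audio feature space, e.g. nn.Linear(2 * feat_dim, 2 * feat_dim)."""
    v_shared, v_spec = video_branch(video_x)
    a_shared, a_spec = audio_branch(audio_x)

    # (1) supervised contrastive loss on the modality-shared features of both modalities
    shared = torch.cat([v_shared, a_shared], dim=0)
    con_loss = supervised_contrastive_loss(shared, torch.cat([labels, labels]))

    # (2) distance constraint: keep modality-specific features at least `margin`
    #     away from the shared ones to promote diversity (margin form is an assumption)
    dist_loss = (F.relu(margin - (v_spec - v_shared).norm(dim=1)).mean()
                 + F.relu(margin - (a_spec - a_shared).norm(dim=1)).mean())

    # (3) cross-modal translation: regress audio features from video features
    #     (MSE objective and detaching the target are assumptions)
    v_full = torch.cat([v_shared, v_spec], dim=1)
    a_full = torch.cat([a_shared, a_spec], dim=1)
    trans_loss = F.mse_loss(translator(v_full), a_full.detach())

    return con_loss, dist_loss, trans_loss
```

At training time these three terms would typically be added, with tunable weights, to a standard classification loss on the concatenated shared and specific features of all modalities; the weighting and classifier design are not specified by the summary above.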
Related papers
- Advances in Multimodal Adaptation and Generalization: From Traditional Approaches to Foundation Models [43.5468667825864]
This survey provides the first comprehensive review of advances in multimodal adaptation and generalization, from traditional approaches to foundation models.
It covers: (1) Multimodal domain adaptation; (2) Multimodal test-time adaptation; (3) Multimodal domain generalization; (4) Domain adaptation and generalization with the help of multimodal foundation models; and (5) Adaptation of multimodal foundation models.
arXiv Detail & Related papers (2025-01-30T18:59:36Z) - Towards Modality Generalization: A Benchmark and Prospective Analysis [56.84045461854789]
This paper introduces Modality Generalization (MG), which focuses on enabling models to generalize to unseen modalities.
We propose a comprehensive benchmark featuring multi-modal algorithms and adapt existing methods that focus on generalization.
Our work provides a foundation for advancing robust and adaptable multi-modal models, enabling them to handle unseen modalities in realistic scenarios.
arXiv Detail & Related papers (2024-12-24T08:38:35Z) - Towards Multimodal Open-Set Domain Generalization and Adaptation through Self-supervision [9.03028904066824]
We introduce a novel approach to address Multimodal Open-Set Domain Generalization for the first time, utilizing self-supervision.
We propose two innovative multimodal self-supervised pretext tasks: Masked Cross-modal Translation and Multimodal Jigsaw Puzzles.
We extend our approach to also tackle the Multimodal Open-Set Domain Adaptation problem, especially in scenarios where unlabeled data from the target domain is available.
arXiv Detail & Related papers (2024-07-01T17:59:09Z) - U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z) - All in One Framework for Multimodal Re-identification in the Wild [58.380708329455466]
A multimodal learning paradigm for ReID is introduced, referred to as All-in-One (AIO).
AIO harnesses a frozen pre-trained big model as an encoder, enabling effective multimodal retrieval without additional fine-tuning.
Experiments on cross-modal and multimodal ReID reveal that AIO not only adeptly handles various modal data but also excels in challenging contexts.
arXiv Detail & Related papers (2024-05-08T01:04:36Z) - Unified Multi-modal Unsupervised Representation Learning for
Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation
Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - Deep Multimodal Fusion for Generalizable Person Re-identification [15.250738959921872]
DMF is a Deep Multimodal Fusion network for general scenarios of the person re-identification task.
Rich semantic knowledge is introduced to assist in feature representation learning during the pre-training stage.
A realistic dataset is adopted to fine-tune the pre-trained model for distribution alignment with the real world.
arXiv Detail & Related papers (2022-11-02T07:42:48Z) - Exploiting modality-invariant feature for robust multimodal emotion
recognition with missing modalities [76.08541852988536]
We propose to use invariant features for a missing-modality imagination network (IF-MMIN).
We show that the proposed model outperforms all baselines and consistently improves the overall emotion recognition performance under uncertain missing-modality conditions.
arXiv Detail & Related papers (2022-10-27T12:16:25Z)