Towards Multimodal Open-Set Domain Generalization and Adaptation through Self-supervision
- URL: http://arxiv.org/abs/2407.01518v1
- Date: Mon, 1 Jul 2024 17:59:09 GMT
- Title: Towards Multimodal Open-Set Domain Generalization and Adaptation through Self-supervision
- Authors: Hao Dong, Eleni Chatzi, Olga Fink
- Abstract summary: We introduce a novel approach to address Multimodal Open-Set Domain Generalization for the first time, utilizing self-supervision.
We propose two innovative multimodal self-supervised pretext tasks: Masked Cross-modal Translation and Multimodal Jigsaw Puzzles.
We extend our approach to also tackle the Multimodal Open-Set Domain Adaptation problem, especially in scenarios where unlabeled data from the target domain is available.
- Score: 9.03028904066824
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The task of open-set domain generalization (OSDG) involves recognizing novel classes within unseen domains, which becomes more challenging with multiple modalities as input. Existing works have only addressed unimodal OSDG within the meta-learning framework, without considering multimodal scenarios. In this work, we introduce a novel approach to address Multimodal Open-Set Domain Generalization (MM-OSDG) for the first time, utilizing self-supervision. To this end, we introduce two innovative multimodal self-supervised pretext tasks: Masked Cross-modal Translation and Multimodal Jigsaw Puzzles. These tasks facilitate the learning of multimodal representative features, thereby enhancing generalization and open-class detection capabilities. Additionally, we propose a novel entropy weighting mechanism to balance the loss across different modalities. Furthermore, we extend our approach to also tackle the Multimodal Open-Set Domain Adaptation (MM-OSDA) problem, especially in scenarios where unlabeled data from the target domain is available. Extensive experiments conducted under MM-OSDG, MM-OSDA, and Multimodal Closed-Set DG settings on the EPIC-Kitchens and HAC datasets demonstrate the efficacy and versatility of the proposed approach. Our source code is available at https://github.com/donghao51/MOOSA.
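The abstract names an entropy weighting mechanism but does not spell out its formula. The following is a minimal sketch, assuming one plausible form: each modality's classification loss is reweighted by the inverse entropy of its predictions, so confident modalities dominate the combined loss. Function names and the exact weighting rule are assumptions; the authoritative implementation is at https://github.com/donghao51/MOOSA.

```python
# Hypothetical sketch of entropy-based weighting over per-modality losses;
# the exact MOOSA formulation may differ.
import torch
import torch.nn.functional as F

def entropy(logits: torch.Tensor) -> torch.Tensor:
    """Batch-averaged Shannon entropy of the softmax predictions."""
    p = F.softmax(logits, dim=-1)
    return -(p * p.clamp_min(1e-8).log()).sum(dim=-1).mean()

def entropy_weighted_loss(logits_per_mod: dict, labels: torch.Tensor) -> torch.Tensor:
    """Combine per-modality cross-entropy losses, down-weighting uncertain modalities."""
    mods = list(logits_per_mod.values())
    ents = torch.stack([entropy(l) for l in mods])        # (M,) uncertainty per modality
    weights = 1.0 / (ents + 1e-8)                         # confident modality -> large weight
    weights = weights / weights.sum()                     # normalize to sum to 1
    losses = torch.stack([F.cross_entropy(l, labels) for l in mods])
    return (weights.detach() * losses).sum()

# e.g. loss = entropy_weighted_loss({"video": v_logits, "audio": a_logits}, labels)
```

Detaching the weights prevents the optimizer from lowering the loss by merely making predictions more confident; whether MOOSA handles this the same way is best checked against the released code.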
Related papers
- Advances in Multimodal Adaptation and Generalization: From Traditional Approaches to Foundation Models [43.5468667825864]
This survey provides the first comprehensive review of advances in multimodal adaptation and generalization, spanning traditional approaches to foundation models.
It covers: (1) Multimodal domain adaptation; (2) Multimodal test-time adaptation; (3) Multimodal domain generalization; (4) Domain adaptation and generalization with the help of multimodal foundation models; and (5) Adaptation of multimodal foundation models.
arXiv Detail & Related papers (2025-01-30T18:59:36Z)
- Towards Modality Generalization: A Benchmark and Prospective Analysis [56.84045461854789]
This paper introduces Modality Generalization (MG), which focuses on enabling models to generalize to unseen modalities.
We propose a comprehensive benchmark featuring multi-modal algorithms and adapt existing methods that focus on generalization.
Our work provides a foundation for advancing robust and adaptable multi-modal models, enabling them to handle unseen modalities in realistic scenarios.
arXiv Detail & Related papers (2024-12-24T08:38:35Z)
- From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons [85.99268361356832]
We introduce a process for adapting an MLLM into a Generalist Embodied Agent (GEA).
GEA is a single unified model capable of grounding itself across varied domains through a multi-embodiment action tokenizer.
Our findings reveal the importance of training with cross-domain data and online RL for building generalist agents.
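The entry names a multi-embodiment action tokenizer without describing it. As a deliberately simple illustration of what such a tokenizer could look like, the sketch below uniformly bins continuous actions from embodiments with different action spaces into one shared discrete vocabulary; the bin count, bounds, and function names are illustrative assumptions, not details from the paper.

```python
# Hypothetical multi-embodiment action tokenizer: map continuous actions from
# any embodiment into one shared discrete vocabulary of token ids.
import numpy as np

N_BINS = 256  # shared discrete action vocabulary size (assumed)

def tokenize_action(action: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Uniformly bin each action dimension into [0, N_BINS) token ids."""
    norm = (action - low) / (high - low)                  # scale to [0, 1]
    return np.clip((norm * N_BINS).astype(int), 0, N_BINS - 1)

# A 7-DoF arm and a 2-DoF navigation agent share the same vocabulary:
arm_tokens = tokenize_action(np.zeros(7), low=-np.ones(7), high=np.ones(7))
nav_tokens = tokenize_action(np.array([0.5, -0.2]), low=-np.ones(2), high=np.ones(2))
```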
arXiv Detail & Related papers (2024-12-11T15:06:25Z)
- One Framework to Rule Them All: Unifying Multimodal Tasks with LLM Neural-Tuning [16.96824902454355]
We propose a unified framework that concurrently handles multiple tasks and modalities.
In this framework, all modalities and tasks are represented as unified tokens and trained using a single, consistent approach.
We present a new benchmark, MMUD, which includes samples annotated with multiple task labels.
We demonstrate the ability to handle multiple tasks simultaneously in a streamlined and efficient manner.
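To make the "unified tokens" idea concrete, here is a hypothetical sketch in which each modality is projected into a shared embedding space and concatenated, together with a learned task token, into a single sequence for one shared transformer. All dimensions and module names are assumptions, not the paper's.

```python
# Hypothetical unified tokenization: every modality becomes tokens in one
# shared space, so a single model can be trained consistently across tasks.
import torch
import torch.nn as nn

class UnifiedTokenizer(nn.Module):
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.image_proj = nn.Linear(768, d_model)   # e.g. ViT patch features (assumed dim)
        self.text_proj = nn.Linear(300, d_model)    # e.g. word embeddings (assumed dim)
        self.task_embed = nn.Embedding(8, d_model)  # one learned token per task

    def forward(self, image_feats, text_feats, task_id):
        task_tok = self.task_embed(task_id).unsqueeze(1)         # (B, 1, D)
        tokens = torch.cat(
            [task_tok, self.image_proj(image_feats), self.text_proj(text_feats)],
            dim=1,
        )                                                        # (B, 1 + N_img + N_txt, D)
        return tokens  # fed to one shared transformer
```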
arXiv Detail & Related papers (2024-08-06T07:19:51Z)
- Multimodal Instruction Tuning with Conditional Mixture of LoRA [51.58020580970644]
This paper introduces a novel approach that integrates multimodal instruction tuning with Low-Rank Adaptation (LoRA).
It innovates upon LoRA by dynamically constructing low-rank adaptation matrices tailored to the unique demands of each input instance.
Experimental results on various multimodal evaluation datasets indicate that MixLoRA outperforms conventional LoRA at the same or even higher ranks.
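The summary says the low-rank matrices are constructed dynamically per input instance. One plausible reading, sketched below with assumed names and shapes, blends a pool of low-rank factor "experts" using input-conditioned routing weights; MixLoRA's actual composition scheme may differ.

```python
# Speculative sketch of instance-conditioned LoRA: a router blends expert
# low-rank factors into a per-instance adaptation of a frozen linear layer.
import torch
import torch.nn as nn

class ConditionalLoRALinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, rank: int = 8, n_experts: int = 4):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)            # pretrained weight, kept frozen
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(n_experts, rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, d_out, rank))
        self.router = nn.Linear(d_in, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, d_in)
        w = torch.softmax(self.router(x), dim=-1)         # (B, n_experts) routing weights
        A = torch.einsum("be,erd->brd", w, self.A)        # instance-specific down-projection
        B = torch.einsum("be,eor->bor", w, self.B)        # instance-specific up-projection
        delta = torch.einsum("bor,brd,bd->bo", B, A, x)   # low-rank update per instance
        return self.base(x) + delta
```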
arXiv Detail & Related papers (2024-02-24T20:15:31Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
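A rough sketch of single-stream early fusion in this spirit: the three skeleton-derived inputs (joint, motion, bone) are embedded, tagged with a modality embedding, and encoded by one shared transformer rather than one backbone per modality. Sizes and names here are assumptions.

```python
# Hypothetical single-stream early-fusion encoder for skeleton modalities.
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    def __init__(self, in_dim: int = 150, d_model: int = 256, n_layers: int = 4):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_model)        # shared per-frame embedding
        self.mod_embed = nn.Embedding(3, d_model)      # joint / motion / bone tags
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, joint, motion, bone):            # each: (B, T, in_dim)
        seqs = []
        for i, x in enumerate((joint, motion, bone)):
            tag = self.mod_embed.weight[i]             # (d_model,) modality tag
            seqs.append(self.embed(x) + tag)           # broadcast over (B, T)
        fused = torch.cat(seqs, dim=1)                 # (B, 3T, d_model)
        return self.encoder(fused).mean(dim=1)         # one single-stream pass
```

The design point is that a single encoder jointly attends across all modalities instead of maintaining separate per-modality backbones that are fused late.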
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- SimMMDG: A Simple and Effective Framework for Multi-modal Domain Generalization [13.456240733175767]
SimMMDG is a framework to overcome the challenges of achieving domain generalization in multi-modal scenarios.
We employ supervised contrastive learning on the modality-shared features to ensure they possess joint properties, and impose distance constraints on the modality-specific features to promote diversity.
Our framework is theoretically well-supported and achieves strong performance in multi-modal DG on the EPIC-Kitchens dataset and the novel Human-Animal-Cartoon dataset.
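The sentence above names two losses. Below is a hedged sketch of both: a standard supervised contrastive loss applied to modality-shared features, and a margin-based distance constraint. SimMMDG's exact formulation may differ, so treat this as one plausible instantiation.

```python
# Sketch of (1) supervised contrastive loss on shared features and
# (2) a margin keeping modality-specific features away from shared ones.
import torch
import torch.nn.functional as F

def supcon_loss(shared: torch.Tensor, labels: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Supervised contrastive loss over L2-normalized shared features (B, D)."""
    z = F.normalize(shared, dim=-1)
    sim = z @ z.t() / tau                                       # (B, B) similarities
    self_mask = torch.eye(len(z), device=z.device, dtype=torch.bool)
    pos = ((labels[:, None] == labels[None, :]) & ~self_mask).float()
    sim = sim - sim.max(dim=1, keepdim=True).values.detach()    # numerical stability
    exp = sim.exp().masked_fill(self_mask, 0.0)                 # exclude self-pairs
    log_prob = sim - exp.sum(dim=1, keepdim=True).log()
    return -(pos * log_prob).sum(1).div(pos.sum(1).clamp_min(1.0)).mean()

def distance_constraint(shared: torch.Tensor, specific: torch.Tensor,
                        margin: float = 1.0) -> torch.Tensor:
    """Penalize modality-specific features that fall within `margin` of the shared ones."""
    d = (shared - specific).norm(dim=-1)
    return F.relu(margin - d).mean()
```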
arXiv Detail & Related papers (2023-10-30T17:58:09Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
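The entry gives only the name and intent of the implicit manipulation query (IMQ). A speculative sketch: a set of learned query vectors cross-attends to a modality's token sequence to pool global contextual cues. Everything below (names, sizes) is an assumption.

```python
# Speculative sketch of learned queries that aggregate global context
# within one modality via cross-attention.
import torch
import torch.nn as nn

class ImplicitQueryPool(nn.Module):
    def __init__(self, d_model: int = 256, n_queries: int = 8, n_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:   # tokens: (B, N, d_model)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)               # queries attend to tokens
        return pooled                                          # (B, n_queries, d_model)
```

One such pool would plausibly run per modality, with the pooled query outputs then fused for the detection and grounding heads.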
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Generalizing Multimodal Variational Methods to Sets [35.69942798534849]
This paper presents a novel variational method on sets called the Set Multimodal VAE (SMVAE) for learning a multimodal latent space.
By modeling the joint-modality posterior distribution directly, the proposed SMVAE learns to exchange information between multiple modalities and compensate for the drawbacks caused by factorization.
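To illustrate what "modeling the joint-modality posterior directly" can mean in code, here is a loose sketch: each available modality is encoded, the set is pooled with a permutation-invariant aggregator, and a single Gaussian posterior is emitted. SMVAE's actual set aggregation is richer; this only conveys the interface.

```python
# Loose sketch of a direct joint-modality posterior: permutation-invariant
# pooling over a set of modality encodings, then one (mu, logvar) head.
import torch
import torch.nn as nn

class JointPosterior(nn.Module):
    def __init__(self, d_in: int = 128, d_latent: int = 32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU())
        self.head = nn.Linear(128, 2 * d_latent)

    def forward(self, modality_feats):                 # list of (B, d_in) tensors
        h = torch.stack([self.enc(x) for x in modality_feats], dim=0).mean(0)
        mu, logvar = self.head(h).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        return z, mu, logvar
```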
arXiv Detail & Related papers (2022-12-19T23:50:19Z)
- META: Mimicking Embedding via oThers' Aggregation for Generalizable Person Re-identification [68.39849081353704]
Domain generalizable (DG) person re-identification (ReID) aims to test across unseen domains without access to the target domain data at training time.
This paper presents a new approach called Mimicking Embedding via oThers' Aggregation (META) for DG ReID.
arXiv Detail & Related papers (2021-12-16T08:06:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.