Towards Modality Generalization: A Benchmark and Prospective Analysis
- URL: http://arxiv.org/abs/2412.18277v1
- Date: Tue, 24 Dec 2024 08:38:35 GMT
- Title: Towards Modality Generalization: A Benchmark and Prospective Analysis
- Authors: Xiaohao Liu, Xiaobo Xia, Zhuo Huang, Tat-Seng Chua
- Abstract summary: This paper introduces Modality Generalization (MG), which focuses on enabling models to generalize to unseen modalities.
We propose a comprehensive benchmark featuring multi-modal algorithms and adapt existing methods that focus on generalization.
Our work provides a foundation for advancing robust and adaptable multi-modal models, enabling them to handle unseen modalities in realistic scenarios.
- Score: 56.84045461854789
- License:
- Abstract: Multi-modal learning has achieved remarkable success by integrating information from various modalities, achieving superior performance in tasks like recognition and retrieval compared to uni-modal approaches. However, real-world scenarios often present novel modalities that are unseen during training due to resource and privacy constraints, a challenge current methods struggle to address. This paper introduces Modality Generalization (MG), which focuses on enabling models to generalize to unseen modalities. We define two cases: weak MG, where both seen and unseen modalities can be mapped into a joint embedding space via existing perceptors, and strong MG, where no such mappings exist. To facilitate progress, we propose a comprehensive benchmark featuring multi-modal algorithms and adapt existing methods that focus on generalization. Extensive experiments highlight the complexity of MG, exposing the limitations of existing methods and identifying key directions for future research. Our work provides a foundation for advancing robust and adaptable multi-modal models, enabling them to handle unseen modalities in realistic scenarios.
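The weak MG setting described in the abstract lends itself to a small illustration: frozen per-modality perceptors map both seen and unseen modalities into a joint embedding space, a head is trained only on seen-modality embeddings, and it is then evaluated on the unseen modality. The following is a minimal sketch of that setting; the encoder names, dimensions, and classification task are illustrative assumptions, not the paper's benchmark or implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the weak Modality Generalization (MG) setting:
# frozen "perceptors" map each modality into a joint embedding space,
# a head is trained on seen modalities only, then evaluated on an
# unseen modality. All names and dimensions are illustrative.

EMBED_DIM = 256

class Perceptor(nn.Module):
    """Stand-in for a pretrained, frozen modality encoder."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, EMBED_DIM)

    def forward(self, x):
        with torch.no_grad():          # perceptors stay frozen
            return self.proj(x)

# Seen modalities (e.g., image, audio) and one unseen modality (e.g., depth).
perceptors = {
    "image": Perceptor(in_dim=512),
    "audio": Perceptor(in_dim=128),
    "depth": Perceptor(in_dim=64),     # unseen during training
}

classifier = nn.Linear(EMBED_DIM, 10)  # shared head over the joint space
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(modality: str, x, y):
    """Train the head on embeddings from a *seen* modality."""
    z = perceptors[modality](x)
    loss = criterion(classifier(z), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def evaluate_unseen(x, y):
    """Weak MG test: classify embeddings of the unseen modality."""
    z = perceptors["depth"](x)
    preds = classifier(z).argmax(dim=-1)
    return (preds == y).float().mean().item()
```

In the strong MG case no such perceptor exists for the unseen modality, so this kind of direct mapping into the joint space is unavailable.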
Related papers
- Advances in Multimodal Adaptation and Generalization: From Traditional Approaches to Foundation Models [43.5468667825864]
This survey provides the first comprehensive review of advances from traditional approaches to foundation models.
It covers: (1) Multimodal domain adaptation; (2) Multimodal test-time adaptation; (3) Multimodal domain generalization; (4) Domain adaptation and generalization with the help of multimodal foundation models; and (5) Adaptation of multimodal foundation models.
arXiv Detail & Related papers (2025-01-30T18:59:36Z) - Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework [58.362064122489166]
This paper introduces the Cross-modal Few-Shot Learning task, which aims to recognize instances from multiple modalities when only a few labeled examples are available.
We propose a Generative Transfer Learning framework consisting of two stages: the first involves training on abundant unimodal data, and the second focuses on transfer learning to adapt to novel data.
Our findings demonstrate that GTL achieves superior performance compared to state-of-the-art methods across four distinct multi-modal datasets.
arXiv Detail & Related papers (2024-10-14T16:09:38Z) - Missing Modality Prediction for Unpaired Multimodal Learning via Joint Embedding of Unimodal Models [6.610033827647869]
In real-world scenarios, consistently acquiring complete multimodal data presents significant challenges.
This often leads to the issue of missing modalities, where data for certain modalities are absent.
We propose a novel framework integrating parameter-efficient fine-tuning of unimodal pretrained models with a self-supervised joint-embedding learning method.
arXiv Detail & Related papers (2024-07-17T14:44:25Z) - Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - SimMMDG: A Simple and Effective Framework for Multi-modal Domain Generalization [13.456240733175767]
SimMMDG is a framework to overcome the challenges of achieving domain generalization in multi-modal scenarios.
We employ supervised contrastive learning on modality-shared features to ensure they possess joint properties, and impose distance constraints on modality-specific features to promote diversity.
Our framework is theoretically well-supported and achieves strong performance in multi-modal DG on the EPIC-Kitchens dataset and the novel Human-Animal-Cartoon dataset.
arXiv Detail & Related papers (2023-10-30T17:58:09Z) - Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z) - MMG-Ego4D: Multi-Modal Generalization in Egocentric Action Recognition [73.80088682784587]
"Multimodal Generalization" (MMG) aims to study how systems can generalize when data from certain modalities is limited or even completely missing.
MMG consists of two novel scenarios designed to support security and efficiency considerations in real-world applications.
The proposed method introduces a new fusion module with modality dropout training, contrastive-based alignment training, and a novel cross-modal loss for better few-shot performance (a minimal illustrative sketch of modality dropout appears after this list).
arXiv Detail & Related papers (2023-05-12T03:05:40Z) - Deep Multimodal Fusion for Generalizable Person Re-identification [15.250738959921872]
DMF is a Deep Multimodal Fusion network for generalizable person re-identification.
Rich semantic knowledge is introduced to assist in feature representation learning during the pre-training stage.
A realistic dataset is adopted to fine-tune the pre-trained model for distribution alignment with real-world data.
arXiv Detail & Related papers (2022-11-02T07:42:48Z) - Exploiting modality-invariant feature for robust multimodal emotion recognition with missing modalities [76.08541852988536]
We propose to use invariant features for a missing-modality imagination network (IF-MMIN).
We show that the proposed model outperforms all baselines and consistently improves the overall emotion recognition performance under uncertain missing-modality conditions.
arXiv Detail & Related papers (2022-10-27T12:16:25Z)
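As a concrete picture of the modality dropout training mentioned in the MMG-Ego4D entry above, here is a minimal, hypothetical PyTorch sketch. It only illustrates the general idea of randomly dropping per-modality features before fusion so the fused representation stays usable when a modality is missing at test time; the module name, dimensions, and drop probability are assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class ModalityDropoutFusion(nn.Module):
    """Hypothetical fusion module with modality dropout (illustrative only).

    During training, each modality's projected features are independently
    zeroed out with probability p, so the fused representation remains
    informative when some modalities are absent at inference time.
    """
    def __init__(self, dims: dict, out_dim: int = 256, p: float = 0.5):
        super().__init__()
        self.p = p
        self.proj = nn.ModuleDict({m: nn.Linear(d, out_dim) for m, d in dims.items()})

    def forward(self, feats: dict):
        fused = []
        for name, x in feats.items():
            z = self.proj[name](x)
            if self.training and torch.rand(1).item() < self.p:
                z = torch.zeros_like(z)   # drop this modality for the batch
            fused.append(z)
        return torch.stack(fused, dim=0).mean(dim=0)

# Usage: fuse video and audio features; at test time either may be missing.
fusion = ModalityDropoutFusion({"video": 512, "audio": 128})
out = fusion({"video": torch.randn(8, 512), "audio": torch.randn(8, 128)})
```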